Synchronization of multiple camera videos using audio-fingerprinting and onsets

					                                       CHAPTER 1

       Many people shoot videos at social occasions like parties, concerts, weddings, vacations,
etc., using not only camcorders but also mobile phones, PDAs and digital-still cameras.
Generally, raw home videos are considered boring because they are long and lack visual appeal.
Moreover the recordings captured by multiple cameras in an event contain overlapping content.
Considering the total length of all the recordings, it is not only boring but also impractical to
watch them all. Multiple cameras provide representations of the event from different viewing
angles. Combining segments from different recordings introduces views from different angles
into one video stream. These multiple views and camera transitions give the video a sense of
dynamics. Therefore, such combined videos hold the viewer's attention longer than
single-camera videos and are perceived as more visually appealing. Similarly, the aesthetic
quality of the combined videos can be improved by selecting the best-quality audio-video
segments from different recordings. For example, at a concert, one camera captures a good view of
the stage while two other cameras capture clear audio and a good coverage of the audience. The
combination of these three recordings would create a more desirable video than each of the
separate recordings.

1.1 Need for new technology

       One of the first problems encountered while combining audiovisual segments from
different cameras is that the recordings require precise synchronization, also called time-
alignment. Even a slight misalignment between two videos results in discontinuity of the time
flow. In professional productions, the use of multiple cameras is very common. The
synchronization of the recordings is ensured by physically connecting the cameras to a device
called “jamsync” or by the use of a “clap” at each camera-take. A camera-take is the sequence of
frames captured by a video camera from the moment recording is started to the moment it is
stopped. However, in the case of home video recordings, control and coordination among different
cameras are not feasible. The cameras are set up independently and turned on and off at the will of
their users.

        In current applications involving multiple cameras, the synchronization offset between
the recordings is searched manually. The search is considered very time consuming and difficult.
It involves finding an instant in all the available recordings containing a distinctive object motion
or sound, for example, the sound of a clap or a step in dance. Then further careful observation in
both of the recordings is required to accurately locate the synchronization point, for example
when the audio signal peaks due to the clap. Automated synchronization is applicable in
different domains that involve multiple cameras and require interaction among the
recordings. Video editing tools such as Adobe Premiere Pro, Final Cut Pro, and Ulead, which
currently require manual synchronization, could offer automatic solutions in their existing
products. In stereo or multi-view coding standards, where synchronization is listed as a
requirement, automatic synchronization could be applicable to 3-D video and free-viewpoint
video technologies. Another useful application of automated synchronization would be for large
video repositories like YouTube, Google, and Yahoo! that contain multiple clips of an event
recorded by different people. Presently, the clips are not organized by event and
each clip can only be accessed individually. Automated synchronization of the multiple clips
available from an event would make it easier to manage the clips and watch them simultaneously.

        Other applications that use multiple recordings captured simultaneously in a room and
require synchronizing them, such as annotating official meeting videos or improving audio
quality using multiple recordings, can also benefit from automated
synchronization. In this paper we propose a novel automated synchronization approach to
synchronize multiple camera recordings. The approach is based on detecting and matching audio
and video features extracted from the recorded content. Depending on the features used, we
introduce two realizations of the proposed approach, using audio-fingerprints and onsets. We
assess them experimentally and recommend their usability based on practical cases.

1.2 Proposed approach

       The most intuitive and simple approach to synchronize two audio or video recordings
would be to compare the audio-visual signals and look for the best match. However, an object
recorded at the same time by multiple cameras may look or sound different because of camera
position (next to a light source, noisy surroundings), quality of the camera components (lens,
resolution, microphone), camera settings (white-point, gamma, audio gain), and user handling
(shaky hands, jerky movements). Therefore, the raw audio-video signals are not suitable for matching
purposes. Our approach to finding a synchronization offset between two recordings involves
extracting features and searching for an approximate match between the features corresponding
to the recordings. The features should represent time accurately and at high resolution, and be
compact in size and easy to extract and match. They should be robust against different camera
characteristics and surrounding noise. Furthermore, the features should not be case-specific but
applicable to recordings from various situations such as weddings, concerts and official
meetings.

       Based on these considerations we selected two features in the audio domain: onsets and
fingerprints. In comparing the features, it is very likely that multiple matches occur, which may
result in different synchronization offsets. To determine the most reliable synchronization offset,
a voting scheme is applied among the possible offsets. In cases where there are more than two
recordings, the synchronization offset is calculated by comparing all possible pairs. The
calculated synchronization offsets were checked against the ground truth offsets computed
manually by listening to and watching the recordings. The proposed synchronization approach is
applicable to recordings with different frame rates. However, depending on the accuracy required
by the application, the recordings must meet a minimum frame rate.
Since a video frame represents a sampled time instant, the synchronization offset accuracy is
determined by the lowest frame rate among the given recordings. For example, to meet the
synchronization accuracy for lip-sync, the frame rate should be at least 22.2 frames per second.
The synchronization offset calculated between a pair of camera-takes from two recordings is
valid only for those camera-takes.
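
The frame-rate limit can be checked with a few lines of Python. This is a minimal sketch that assumes the offset cannot be resolved more finely than one frame period of the slowest recording; the function names are illustrative and the roughly 45 ms figure is simply the frame period implied by 22.2 frames per second.

```python
def frame_period_ms(fps):
    """Duration of one video frame in milliseconds."""
    if fps <= 0:
        raise ValueError("frame rate must be positive")
    return 1000.0 / fps


def offset_resolution_ms(frame_rates):
    """Coarsest frame period among the recordings (assumed accuracy bound)."""
    return frame_period_ms(min(frame_rates))


print(round(frame_period_ms(22.2), 1))               # 45.0
print(round(offset_resolution_ms([30, 25, 15]), 1))  # 66.7
```

In the second call the 15 fps recording determines the achievable accuracy, which is coarser than the lip-sync figure above.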

                                        CHAPTER 2

2.1 General Explanation

       Fingerprint systems are over one hundred years old. In 1893 Sir Francis Galton was the
first to “prove” that no two fingerprints of human beings were alike. Approximately 10 years
later Scotland Yard accepted a system designed by Sir Edward Henry for identifying
fingerprints of people. This system relies on the pattern of dermal ridges on the fingertips and
still forms the basis of all “human” fingerprinting techniques of today. This type of forensic
“human” fingerprinting system has however existed for longer than a century, as 2000 years ago
Chinese emperors were already using thumbprints to sign important documents. The implication
is that those emperors already realized that every fingerprint was unique. Conceptually a
fingerprint can be seen as a “human” summary or signature that is unique for every human being.
It is important to note that a human fingerprint differs from a textual summary in that it does
not allow the reconstruction of other aspects of the original. For example, a human fingerprint
does not convey any information about the color of the person's hair or eyes. Recent years have
seen a growing scientific and industrial interest in computing fingerprints of multimedia objects.

       Audio-fingerprints are compact and accurate representations of an audio object. In cases
that require comparisons between two audio objects, fingerprints, which are small by design, can
be used instead of using the audio objects themselves, which are typically large in size. This
allows simple comparison and precise temporal match. The prime objective of multimedia
fingerprinting is an efficient mechanism to establish the perceptual equality of two multimedia
objects: not by comparing the (typically large) objects themselves, but by comparing the
associated fingerprints (small by design). In most systems using fingerprinting technology, the
fingerprints of a large number of multimedia objects, along with their associated meta-data (e.g.
name of artist, title and album) are stored in a database. The fingerprints serve as an index to the
meta-data. The meta-data of unidentified multimedia content are then retrieved by computing a
fingerprint and using this as a query in the fingerprint/meta-data database. The advantage of
using fingerprints instead of the multimedia content itself is three-fold:

1. Reduced memory/storage requirements as fingerprints are relatively small;
2. Efficient comparison as perceptual irrelevancies have already been removed from fingerprints;
3. Efficient searching as the dataset to be searched is smaller.

       As can be concluded from above, a fingerprint system generally consists of two
components: a method to extract fingerprints and a method to efficiently search for matching
fingerprints in a fingerprint database.

2.2 Audio Fingerprint System Parameters

   Having a proper definition of an audio fingerprint we now focus on the different parameters
of an audio fingerprint system. The main parameters are:

      Robustness: can an audio clip still be identified after severe signal degradation? In
       order to achieve high robustness the fingerprint should be based on perceptual features
       that are invariant (at least to a certain degree) with respect to signal degradations.
       Preferably, severely degraded audio still leads to very similar fingerprints. The false
       negative rate is generally used to express the robustness. A false negative occurs when
       the fingerprints of perceptually similar audio clips are too different to lead to a positive
       match.

      Reliability: how often is a song incorrectly identified? For example, “Rolling Stones – Angie”
       being identified as “Beatles – Yesterday”. The rate at which this occurs is usually
       referred to as the false positive rate.

      Fingerprint size: how much storage is needed for a fingerprint? To enable
       fast searching, fingerprints are usually stored in RAM. Therefore the fingerprint
       size, usually expressed in bits per second or bits per song, determines to a large degree
       the memory resources that are needed for a fingerprint database server.

      Granularity: how many seconds of audio are needed to identify an audio clip?
       Granularity is a parameter that can depend on the application. In some applications the
       whole song can be used for identification, in others one prefers to identify a song with
       only a short excerpt of audio.

      Search speed and scalability: how long does it take to find a fingerprint in a
       fingerprint database, and what if the database contains thousands and thousands of songs? For
       the commercial deployment of audio fingerprint systems, search speed and scalability are
       key parameters. Search speed should be in the order of milliseconds for a database
       containing over 100,000 songs using only limited computing resources (e.g. a few
       high-end PCs).

   These five basic parameters have a large impact on each other. For instance, if
one wants a lower granularity, one needs to extract a larger fingerprint to obtain the same
reliability. This is due to the fact that the false positive rate is inversely related to the
fingerprint size. Another example: search speed generally increases when one designs a
more robust fingerprint. This is due to the fact that a fingerprint search is a proximity
search, i.e. a similar (or the most similar) fingerprint has to be found. If the features are
more robust the proximity is smaller. Therefore the search speed can increase.

2.3 Audio fingerprint extraction

       Audio fingerprints intend to capture the relevant perceptual features of audio. At the same
time extracting and searching fingerprints should be fast and easy, preferably with a small
granularity to allow usage in highly demanding applications (e.g. mobile phone recognition). A
few fundamental questions have to be addressed before starting the design and implementation
of such an audio fingerprinting scheme. The most prominent question to be addressed is: what
kind of features are the most suitable? A scan of the existing literature shows that the set of
relevant features can be broadly divided into two classes: the class of semantic features and the
class of non-semantic features.

       Typical elements in the former class are genre, beats-per-minute, and mood. These types
of features usually have a direct interpretation, and are actually used to classify music, generate
play-lists and more. The latter class consists of features that have a more mathematical nature
and are difficult for humans to 'read' directly from music. A typical element in this class is
Audio Flatness, which is proposed in MPEG-7 as an audio descriptor tool. For the work described
in this paper we have explicitly chosen to work with non-semantic features, for a number of
reasons:

   1. Semantic features don't always have a clear and unambiguous meaning, i.e. personal
       opinions differ over such classifications. Moreover, semantics may actually change over
       time. For example, music that was classified as hard rock 25 years ago may be viewed as
       soft listening today. This makes mathematical analysis difficult.

   2. Semantic features are in general more difficult to compute than non-semantic features.

   3. Semantic features are not universally applicable. For example, beats-per-minute does not
       typically apply to classical music.

2.4 Extraction Algorithm

       Most fingerprint extraction algorithms are based on the following approach. First the
audio signal is segmented into frames. For every frame a set of features is computed. Preferably
the features are chosen such that they are invariant (at least to a certain degree) to signal
degradations. Features are mapped into a more compact representation by using classification
algorithms, such as Hidden Markov Models, or quantization. The compact representation of a
single frame will be referred to as a sub-fingerprint. The global fingerprint procedure converts a
stream of audio into a stream of sub-fingerprints. One sub-fingerprint usually does not contain
sufficient data to identify an audio clip. The basic unit that contains sufficient data to identify an
audio clip (and therefore determines the granularity) will be referred to as a fingerprint block.
The proposed fingerprint extraction scheme is based on this general streaming approach. It
extracts 32-bit sub-fingerprints for every interval of 11.6 milliseconds. A fingerprint block
consists of 256 subsequent sub-fingerprints, corresponding to a granularity of only 3 seconds.
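
The figures quoted above fit together arithmetically, as the following short Python sketch shows. The constant and function names are illustrative; only the values 32 bits, 11.6 ms and 256 sub-fingerprints come from the text.

```python
SUB_FP_INTERVAL_S = 0.0116  # one sub-fingerprint every 11.6 ms
SUB_FP_BITS = 32            # bits per sub-fingerprint
BLOCK_LEN = 256             # sub-fingerprints per fingerprint block


def block_duration_s():
    """Audio duration covered by one fingerprint block (the granularity)."""
    return BLOCK_LEN * SUB_FP_INTERVAL_S


def block_size_bits():
    """Storage needed for one fingerprint block."""
    return BLOCK_LEN * SUB_FP_BITS


def bit_rate_bps():
    """Fingerprint size expressed in bits per second of audio."""
    return SUB_FP_BITS / SUB_FP_INTERVAL_S


print(round(block_duration_s(), 2))  # 2.97
print(block_size_bits())             # 8192
print(round(bit_rate_bps()))         # 2759
```

One block therefore covers just under 3 seconds of audio and occupies 1 KiB.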

                           Figure 1 Overview of fingerprint extraction scheme

       An overview of the scheme is shown in Figure 1. The audio signal is first segmented into
overlapping frames. The overlapping frames have a length of 0.37 seconds and are weighted by a
Hanning window with an overlap factor of 31/32. This strategy results in the extraction of one
sub-fingerprint for every 11.6 milliseconds. The most important perceptual audio features live in
the frequency domain. Therefore a spectral representation is computed by performing a Fourier
transform on every frame. Due to the sensitivity of the phase of the Fourier transform to different
frame boundaries and the fact that the Human Auditory System (HAS) is relatively insensitive to
phase, only the absolute value of the spectrum, i.e. the power spectral density, is retained. In
order to extract a 32-bit sub-fingerprint value for every frame, 33 non-overlapping frequency
bands are selected. These bands lie in the range from 300 Hz to 2000 Hz (the most relevant
spectral range for the HAS) and have a logarithmic spacing. The logarithmic spacing is chosen
because it is known that the HAS operates on approximately logarithmic bands (the so-called
Bark scale).
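
The framing and band layout can be sketched as follows. The frame length, overlap factor, frequency range and number of bands are taken from the text; the exact band edges are not given, so geometric (equal-ratio) spacing is an assumption made here for illustration, as are all names.

```python
FRAME_LEN_S = 0.37      # frame length in seconds
OVERLAP = 31 / 32       # overlap factor between consecutive frames
F_LO, F_HI = 300.0, 2000.0
N_BANDS = 33


def hop_s():
    """Time between the starts of two consecutive frames."""
    return FRAME_LEN_S * (1 - OVERLAP)


def band_edges():
    """N_BANDS + 1 edges with a constant ratio between neighbours (assumed)."""
    ratio = F_HI / F_LO
    return [F_LO * ratio ** (i / N_BANDS) for i in range(N_BANDS + 1)]


edges = band_edges()
print(round(hop_s() * 1000, 1))  # 11.6
print(len(edges) - 1)            # 33
```

The hop of 0.37 s / 32 reproduces the 11.6 ms sub-fingerprint interval, and 33 bands give the 32 energy differences needed for a 32-bit sub-fingerprint.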

         Experimentally it was verified that the sign of energy differences (simultaneously along
the time and frequency axes) is a property that is very robust to many kinds of processing. If we
denote the energy of band m of frame n by E(n,m) and the m-th bit of the sub-fingerprint of
frame n by F(n,m), the bits of the sub-fingerprint are formally defined as

              F(n,m) = 1   if E(n,m) − E(n,m+1) − (E(n−1,m) − E(n−1,m+1)) > 0
              F(n,m) = 0   otherwise

Figure 2 shows an example of 256 subsequent 32-bit sub-fingerprints (i.e. a fingerprint block),
extracted with the above scheme. A '1' bit corresponds to a white pixel and a '0' bit to a black
pixel.
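
A minimal Python sketch of the bit derivation follows. It assumes the usual formulation of the sign-of-energy-difference rule from the Philips fingerprint literature, in which bit m is 1 when E(n,m) − E(n,m+1) − (E(n−1,m) − E(n−1,m+1)) is positive. The function names and the choice of band 0 as the most significant bit are illustrative assumptions.

```python
N_BANDS = 33  # 33 band energies give 32 bits per frame


def sub_fingerprint(prev_energies, cur_energies):
    """Return a 32-bit integer from the band energies of two consecutive frames."""
    if len(prev_energies) != N_BANDS or len(cur_energies) != N_BANDS:
        raise ValueError("expected 33 band energies per frame")
    bits = 0
    for m in range(N_BANDS - 1):
        # Energy difference along frequency, then differenced along time.
        diff = (cur_energies[m] - cur_energies[m + 1]) - (
            prev_energies[m] - prev_energies[m + 1])
        # Assumption: band 0 becomes the most significant bit.
        bits = (bits << 1) | (1 if diff > 0 else 0)
    return bits


def fingerprint_stream(energy_frames):
    """One sub-fingerprint per pair of consecutive frames."""
    return [sub_fingerprint(p, c)
            for p, c in zip(energy_frames, energy_frames[1:])]


flat = [1.0] * N_BANDS
falling = [float(N_BANDS - m) for m in range(N_BANDS)]
print(hex(sub_fingerprint(flat, falling)))  # 0xffffffff
print(hex(sub_fingerprint(flat, flat)))     # 0x0
```

Because only the sign of the difference is kept, a uniform gain change applied to both frames leaves every bit unchanged.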

                                 Figure 2 Extracted Fingerprint blocks

         Figure 2 shows fingerprint blocks from an original CD and from the MP3 compressed
(32 kbps) version of the same excerpt. Ideally these two blocks should be identical,
but due to the compression some of the bits are retrieved incorrectly. These bit errors, which are
used as the similarity measure for our fingerprint scheme, are shown in black in the figure.

2.5 Audio Fingerprint matching

In order to find the synchronization offset between two recordings, fingerprint-blocks from the
first recording are compared with those of the second recording. The consecutive fingerprint-
blocks from the first recording consist of non-overlapping sub-fingerprints, whereas the
fingerprint-blocks from the second recording overlap by a factor of 255/256 to achieve maximum
synchronization accuracy. If there are N1 and N2 sub-fingerprints in the two recordings, the first
recording yields ⌊N1/256⌋ blocks and the second N2 − 255 blocks, so the total
number of fingerprint-block comparisons K is given by

                    K = ⌊N1/256⌋ × (N2 − 255)
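
The count of comparisons follows from the block layout described above: non-overlapping blocks of 256 sub-fingerprints in the first recording, and blocks that advance by one sub-fingerprint in the second. The sketch below derives the count under that reading; the function name is illustrative.

```python
BLOCK_LEN = 256


def num_block_comparisons(n1, n2, block_len=BLOCK_LEN):
    """Number of block pairs compared for n1 and n2 sub-fingerprints."""
    non_overlapping = n1 // block_len          # blocks in the first recording
    sliding = max(n2 - block_len + 1, 0)       # block start positions in the second
    return non_overlapping * sliding


print(num_block_comparisons(1024, 1000))  # 2980
```

A recording shorter than one block (fewer than 256 sub-fingerprints, about 3 seconds) yields no comparisons at all.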


                    Figure 3 Extracted Fingerprint blocks and the bit error between them

       Figure 3 shows the bit errors between fingerprint-blocks of two recordings from
different cameras. Figures 3(a) and 3(b) represent two fingerprint blocks: one is that of an original
music clip and the other is that of its MP3 version. The third block, Figure 3(c), shows the bit
errors between the first two.

       To find a match between two audio streams, a fingerprint-block consisting of 256
consecutive sub-fingerprints is used as the basic unit. Two fingerprint-blocks are considered to be
matching if the Hamming distance between them, i.e. the number of bit errors expressed as a bit
error rate (BER), is less than a threshold θ. It is proven theoretically and experimentally that
the false positive rate of matching fingerprints becomes 3.6 × 10^-20 with a threshold θ = 35%.
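
The block comparison can be sketched in Python as follows. Fingerprint blocks are assumed to be lists of 32-bit integers, one per sub-fingerprint; the function names are illustrative, and the default threshold of 0.35 is the value quoted above.

```python
def bit_error_rate(block_a, block_b, bits_per_sub_fp=32):
    """Fraction of differing bits between two equally long fingerprint blocks."""
    if len(block_a) != len(block_b) or not block_a:
        raise ValueError("blocks must be non-empty and of equal length")
    mask = (1 << bits_per_sub_fp) - 1
    # XOR marks the differing bits; counting them gives the Hamming distance.
    errors = sum(bin((a ^ b) & mask).count("1")
                 for a, b in zip(block_a, block_b))
    return errors / (bits_per_sub_fp * len(block_a))


def is_match(block_a, block_b, theta=0.35):
    """Two blocks match when their BER is below the threshold."""
    return bit_error_rate(block_a, block_b) < theta


zeros = [0] * 256
print(bit_error_rate(zeros, [0x0000FFFF] * 256))  # 0.5
print(is_match(zeros, [0x000000FF] * 256))        # True
```

Unrelated audio is expected to give a BER near 0.5, which is why a threshold well below that value separates matches from non-matches.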

               Figure 4 Calculated bit errors from different camera recordings of the same event

       In Figure 4, the darker colours show lower BER. The horizontal line on the colour bar
represents the threshold θ, such that the blocks darker than the threshold are considered a match.
The periodically appearing dark blocks along the diagonal line in Figure 4 indicate matches
between the two recordings.

       Figure 5 is a zoomed-in view of a seemingly dark block in Figure 4, which shows a
gradual change in BER along the consecutive fingerprint-blocks. The gradual change is caused
mainly by the overlapping sub-fingerprints in the consecutive fingerprint-blocks used for
matching, where there is little change between neighbouring fingerprint-blocks. The BER values
of multiple neighbouring blocks may fall below the threshold θ. As seen in Figures 4 and 5, there exist
multiple blocks that satisfy the matching condition of a BER below the fixed threshold.
Sometimes there may be outliers or incidents of false match, for example due to moments of
silence in the recordings.

                    Figure 5 A Zoomed in version of a synchronization offset in Figure 4

        Each of the matching blocks refers to a possible synchronization offset, represented by
Δj. If x and y denote the positions of the compared fingerprint-blocks in the first and the second
recording, the possible synchronization offsets and their corresponding BERs EΔj are
given by

                                      Δj = x − y   for all (x, y) with BER(x,y) ≤ θ
                                      EΔj = BER(x,y)

       To select the most reliable synchronization offset Δsync, we apply a voting scheme on the
unique offsets Δk out of all the possible synchronization offsets. The score of a unique
synchronization offset Δk is proportional to the number of times the offset occurs as a possible
synchronization offset and to the total sum of the differences between θ and the BERs
corresponding to the offset. The score calculation of a synchronization offset can be
represented as

              Score(Δk) = │Δk│ · Σ (θ − e),   summed over all BERs e corresponding to Δk

  Figure 6 An example showing the distribution of scores of different possible synchronization offsets

       │Δk│ is the number of times a synchronization offset Δk is repeated, which is used as a
weight factor for the score calculation. Since the score gets multiplied by the number of
occurrences of a synchronization offset, the offsets due to outliers get low scores as the chance of
their reoccurrence is very low. The difference between θ and BER represents the degree of a
match. Finally, the highest scoring Δk is selected as the synchronization offset Δsync.
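
The voting scheme can be sketched in Python as follows. The input format, a list of (x, y, BER) triples where x and y are block positions in the first and second recording, is an assumption made for illustration, as are the function name and the sample values.

```python
from collections import defaultdict


def select_offset(matches, theta=0.35):
    """Return (best_offset, scores) for candidate matches (x, y, ber)."""
    bers_by_offset = defaultdict(list)
    for x, y, ber in matches:
        if ber <= theta:                      # keep only matching blocks
            bers_by_offset[x - y].append(ber)
    if not bers_by_offset:
        return None, {}
    # Score = number of occurrences times the summed margin (theta - BER).
    scores = {offset: len(bers) * sum(theta - e for e in bers)
              for offset, bers in bers_by_offset.items()}
    best = max(scores, key=scores.get)
    return best, scores


candidates = [(10, 4, 0.20), (11, 5, 0.25), (12, 6, 0.30), (50, 3, 0.34)]
best, scores = select_offset(candidates)
print(best)  # 6
```

In this example the offset 6 occurs three times with comfortable margins, while the isolated match at offset 47 barely passes the threshold and receives a score close to zero.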

       The reliability of the audio-fingerprint match has been thoroughly described in the
literature. The method is proven to be robust under different amounts of audio noise and
distortions. Figure 6 shows the scores of different possible synchronization offsets. The large
difference between the highest score and the second highest score indicates a reliable selection
of the synchronization offset in the test data-set.

                                       CHAPTER 3
                                  ONSETS METHOD

3.1 General Explanation

       Onsets are the perceived starting points of an auditory event, marked by an increase
in signal power and changes in spectrum. They are mainly used for analyzing rhythm, such as
beat and tempo, in music. Onsets can be seen as flashes in audio. Since the multiple-camera
recordings may contain similar audio, it is expected that two recordings from an event will
contain correlated onsets. The onsets represent only positive changes in energy and not the
energy throughout the signal, as fingerprints do.

3.2 Onset extraction

       We have used a previously described onset extraction method. In this method, the audio signal is
divided into 24 bands based on the Equivalent-Rectangular-Bandwidth scale, a psychophysical
loudness-frequency scale. Each of these bands is analyzed in a sample window of 3 s, where an
energy measurement is generated for every 11.6 ms, representing an audio frame. Then the
difference in energy is calculated across consecutive frames. If the resulting difference is larger
than a threshold, the audio frame is assigned an onset bit 1, otherwise 0. The threshold is defined
heuristically on the basis of a perceptual test. The overall error rates of the method for
one-by-one onset detection range from 42.8% to 56.6%, which has been shown to be comparable to
existing state-of-the-art onset detection methods. The onsets from the first band of two synchronized
recordings are shown in Figure 7.
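
The thresholding step can be sketched for a single band as follows. This is a simplified illustration, not the referenced extraction method: it takes per-frame energies as given, ignores the 3 s analysis window, and uses a hypothetical threshold value.

```python
def onset_bits(energies, threshold):
    """One onset bit per frame: 1 where the energy rises by more than threshold."""
    if not energies:
        return []
    bits = [0]  # the first frame has no predecessor to compare with
    for prev, cur in zip(energies, energies[1:]):
        bits.append(1 if cur - prev > threshold else 0)
    return bits


print(onset_bits([1, 1, 5, 5, 2, 9], threshold=3))  # [0, 0, 1, 0, 0, 1]
```

Only rises in energy produce a bit; the drop from 5 to 2 is ignored, which matches the description of onsets as positive changes in energy.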

       In Figure 7, the horizontal axis shows the number of audio frames along time. The presence of an
onset in a frame is indicated by 1 on the vertical axis. The present implementation of the onset
extraction method allows calculation of onsets in any number of bands that is a factor
of 24. When extracting onsets in a lower number of bands, the energy computed in the 24 bands
is averaged according to the given number of bands. Due to the averaging of the energy of
multiple bands, the energy variation in each band is reduced. Therefore, the number of onsets per
band increases with the number of frequency bands. We extracted onsets in
different numbers of frequency bands (2, 4, 8, 12, and 24) and tested their performance on
synchronizing multiple-camera recordings.
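
The band reduction can be illustrated with a short Python sketch. The text states only that the 24 band energies are averaged according to the requested number of bands; grouping adjacent bands in equal-sized groups is an assumption, and the function name is illustrative.

```python
def merge_bands(band_energies, n_out):
    """Average adjacent bands so that len(band_energies) bands become n_out."""
    n_in = len(band_energies)
    if n_out <= 0 or n_in % n_out != 0:
        raise ValueError("n_out must be a positive factor of the band count")
    group = n_in // n_out
    return [sum(band_energies[i:i + group]) / group
            for i in range(0, n_in, group)]


frame = list(range(24))          # hypothetical energies of one frame
print(merge_bands(frame, 2))     # [5.5, 17.5]
print(len(merge_bands(frame, 8)))  # 8
```

Averaging twelve bands into one smooths out changes that occur in only a few of them, which is consistent with fewer onsets being detected per band when fewer bands are used.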

             Figure 7 Visual representation of the onsets from the first band of two synchronized
                                         recordings from an event.

3.3 Onset Matching

       The onset sequences from multiple frequency bands of a recording were compared with
the sequences from the corresponding bands of another recording using cross-correlation. The
method operates as a “sliding dot-product” between two onset sequences resulting in a sequence
of correlation coefficients of length 2m+1, where m is the length of the longest sequence. Each
possible synchronization offset is computed as the difference between the index at which the
maximum cross-correlation coefficient occurs and m+1. The value of the correlation coefficient
increases with the number of onset matches and decreases with the misses. Figure 8 shows the
cross-correlation coefficients between onset sequences from the first out of four frequency bands
corresponding to two recordings from an event.
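
The sliding dot-product can be sketched in plain Python as follows. Instead of the index arithmetic used in the text, this sketch stores each coefficient under its signed lag, which is an implementation choice made here for readability; the function names and the sample sequences are illustrative. The quadratic loop is adequate for short examples only.

```python
def cross_correlate(a, b):
    """Map each lag to sum_i a[i + lag] * b[i] over the overlapping samples."""
    corr = {}
    for lag in range(-(len(b) - 1), len(a)):
        total = 0
        for i, b_val in enumerate(b):
            j = i + lag
            if 0 <= j < len(a):
                total += a[j] * b_val
        corr[lag] = total
    return corr


def best_lag(a, b):
    """Lag with the largest coefficient; ties go to the smallest shift."""
    corr = cross_correlate(a, b)
    return max(corr, key=lambda lag: (corr[lag], -abs(lag)))


b = [1, 0, 0, 1, 0, 1, 0, 0]
a = [0, 0, 0] + b            # the same onsets, delayed by three frames
print(best_lag(a, b))        # 3
```

A positive lag means that the first sequence is delayed with respect to the second by that many frames.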

        A sharp peak in the coefficients suggests the sequences are correlated, while the
coefficients from uncorrelated sequences resemble noise. The correlation between the onset
sequences of corresponding frequency bands from two recordings results in a set of possible
synchronization offsets. To select the most reliable synchronization offset, we applied a scoring
scheme on all the unique possible synchronization offsets. The score is directly proportional to
the number of times the unique offset reoccurs. The score computation for a unique
synchronization offset Δk can be represented as

                                            Score(Δk) = │Δk│

                 Figure 8 Resulting coefficients of cross-correlation between onset sequences
                  from the first out of four frequency bands corresponding to two recordings

       The prominent peak indicates a possible synchronization offset between the
recordings. The highest scoring offset is selected as the synchronization offset, provided the score
is equal to at least half of the number of bands used in the onset extraction. This threshold on the
score was set after careful observation of the test runs to avoid false positives. For example, if there
are eight bands, an offset should score at least 4 to be selected as the synchronization point.

   Figure 9 Examples of possible synchronization offsets Δk and their scores Score(Δk) from the onset sequence correlation

        Figure 9 shows three typical examples of possible synchronization offsets and their
scores from the onset sequence correlation between 8 corresponding frequency bands from the
recordings of an event. Figure 9(a) shows successful computation of the synchronization
offset, where one of the possible offsets scores 5 while the rest of the offsets score 1. Figure
9(b) shows an unsuccessful synchronization, where different frequency bands result in different
offsets and all of them score 1. Figure 9(c) shows another successful computation of the
synchronization offset, where all frequency bands result in the same offset with the score
8. Hence, synchronization succeeds in (a) and in (c), but fails in (b).
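
The acceptance rule and the three cases of Figure 9 can be reproduced with a short Python sketch. It assumes one candidate offset per frequency band; the function name and the offset values are hypothetical, and only the scores (5, 1 and 8 out of 8 bands) follow the text.

```python
from collections import Counter


def select_onset_offset(band_offsets):
    """Return (offset, score); offset is None when the score is too low."""
    if not band_offsets:
        return None, 0
    offset, score = Counter(band_offsets).most_common(1)[0]
    # Accept only if at least half of the bands agree on the offset.
    if 2 * score >= len(band_offsets):
        return offset, score
    return None, score


print(select_onset_offset([12, 12, 12, 12, 12, 3, -7, 40]))  # (12, 5)
print(select_onset_offset([1, 2, 3, 4, 5, 6, 7, 8]))         # (None, 1)
print(select_onset_offset([12] * 8))                          # (12, 8)
```

If two offsets were each supported by exactly half of the bands, this sketch would return the one encountered first; the text does not say how such a tie should be resolved.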

        The difference in score between the synchronization offset and the second highest scoring
offset gives a measure of reliability of the computed synchronization offset. For example, the
synchronization offset computed in Figure 9(c) is more reliable than the one computed in Figure
9(a). In order to test the robustness of the onset cross-correlation method on synchronization and
the reliability of the computed synchronization offsets, we randomly added and deleted a
predefined number of onsets in two onset sequences corresponding to recordings of an event.
The onsets were extracted in eight frequency bands, where one sequence contained 70–584
(average 251) onsets per band and another sequence contained 195–612 (average 330) onsets per
band. The changes, due to addition and deletion, in the number of onsets were applied equally in
both sequences and the test was repeated over 2100 iterations for each amount of change. The
number of iterations was chosen to allow observing convergence in the scores.
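
The perturbation used in this kind of robustness test can be sketched as follows. The exact procedure is not specified in the text, so the sketch assumes that the amount of change is expressed as a fraction of the original number of onsets and that positions are drawn uniformly at random; all names are illustrative.

```python
import random


def delete_onsets(bits, fraction, rng):
    """Clear a random fraction (0 to 1) of the onset bits."""
    ones = [i for i, b in enumerate(bits) if b]
    drop = set(rng.sample(ones, round(fraction * len(ones))))
    return [0 if i in drop else b for i, b in enumerate(bits)]


def add_onsets(bits, fraction, rng):
    """Set random empty frames; fraction is relative to the existing onsets."""
    zeros = [i for i, b in enumerate(bits) if not b]
    wanted = round(fraction * sum(bits))
    new = set(rng.sample(zeros, min(wanted, len(zeros))))
    return [1 if i in new else b for i, b in enumerate(bits)]


rng = random.Random(0)                 # fixed seed for repeatable runs
original = [1, 0, 0, 0] * 50           # 50 onsets in 200 frames
print(sum(delete_onsets(original, 0.6, rng)))  # 20
print(sum(add_onsets(original, 1.0, rng)))     # 100
```

Repeating such perturbations many times and re-running the matching on each perturbed pair gives average scores and success counts of the kind reported for this test.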

              Figure 10. Scores obtained by the first and the second highest scoring offsets in an
                        event represented by a bold line and a thin line, respectively.

       Figure 10(a) shows the average scores obtained by the first and the second highest
scoring offsets for different amounts of deleted onsets. The highest score decreases and the
second highest score increases gradually with the amount of onsets
deleted, such that they become very close when 80% of the onsets are deleted. The reliability of
the synchronization offset is affected only when a large share of the onsets is deleted.

       Figure 10(b) shows the scores corresponding to the two highest scoring offsets when
different amounts of onsets are added. The highest score remains more than seven times the
second highest score in all cases. The reliability of the
synchronization offset is not influenced even when the number of onsets is doubled. The
reliability of the synchronization offset starts to decrease when 300% of the onsets are added.
                        Figure 11 Number of times the correct synchronization offset is obtained in an event

       Figure 11 shows the number of successful synchronizations over 2100 iterations when
different amounts of onsets are randomly added or deleted in the same onset sequences as used in
Figure 10. The synchronization was successful in all cases as long as no more than 60% of the
onsets were added or deleted. Adding up to 100% of the onsets had no effect on synchronization;
however, the number of synchronized cases dropped sharply when more onsets were deleted,
such that only 9% of the cases were synchronized when 90% of the onsets were deleted. The
robustness of the onset matching method is thus affected more by deleted onsets than by added
onsets when the rate of addition or deletion is in the range of 60% to 100%. When
similar tests were performed on different numbers of onset bands, we observed that the
robustness of the synchronization offset, both for adding and deleting onsets, increases with the
number of bands. Similarly the rate of successful synchronization of the recordings, both for
adding and deleting onsets, increases with the number of bands.

                                         CHAPTER 4

       The audio-based realizations provide the most reliable synchronization results. Audio
fingerprints represent the signal energy level in 32 bands, while onsets represent only the rise of
the signal energy in the given number of bands. Therefore, audio fingerprints are more
vulnerable to echoes and other additional noises. In the case of onsets, by contrast, if the level of
distortion in consecutive frames is about the same, the difference in energy may not be large
enough to influence the onsets. Onsets have further advantages. Since onsets use at most
24 bands, compared to 32 bands in fingerprints, the onset sequences are smaller in
size than the fingerprints. For example, the size of a 12-band onset sequence is 2.6 times
smaller than that of the fingerprints. Furthermore, onset sequences are more compressible
because the distribution of onsets, bit 1, is sparser than in fingerprints.

       Another advantage is that the onset based realization can be scaled according to the
available computational resources, since onset performance on synchronization improves with
the increase in the number of frequency bands. For example, in cases when the number of videos
to synchronize is very large, such as videos from the same event on YouTube or in video
management systems, it is recommended that the synchronization offset
is first computed on 4-band onsets and, if some videos cannot be synchronized, that a higher
number of bands is then used. In terms of reliability of a match, the audio-fingerprint is proven to be
highly reliable against different kinds of signal degradation and additional noise.

                                         CHAPTER 5

          The two realizations of the synchronization approach are based on audio-fingerprints and
onsets. In the fingerprint realization, we used the fingerprint extraction method described in
Chapter 2 and computed the synchronization offset based on bit error rate calculation. In the onset
realization, we used the onset extraction method described in Chapter 3 and tested the
performance of multiple bands of onsets on synchronization. The computation of the
synchronization offset was based on cross-correlating the corresponding bands of onsets
representing the recordings. The audio based realizations can be applied to any multiple-camera
recording that contains at least three seconds of common audio.

          The audio-onsets based realization is found to be the most successful for synchronizing
multiple camera videos. The performance of audio-onsets increases with the increase in the
number of bands. In our experiments, 12-band onsets were able to synchronize 29 out of 30 test
recordings. Increasing the number of bands to 24 did not further improve the result. The
audio-fingerprint based realization was successful in synchronizing 25 out of 30 test recordings and
failed in cases where the audio was of low quality due to noise and echo. In addition to better
performance, the onset sequences are smaller in size and more compressible than fingerprints.
The onset based realization can also be scaled according to the available computational
resources, since performance on synchronization improves with the increase in the number of
frequency bands. Such a scalable realization is useful in synchronizing a very large number of
videos.

                              6. REFERENCES

1. Synchronization of Multiple Camera Videos using Audio-visual features (IEEE
   Transactions on Multimedia, Vol. 12, No. 1, January 2010) Prarthana Shrestha, Mauro
   Barbieri, Hans Weda, and Dragan Sekulovski

2. Robust Frequency-based Audio Fingerprinting (Institut Telecom, Telecom ParisTech,
   CNRS-LTCI) Elsa Dupraz and Gael Richard

3. Performance of Philips Audio Fingerprinting under Desynchronisation (School of
   Computer Science and Informatics, University College Dublin, Ireland) Neil J. Hurley,
   Félix Balado, Elizabeth P. McCarthy, Guénolé C.M. Silvestre