					Chapter 5
Spatial and Temporal
Registration for Watermarking

Many watermarking applications may benefit from the availability of both the original
and the watermarked media content to perform detection. However, one must take into
account that the watermarked media might have undergone deformations such that a
direct correspondence between the original and watermarked signals is no longer possible.
Extra processing is therefore needed to benefit from the availability of both signals. For
video applications, besides the spatial deformations, one must also consider a possible
temporal structure modification of the watermarked content. In this work, we analyze
the specificities of the digital cinema application. In this context, most spatial
deformations originate from in situ analog acquisition process. Temporal distortions
include frame rate modifications, scene removal and temporal cropping. We propose a
method to perform an automatic registration of spatially and temporally deformed video
sequences. Our approach involves two phases. The first step consists in establishing a
correspondence between automatically detected key-frames in the two sequences to
achieve temporal alignment of the frames. The second step performs spatial registration
of the possibly geometrically distorted frames based on a constrained block-matching
algorithm. Different simulations show that the method can cope with realistic situations.

   Keywords: Registration, digital cinema, deformations.
The design of a specific watermarking scheme is very dependent on the
requirements of the application in which it has to be integrated. A particular
subclass of watermarking systems is composed of methods in which the
unwatermarked original content can be used to perform the watermark detection
(often called non-blind or non-oblivious methods). Access to the original may strongly
improve the efficiency of the detection. Many applications have been cited in which this
assumption can be made, among them digital cinema copy tracing.
    The availability of the original content is useful only if a perfect
synchronization of the two signals is achievable. However, most of the time, one
might expect that the watermarked copy has undergone several distortions such
that both the spatial and the temporal synchronizations with the original content
are not guaranteed. Extra processing is therefore needed to benefit from the
availability of the original unwatermarked signal.
    In video applications, one must also consider a possible temporal
deformation of the watermarked content. Such distortions include frame rate
modifications, scene removal and temporal cropping. To take advantage of the
availability of the original content, one must be able to recover the temporal
alignment of the video frames. The challenge is made more difficult by the
fact that those temporal deformations are most of the time combined with spatial
deformations.
    The study of registration methods for watermarking applications depends very
little on the watermarking scheme. Indeed, the distortions introduced by watermark
embedding are usually so small that they play no role in the registration process.
Subsequent distortions due to copying are much more severe than the modifications
due to watermark embedding. The detection scheme does, however, impose the
precision of synchronization recovery that must be achieved by the registration
operation.
    The characterization of the signals that need to be registered is, in
contrast, very specific to each watermarking application scenario. This is
also what differentiates the large number of registration algorithms [9]. Each
method is optimized based on the hypotheses that can be made about the processed
signals. One can distinguish two approaches to registration: area-based
methods and feature-based methods. In our application scenario, the
temporal structure of the signals can be roughly characterized, which is why we
propose a feature-based temporal registration. In contrast, the spatial
characteristics of the content are not well defined, and we therefore design an
area-based spatial registration method.
    In section 5.1 we describe our application scenario, digital cinema
copy tracing. Section 5.2 presents the temporal registration of distorted video
sequences, while section 5.3 outlines the spatial registration of video frames, which
assumes prior temporal alignment of the frames.

5.1    A case figure: Fingerprinting in the Digital Cinema
The motion picture industry is undergoing a thorough change due to the advent
of movie digitization. Several demonstrations throughout the world have
shown that the technology is mature enough to implement end-to-end digital cinema
systems. They have validated the use of digital movie servers, digital projectors,
digital movie transmission through satellites or fiber networks, efficient
compression algorithms and strong encryption algorithms. Among the last
technologies that remain to be demonstrated are the fingerprinting and the
conditional access systems which will take charge of the projection rights.
    The combination of cryptographic coding (ciphering) and source coding
(compression) has been described since the early work of Shannon[77]: he has
clearly established that one should compress the source before ciphering it. A
secure and efficient transmission system should therefore lead to a decoding as
depicted in Figure 5.1. Compared to the initial work of Shannon, new tools are
required for establishing the ability to trace usages of a work. The secret key for
deciphering the digital cinema flow should be strongly controlled by a digital
right management system. Each projection should also be identified uniquely,
including time, location, ..., by a fingerprinting system. In this paper, we
introduce the copyright threats in the DC scenario and the related requirements
for the performance of fingerprinting methods. The threats on the fingerprint
consist essentially in geometric and temporal deformations which are due to the
non-optimal acquisition process. We mention the conditions to be satisfied in
order to be able to compensate these transformations and make the fingerprint
retrieval possible.
    The implementation of such a decoder should be secure enough to avoid any
tampering. Hardware systems as depicted in Figure 5.2 are under development
with very strong security characteristics. In particular, the Common Criteria
defined at the ISO level [78] make it possible to define profiles and the threats
to be addressed. The existence of such a tamper-resistant device seems
mandatory to solve the problem of controlling the projection rights.

                                 Figure 5.1: Secure decoder (surviving block labels:
                                 Decompression, Fingerprinting, management).

    Fingerprinting and tamper-resistant decoders are not sufficient. The digital
cinema decoder has to be controlled by a strong and flexible conditional access. A
conditional access system is much more than movie encryption or decryption. It
also has to manage all the projection rights and entitlements that are exchanged
between distributors and exhibitors. In other words, it might influence the way
they do business. Distributors and exhibitors are therefore highly concerned by the
definition of such a system. Their requirements are quite different. Distributors
are mainly concerned with protecting the film against piracy and with detailed
audit trails of any unplanned projections. Exhibitors are more concerned with
system flexibility, in order to adapt the projection rights to the success of the film,
the practical screen availability, etc. Both require that this system does not
modify the actual business rules between distributors and exhibitors.
   Digital watermarking is a technology that hides data in multimedia
content in a persistent and imperceptible way. This remarkable functionality
has opened the door to the development of intelligent systems that need to
associate additional information with media content. An interesting feature of
watermarking is its independence from the coded representation and the format of
the content. There are already several application fields where watermarking
could play an important role, such as image indexing, audio-video
resynchronization or smart images [79].
     Even if the range of applications that could be covered by watermarking
technologies is large, most watermarking-based systems are dedicated
to the protection of intellectual property. Indeed, the capability to embed data
directly into multimedia content was originally desired in order to prove
ownership of this content [80, 81, 82, 83].
     Watermarking, as a copyright protection technology, can be used in a broader
scope. Several methods have been successfully implemented and integrated in
different copyright protection scenarios. Delaigle et al. and Kalker et al. have
designed broadcast monitoring systems [84, 85, 86], and Bloom et al. present a
candidate for the DVD copy protection system in [87]. In the domain of still
images, Herrigel et al. presented a system for trading images [88]. Each of these
scenarios has its own list of requirements for the design of an appropriate
watermarking method.
     Watermarking is progressively being considered for more and
more applications. Today, cinema is turning digital. As
happened in other domains, designers of digital cinema (DC) systems have to
face threatening copyright issues. Watermarking or, more accurately,
fingerprinting technologies seem to match the digital cinema needs in
terms of copyright protection. Of course, this new watermarking application has
its own list of requirements.
     In this paper, we introduce the copyright threats in the DC scenario and the
related requirements for the performance of watermarking/fingerprinting
methods. The threats on the watermark consist essentially in geometric attacks,
which we model as geometric transformations. We mention the conditions that must be
satisfied in order to compensate for these transformations and make
watermark retrieval possible. We then describe the functional model needed
to manage the different matters associated with the fingerprint.
5.1.1 Copyright protection in the digital cinema scenario
Digital cinema is the on-line distribution of digital movies from content providers
to movie theater servers, via satellite, optical fiber or other high-speed
communication lines. This distribution is done worldwide to national
distributors. National distributions are subject to national restrictions and
exclusivity. Moreover, dubbing sometimes has to be done at that level. Movie
theaters receive content (movies) from national distributors. They store it and
project the movie in one or more theaters under some contract conditions. Piracy
happens at two levels:
   • The first one is obvious and consists in direct bit-for-bit copies made from the
     storage device. The pirated tapes are then sold on the black market. This kind of
     piracy can be prevented by proper use of conditional access systems.
   • The second one is also the responsibility of the movie theater owners. It
     consists in letting a spectator film the projected movie with a handycam at the
     back of the theater. This one is very harmful because the copied movie is severely
     degraded and the distortions applied to the image drastically impede watermark
     detection.

     In this paper, we address this second kind of piracy. A solution is to
embed a watermark that allows content owners to identify the leak in the chain,
i.e. the movie theater that let somebody in with a camera. To achieve this goal,
a fingerprint has to be embedded in the movie theater; we call it an exhibition
fingerprint.
     Exhibition fingerprints are applied during each exhibition; they
do not exist in the distributed content. It would indeed be quite difficult to
manage the distribution of a different copy of the content to each movie
theater. Moreover, exhibition fingerprints can identify the circumstances of the
exhibition. The fingerprint should include identification of the theater as well as
the exhibition context. These data should fit in [...] to [...] bits.
    Exhibition fingerprint identification data may include:

        • Unique identification of playback equipment
        • Serial number
        • Date stamp
        • Time stamp
        • Playback source identification
        • Number of times the source has been played
        • Cryptographic authentication information

   This kind of watermarking requires real-time embedding schemes. This
constitutes a severe constraint given the data rate. Moreover, only very low distortion
of the media is tolerated, as perceptual fidelity is paramount. The most critical
operation is perceptual masking, as it often requires complex content analysis.
One could imagine preprocessing the movie before exhibition, but such an approach
would require additional storage and would probably lower the system
   It should also be studied whether watermarking schemes from different
projection equipment manufacturers could coexist. If so, would the scheme used
be determined by the movie theater or by the movie?
    Of course, to be effective, the means of applying the fingerprint should
be resistant to attempts to disable it. This requires placing the implementation
within a secure device. Moreover, the efficiency of fingerprints may require
integrating the embedding modules in a global protection system, with cryptography,
key management and conditional access.

5.1.2 Conditional Access
The conditional access design has to be based on the recommendations of the
SMPTE DC28 study group for the conditional access system for digital cinema
[89]. This group enforces today’s practice in Film Rental Agreements that are
continuously negotiated between distributors and exhibitors. While a classic
conditional access system will simply prevent unauthorized access to the content,
a new advanced conditional access system should include an enlarged set of
features:

        • Powerful rights management: the conditional access system offers more
          than the basic respect of the Film Rental Agreement. It allows the
          distributors and exhibitors to remotely renegotiate projection rights at any
          time without having to send the encrypted movie again.
        • End-to-end security: the projection rights are wrapped in
          entitlements that are sent to theaters through a channel independent from
          the one used to send the encrypted film. Entitlements are checked through
          smart cards. Films and keys never appear unencrypted. Films are only
          decrypted in the projector, which eliminates the chance to copy the film in
          clear on an intermediate machine.
        • Flexibility: while the system might enforce the Film Rental Agreement,
          theaters also have the possibility to update or bypass it. Theaters are
          provisioned with a number of test projections or exceptional unplanned
          projections. The focus of the CA system is the production of an audit trail
          that the exhibitor will have to explain later.
        • Simplicity: exhibitors do not need any knowledge of security to use the
          system. The system should be completely transparent for them, except for
          the presence of smart cards. On the distributor side, the system handles all
          the security aspects. The only visible part is that distributors have the
          possibility to define the projection rights and to choose among several
          encryption or hashing algorithms.
        • Customizability: the system should be based on a modular platform with
          standard interfaces. It is straightforward to replace one module by another
          in order to tune the system to the customer's needs.
        • Standards-based: the system should be an open system that extensively
          uses standardized elements. For example, the use of the XML representation
          offers the possibility to view all the metadata with a simple Web browser.
        • Renewability: the system implements the renewability of system parts,
          i.e. the easy and dynamic replacement of a system part in order to upgrade
          the system's security.

The goal of an efficient conditional access system is to implement the usual
conditions of today's Film Rental Agreements. Exhibitors and
distributors negotiate the projection rights together. Once the agreement is
established, the system ensures that it is respected while preserving
all the exhibitor's possibilities to react to unplanned events. This agreement can be
updated at any time if both parties require so. Figure 5.3 gives an overall view of
an advanced conditional access system.
     The system works with modules located in three different places: one on
the distributor side and two on the theater side. The transmission of the
movie and the management of projection rights are handled independently. The
distributor can at any time encrypt and package the
film and send it to exhibitors. The encrypted film is stored on the theater's central
server. At the same time, distributors and exhibitors can negotiate the Film Rental
Agreement. When the negotiation is concluded, the distributor encodes the
projection rights for a given period through user-friendly interfaces. The system
creates the entitlements, protects them and sends them to the exhibitor. The
exhibitor then plans the projections for the given period. The system checks whether the
planning is consistent with the available entitlements and stores it in a database.
A few minutes before the planned projection, the system checks whether the projection is
compatible with the available entitlements and with the projection history. If all
the conditions are met, the entitlements are processed to produce a new
entitlement specific to the projector. Depending on the smart card memory,
several distributors can use the same smart card, sharply reducing the number of
smart card switches while maintaining maximum security. At the time of the
projection, the new entitlements are sent with the film to the different players.
Inside the player, the key is decrypted in a secure module and used for
decrypting and playing the film. In the case of an exceptional or test
projection, the projection parameters are recorded and an audit trail is securely
reported later to the distributor.
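
The check performed before a projection can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `Entitlement` record, its fields and the `authorize_projection` function are invented for illustration and do not correspond to the SMPTE DC28 recommendations or to any real conditional access product. It only shows the logic described above: a planned projection is compared with the available entitlements and the projection history, and an exceptional or test projection is allowed but recorded in an audit trail.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Entitlement:
    """Hypothetical entitlement record (not a real DC28 data format)."""
    film_id: str
    valid_from: datetime
    valid_until: datetime
    max_projections: int


def authorize_projection(film_id, when, entitlements, history, audit_trail,
                         exceptional=False):
    """Return True if the projection may start.

    history is the list of film ids already projected; audit_trail collects
    the exceptional or test projections to be reported to the distributor.
    """
    already_shown = history.count(film_id)
    for ent in entitlements:
        covered = (ent.film_id == film_id
                   and ent.valid_from <= when <= ent.valid_until
                   and already_shown < ent.max_projections)
        if covered:
            history.append(film_id)
            return True
    if exceptional:
        # Not covered by any entitlement: allow it, but record the parameters.
        history.append(film_id)
        audit_trail.append((film_id, when, "exceptional"))
        return True
    return False
```

In this sketch the entitlement is never modified; the number of remaining projections is derived from the history, which keeps the rights and the usage log separate.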

5.1.3 Digital cinema distortion model
The cover media in this application has several specific characteristics. It is a very high
volume media. Indeed, its frame resolution currently reaches [...] by [...] and is
evolving toward a [...] by [...] resolution. The frame rate is 24 or 48 frames per second.
    What is of interest to us is the kind of attacks the fingerprint should be
resistant to. Basically, it should be resistant to the handycam attack, that is,
filming a projected movie with a consumer video
camera. This can introduce severe distortions. Fortunately, the characteristics of
the copied material are reasonably well determined. Indeed, one can consider that
the media will not undergo distortions other than those caused by the camera
acquisition. In fact, the author of the illegal copy has no reason to try to remove
the fingerprint, since it only engages the responsibility of the theater. It is,
however, almost impossible for current watermarking schemes to deal with this
kind of image manipulation. Therefore, the deformation will be
estimated by registration with the original content, and the inverse
deformation will then be applied to the illicit copy before performing watermark
extraction. The original content is considered available, since movies are
generally easy for their owners to identify.

Temporal distortions

Temporal distortions are mainly limited to the following operations: first, a
slightly modified and probably variable frame rate; second, a possible
removal of scenes or dropping of frames. However, these latter modifications should
be very unusual.
    Another distortion that can show up is flickering. It is a deformation specific
to the acquisition of video sequences. It appears in the copied movie because
frames are recorded when the projector shutter is half-open or even completely
shut. Moreover, as most acquisition devices use interlacing, two half frames are
recorded with a small delay. This might cause
parts of two different frames to appear in the recorded frame. The consequence
is a frame superposition in the copied movie, as can be seen in Figure
5.4. This phenomenon has an important effect on the key-frame detection used in
the temporal registration process. It also argues for slow temporal variations of the
watermark signal.
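
To make the frame superposition concrete, here is a minimal Python sketch of an interlaced acquisition. It is an illustrative assumption rather than a model taken from this chapter: frames are plain lists of rows of luminance values, the hypothetical function `interlace` takes the even rows from one projected frame and the odd rows from the next, and shutter effects are ignored.

```python
def interlace(frame_a, frame_b):
    """Build a recorded frame whose even rows come from frame_a and whose
    odd rows come from frame_b (frames are lists of rows of luminance)."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of rows")
    return [list(row_a) if r % 2 == 0 else list(row_b)
            for r, (row_a, row_b) in enumerate(zip(frame_a, frame_b))]
```

If `frame_a` is the last frame of a dark shot and `frame_b` the first frame of a bright one, the recorded frame mixes both contents, so a feature computed on it lies between those of the two shots. This is one way to see why a sharp cut in the original can appear as a softer transition in the copy.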

Spatial distortions

As we can expect most distortions to be unintentional, many assumptions
can be made about the spatial distortions. A very likely transformation is the
downsampling of the frame due to the resolution of the acquisition device. Even if some
spatial cropping occurs, the recorded frames will most probably contain a large
part of the original frame, including its center. No occlusion problem will arise,
unlike what is often the case in stereoscopic systems and motion analysis. The
deformation will therefore be modeled by continuous functions. The distortions to
deal with are limited to physical deformations which are determined by the geometry
of the projection and recording devices. This includes, however, complex geometrical
transforms due to imperfect optics and non-planar screen shape.
     The modeling of the non-linear transforms due to optics is quite complex and
is presented in detail in [90]. This kind of model is highly
non-linear and is very difficult to use efficiently. However, if we assume
that the camera is placed far at the back of the movie theater and not
too far from the center, the transforms can be modeled by
simplified models. This is not far from a real case, since pirates will try to maximize
the commercial value of copies and will therefore maximize their quality.
    The resulting pirate copies are thus subject to geometrical deformations
inherent to the acquisition process. The goal of the compensation technique is to
estimate this deformation on the basis of a deformation model. The
compensation technique tries to find the parameters of the model that optimize
the matching between the reconstructed and the original image. For this purpose,
we need to model the geometrical deformations due to the handycam copy.
    We can consider different families of deformation models. Increasing the
number of degrees of freedom of the model enables better matching with the
deformation but also increases the instability in the presence of noise. Here are the
most straightforward models one should consider to describe the deformation.

   • The affine transformation. There are six degrees of freedom in the affine
     transformation, which corresponds to rescaling, translations, rotations and
     shearing. The affine transformation preserves parallelism and ratios of
     distances between points along a line. An illustration is shown in Fig. 5.5.
     Mathematically, the transformation between the old coordinates $(x, y)$ and
     the new coordinates $(x', y')$ can be modeled by a matrix

     \[ \begin{pmatrix} x' \\ y' \end{pmatrix} =
        \begin{pmatrix} a_1 & a_2 & a_3 \\ a_4 & a_5 & a_6 \end{pmatrix}
        \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \tag{5.1} \]

   • The projective transformation. This transform is more generic since it has
     eight degrees of freedom. It describes the motion of a 3D planar surface. A
     square can be transformed into an arbitrary quadrilateral.
     Mathematically, the transformation between the old coordinates $(x, y)$ and
     the new coordinates $(x', y')$ can be modeled by the following expression

     \[ x' = \frac{a_1 x + a_2 y + a_3}{a_7 x + a_8 y + 1}, \qquad
        y' = \frac{a_4 x + a_5 y + a_6}{a_7 x + a_8 y + 1} \tag{5.2} \]


   • The Q-warping. With 17 degrees of freedom, it models the projection of a
     3D quadratic surface on a plane. This transformation can
     model the deformation caused by the screen curvature.

     [Equation (5.3)]

   In the last two models, the analytical expressions of the
   transformations do not depend linearly on the deformation parameters.
   Linear dependence would ease parameter estimation. This is why we
   introduce the following “non-physical” transformation model.
• The pseudo projective transformation. It does not correspond to any physical
  transformation. It is a linear approximation of the projective transform.

  [Equation (5.4)]

• The curved transformation. It models the optical transformations due to the
  screen geometry and the projector and camera lenses (e.g. short focal length). It
  is a very simplified model but it is quite a good approximation for small
  deformation amplitudes. Fig. 5.6 shows an illustration of this transform.
  Mathematically, the transformation between the old and the new coordinates
  can be modeled by the following expression

  [Equation (5.5)]
• The curved pseudo projective transformation. It is the combination of the
  two previous transformations and adds up to 12 parameters. We will use this model to
  approximate the distortion caused by camera acquisition in the movie
  theater.

  [Equation (5.6)]

   In order to get better insight into the deformation taking place, an acquisition
under real conditions was performed in a movie theater. Figure 5.7 shows the observed
deformation of a grid on the screen of the movie theater.
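
The two simplest models and the benefit of a model that is linear in its parameters can be sketched in a few lines of pure Python. The parameter ordering `(a1, ..., a6)` for the affine model and `(a1, ..., a8)` for the projective model is an assumption of this sketch, as are the function names. `fit_affine` estimates the affine parameters from point correspondences by ordinary least squares (normal equations); it is not the constrained block-matching registration proposed in this chapter, only an illustration of why linear dependence on the parameters makes estimation easy.

```python
def apply_affine(p, pt):
    """x' = a1*x + a2*y + a3,  y' = a4*x + a5*y + a6."""
    a1, a2, a3, a4, a5, a6 = p
    x, y = pt
    return (a1 * x + a2 * y + a3, a4 * x + a5 * y + a6)


def apply_projective(p, pt):
    """Affine numerators divided by the common term a7*x + a8*y + 1."""
    a1, a2, a3, a4, a5, a6, a7, a8 = p
    x, y = pt
    w = a7 * x + a8 * y + 1.0
    return ((a1 * x + a2 * y + a3) / w, (a4 * x + a5 * y + a6) / w)


def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= factor * M[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (M[r][3] - tail) / M[r][r]
    return x


def fit_affine(src, dst):
    """Least-squares affine parameters mapping the points src onto dst."""
    rows = [(x, y, 1.0) for x, y in src]
    # Normal equations: (X^T X) p = X^T u, solved once per output coordinate.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xtu = [sum(r[i] * u for r, (u, _) in zip(rows, dst)) for i in range(3)]
    xtv = [sum(r[i] * v for r, (_, v) in zip(rows, dst)) for i in range(3)]
    return tuple(solve3(xtx, xtu) + solve3(xtx, xtv))
```

Because the affine model is linear in its six parameters, the fit reduces to two small linear systems sharing the same matrix. The projective model with `a7 = a8 = 0` reduces to the affine one; with non-zero values, the division makes the dependence on the parameters non-linear, so a direct least-squares solution of this kind no longer applies.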

5.2 Temporal alignment of distorted video sequences
Our approach to achieve temporal alignment of distorted sequences consists in
establishing a correspondence between automatically detected key-frames in the
two sequences.
    In section 5.2.1 we describe our key-frame detection scheme, which is inspired
by existing methods dealing with scene cut localization and semantic
(MPEG7-like) scene classification. The main difference in our application is that
we do not need any semantics to be associated with those particular feature-frames.
We aim at finding a method to select feature-frames that is robust under spatial
and temporal deformations of the sequence. We show that a method based on the
analysis of luminance histograms meets these requirements.
    In section 5.2.2 we propose an alignment algorithm to achieve a
correspondence between pairs of key-frames in the original and the deformed
sequence. It uses both temporal and spatial information associated with the detected
key-frames. The algorithm is similar to a Viterbi tree search in which the
correspondence criterion takes into account the temporal distance between
key-frames and extracted frame characteristics. The algorithm is based on the
consideration that inverting or deleting a scene is not frequent and that the frame
rate is only slowly varying.
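
As a rough illustration of this kind of correspondence search, the following Python sketch aligns two lists of key-frames with a standard dynamic-programming sequence alignment. It is a simplified stand-in, not the Viterbi-like algorithm of section 5.2.2: each key-frame is reduced to a hypothetical `(time, feature)` pair with a scalar feature, both sequences are assumed to share roughly the same time origin, and the weights `skip_penalty` and `time_weight` are arbitrary illustrative values. Matches must preserve order (no scene inversion), and unmatched key-frames are allowed at a fixed cost.

```python
def align_key_frames(orig, copy, skip_penalty=1.0, time_weight=0.01):
    """Return index pairs (i, j) matching orig[i] with copy[j], in order.

    orig and copy are lists of (time_in_seconds, feature) sorted by time.
    """
    n, m = len(orig), len(copy)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                (t_o, f_o), (t_c, f_c) = orig[i - 1], copy[j - 1]
                # Match cost: feature difference plus temporal distance.
                c = (cost[i - 1][j - 1] + abs(f_o - f_c)
                     + time_weight * abs(t_o - t_c))
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "match"
            if i > 0 and cost[i - 1][j] + skip_penalty < cost[i][j]:
                # Key-frame of the original with no counterpart in the copy.
                cost[i][j], back[i][j] = cost[i - 1][j] + skip_penalty, "skip_orig"
            if j > 0 and cost[i][j - 1] + skip_penalty < cost[i][j]:
                # Spurious key-frame detected only in the copy.
                cost[i][j], back[i][j] = cost[i][j - 1] + skip_penalty, "skip_copy"
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_orig":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs
```

With four original key-frames and a copy in which the second one was missed, the sketch matches the remaining three in order and skips the missing one. A more faithful version would compare intervals between consecutive key-frames, so that a slowly varying frame rate or a temporal crop at the start of the copy does not bias the temporal term.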
    Section 5.2.3 presents the performance of the proposed method in different
test simulations.

5.2.1 Key-frame detection
Since 1996, key-frame detection has been the subject of considerable research.
Many different approaches exist. Some authors use compressed frames
(especially B and P frames in MPEG compression [91]), others work in the
uncompressed domain. Some use a probabilistic approach [92], others a neural
approach [93]. Detection can be performed in the spatial or a transformed domain,
without memory or considering the movie evolution [94].
    A key-frame is a frame which is significantly different from the surrounding
ones. For instance, a frame just after a shot boundary (scene cut) is a key-frame
because it is very different from the frames in the preceding shot. Most detected
key-frames are shot boundaries and flashes such as explosions.
    There are four common kinds of shot boundaries: hard cuts, fade-ins,
dissolves and wipes. Hard cuts are the easiest to detect. Notice that in our digital
cinema application, we do not need to restrict ourselves to shot boundaries. We can, for
example, also detect key-frames which are due to flashes within a shot. The
semantics associated with these frames are not important, but it is mandatory that they
be detected in both sequences (original and copy). One has to consider the
possibility that the frames are distorted in the copy sequence. The key-frame
detection scheme must be robust against such distortions.
    Shot boundary detection assumes that there is a change in the
visual content of the images between the two shots. The goal of the detection is
therefore to find high discontinuity values. The processing can be
decomposed into the following steps:

    1. Extract a feature from each frame in the video sequence.

    2. Use a metric, $z$, which measures and quantifies the difference between
       the feature at time $t$ and the feature at time $t + l$ ($l$ is the so-called
       "skip distance"). This discontinuity value $z(t, t+l)$ is the input of the
       detector.

    3. In the detector, the discontinuity value $z(t, t+l)$ is compared to a
       threshold $T$. If $z(t, t+l) > T$, frame number $t + l$ is considered
       a key-frame.

We can add to this model some a priori information, such as the distribution of
mean shot lengths, or other additional information like statistics or
motion compensation, to improve the detection. We finally get the diagram
proposed by Hanjalic [95] and presented in Figure 5.8.

Feature extraction and metric

A simple, efficient and widely used feature for the detection of key-frames is the
luminance or color intensity histogram. RGB or YUV histograms with a moderate
number of bins (figure 5.9 uses 256 and 64 luminance bins) have proved to be
simple and yield good results for hard cut detection. They are only slightly
sensitive to camera or object motion. In order to compute the difference between
two histograms, different metrics can be used. If $H_t$ and $H_{t+l}$ are the
histograms at time $t$ and $t+l$, the discontinuity value can be obtained by the
$L_1$ norm distance:

    $$ z(t, t+l) = \sum_{j=1}^{N} \left| H_t(j) - H_{t+l}(j) \right| \tag{5.7} $$

where $N$ is the number of bins.

Figure 5.8: Key-frame detector.

Figure 5.9: Discontinuity measure computed by the $L_1$ norm on the luminance
histogram with 256 bins (a) and 64 bins (b).

Figure 5.9 illustrates the evolution of the discontinuity measure on a test
sequence using two different luminance histograms.
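
The histogram feature and the $L_1$ discontinuity measure can be sketched as follows. This is only an illustration under stated assumptions: frames are taken to be plain lists of rows of 8-bit luminance values, and the function names (`luminance_histogram`, `discontinuity`, `discontinuity_series`) are hypothetical, not taken from the original work.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # rows of 8-bit luminance values (assumed layout)


def luminance_histogram(frame: Frame, bins: int = 64) -> List[int]:
    """Count pixels per luminance bin; values are assumed to lie in 0..255."""
    hist = [0] * bins
    for row in frame:
        for value in row:
            hist[value * bins // 256] += 1
    return hist


def discontinuity(h_a: Sequence[int], h_b: Sequence[int]) -> int:
    """L1 distance between two histograms with the same number of bins."""
    return sum(abs(a - b) for a, b in zip(h_a, h_b))


def discontinuity_series(frames: Sequence[Frame], bins: int = 64,
                         skip: int = 1) -> List[int]:
    """z[t] compares frame t with frame t - skip; the first `skip` entries are 0."""
    hists = [luminance_histogram(f, bins) for f in frames]
    head = [0] * skip
    tail = [discontinuity(hists[t - skip], hists[t])
            for t in range(skip, len(hists))]
    return head + tail
```

With two dark frames followed by two bright frames, the series peaks on the first bright frame, which is the frame just after the cut.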
Threshold determination

Many authors set the threshold value heuristically. However, some automatic
threshold determination methods have been proposed.
    A global automatic threshold was introduced in [96]. The threshold $T$ is
set at $\mu + \alpha\sigma$, where $\mu$ and $\sigma$ are the mean and the
standard deviation of the discontinuity value on the whole film. It assumes
that, within a shot, all changes are due to noise which has a Normal
distribution. Under this hypothesis, when setting parameter $\alpha$ to 3, about
99.9% of the discontinuity measures within a shot are under the threshold. All
greater values can be considered as key-frames.
    A weak point in this approach is that it does not take into account a possible
evolution of the discontinuity measure statistics along the sequence. It can also
happen that many successive frames are considered as key-frames.
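
A minimal sketch of this global threshold, assuming the discontinuity values are already available as a list of numbers. The names `global_threshold` and `detect_global` are illustrative, and the population standard deviation is an assumption (the text does not say which estimator is used).

```python
from statistics import mean, pstdev
from typing import List, Sequence


def global_threshold(z: Sequence[float], alpha: float = 3.0) -> float:
    """Threshold mu + alpha * sigma computed over the whole sequence."""
    return mean(z) + alpha * pstdev(z)


def detect_global(z: Sequence[float], alpha: float = 3.0) -> List[int]:
    """Indices whose discontinuity value exceeds the global threshold."""
    threshold = global_threshold(z, alpha)
    return [t for t, value in enumerate(z) if value > threshold]
```

Because the statistics are computed once for the whole film, a long quiet passage and a very active one share the same threshold, which is the weak point noted above.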
    A better alternative is to work with adaptive thresholds. Yeo and Liu
[97] proposed the use of a temporal sliding window, of size $2m-1$, centered on
frame $t$. Writing $z(t)$ for the discontinuity value associated with frame $t$,
a hard cut is detected on frame $t$ if both the following conditions are met:

   1. The discontinuity value at the center of the window is the maximum value
      within the observation window:

      $$ z(t) = \max_{t-m < i < t+m} z(i) \tag{5.8} $$

      In this way, we avoid selecting key-frames too close to one another.

   2. The discontinuity value at the center of the window is higher than
      $n$ times the second highest value $z_{sm}(t)$ in the window:

      $$ z(t) > n \, z_{sm}(t) \tag{5.9} $$

The drawback of this approach is that the parameter $n$ must be experimentally
chosen. Moreover, in low and flat regions, it happens that a key-frame is
detected within a shot with a very low discontinuity value.


Our key-frame detection algorithm is a combination of the previously presented
approaches and can be summarized as follows:

   1. Compute the discontinuity values

      $$ z(t) = \sum_{j=1}^{64} \left| H_t(j) - H_{t-1}(j) \right| \tag{5.10} $$

      where $H_t$ is the 64-bin luminance histogram of the frame at position
      number $t$.

   2. Compute the adaptive threshold

      $$ T_a(t) = n \, z_{sm}(t) \tag{5.11} $$

      where $z_{sm}(t)$ is the second highest value of $z(i)$ for
      $t-m < i < t+m$.

   3. Compute the pseudo-global threshold

      $$ T_g(t) = \mu(t) + \alpha \, \sigma(t) \tag{5.12} $$

      where $\mu(t)$ and $\sigma(t)$ are the mean and the standard deviation of
      the discontinuity values, computed in the range $[t-M, t+M]$.

   4. Key-frame detection. Frame $t$ is a key-frame if
      $z(t) > \max(T_a(t), T_g(t))$.

Parameters $m$, $M$, $n$ and $\alpha$ must be set experimentally. Nevertheless,
the following considerations can be helpful to determine appropriate values:

   • $m$ is the minimum number of frames between two shot boundaries. As shot
     lengths in a movie are not shorter than one second, $m$ will be around 25.
   • $M$ must be high enough to yield reliable statistics for the computation of
     the pseudo-global threshold. $M$ will necessarily be bigger than $m$.
   • The value of $n$ must be taken in the range reported to give good results by
     Yeo and Liu [97].
   • The value of $\alpha$ must be taken in the range suggested by the results
     presented by Zhang [96].
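
The four steps above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the discontinuity values are supposed to be precomputed, the windows are simply truncated at the sequence borders, and the default values of `n` and `alpha` are placeholders chosen for the example, not the values recommended in [96] or [97].

```python
from statistics import mean, pstdev
from typing import List, Sequence


def detect_key_frames(z: Sequence[float], m: int = 25, big_m: int = 100,
                      n: float = 2.5, alpha: float = 3.0) -> List[int]:
    """Combine an adaptive and a pseudo-global threshold on the values z."""
    key_frames = []
    for t, value in enumerate(z):
        # Adaptive threshold: n times the second highest value in the
        # window of (at most) 2m - 1 frames centered on t.
        local = z[max(0, t - m + 1): t + m]
        second = sorted(local)[-2] if len(local) > 1 else 0.0
        adaptive = n * second
        # Pseudo-global threshold: mean + alpha * std over [t - M, t + M].
        wide = z[max(0, t - big_m): t + big_m + 1]
        pseudo_global = mean(wide) + alpha * pstdev(wide)
        if value > max(adaptive, pseudo_global):
            key_frames.append(t)
    return key_frames
```

For `n` greater than or equal to 1 the adaptive test can only succeed when the central value is the strict maximum of its window, so the "maximum within the window" condition does not need to be checked separately.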
5.2.2 Alignment of key-frames
The aim of the technique described in this section is to correctly match a
maximum number of detected key-frames in both sequences. Let $X$ and $Y$ be
the sequences of detected key-frames in the original and the distorted video
sequence. Let us denote by $x_i$ (resp. $y_j$) the position of the $i$th
(resp. $j$th) detected key-frame in the original sequence (resp. distorted
sequence). The main reasons why the position numbers of the key-frames are not
the same in both sequences are the following.

    • The first one is that the hacker will start to film the original movie a
      little before or a little after the beginning of the film. He may also
      crop parts of the movie. This causes a temporal translation of the frames
      between both films. Let $b$ be this translation parameter.

    • The second reason is that the frame rates of the original and of the copy
      might not be the same. Movies are projected at 24 fps and cameras film at
      25 fps (PAL) or 29.97 fps (NTSC). Let us call $a$ the ratio of the frame
      rates,

      $$ a = \frac{f_{copy}}{f_{orig}} \tag{5.13} $$

Note that these two parameters $a$ and $b$ can change along the movie sequence.
However, we can reasonably consider that changes of $b$ will not be frequent and
that $a$ will only vary very slowly. We can consider the relation between two
corresponding elements $x_i$ and $y_j$ as being a first order relation:

      $$ y_j = a \, x_i + b \tag{5.14} $$

    The principle of the algorithm is to estimate $a$ and $b$ locally, making
the assumption that they stay constant over short time intervals. We work with
two observation windows sliding simultaneously over both sequences $X$ and $Y$,
as depicted in figure 5.10. We define $W_X$ as being a sliding window of length
$N_X$ on the original sequence $X$. That means that, at the first step, $W_X$ is
filled with the first $N_X$ elements of $X$. We call $W_X(i)$ the $i$th element
of $W_X$. Similarly, $W_Y$ is a sliding window of length $N_Y$ over sequence
$Y$. We call $C$ the set of key-frames of the original sequence matched at the
end of the process; $C(i)$ is the $i$th element of this set. As illustrated in
figure 5.10, the proposed algorithm consists in an iteration of three phases:

   1. Try all possible correspondences between the elements of the two
      observation windows.
   2. Select the combination which minimizes a cost function.
   3. Shift the observation windows to one key-frame beyond the first pair of
      key-frames of the selected combination.

Figure 5.10: Principle of the algorithm. Panels: detected key-frames in the
original sequence; detected key-frames in the copy sequence; targeted alignment;
consider all matching combinations within the observation windows; select the
optimal combination and progress along the sequences.

    If both windows, $W_X$ (on the original sequence) and $W_Y$ (on the copy),
have the same length, $N_X = N_Y$, then exactly the same key-frames must be
detected in both sequences because every element of $W_X$ must correspond to an
element in $W_Y$. If $N_Y$ is chosen bigger than $N_X$, then $N_Y - N_X$
key-frames from $W_Y$ can be dropped. Nevertheless, all the elements of $W_X$
have to be matched. Choosing $N_X$ bigger than $N_Y$ has the opposite effect:
all the $N_Y$ elements of $W_Y$ must be matched with an element from $W_X$. As
the key-frame detection does not yield exactly the same key-frames in both
movies, an additional parameter is introduced which determines how many
key-frames can be dropped in the smallest window. This parameter is called
omit. In the following, we impose a constraint on the window lengths (5.15).
    The first phase of the algorithm consists in establishing all possible
correspondences between elements of windows $W_X$ and $W_Y$. Not all
combinations of correspondences need to be considered: many can be discarded a
priori. We distinguish two kinds of restrictions. The first ensures temporal
consistency of the combination, while the second checks for feature
correspondence of the matched key-frames.

Temporally consistent combinations. This restriction is based on temporal
coherence requirements. One key-frame in the original sequence corresponds to
at most one key-frame in the copy sequence. Moreover, if elements $W_X(i)$ and
$W_Y(j)$ are matched together, it is not possible to match elements higher
(resp. lower) than $W_X(i)$ with elements lower (resp. higher) than $W_Y(j)$. It
is assumed that the hacker does not permute the order of the shots. The number
of combinations which are possible between $W_X$ and $W_Y$ then follows from
counting order-preserving selections: a combination with $k$ matched pairs is
fixed by choosing $k$ elements in each window, which gives
$\binom{N_X}{k}\binom{N_Y}{k}$ possibilities for each admissible $k$.

The number of combinations increases quickly with the size of the windows. In
order to maintain an acceptable processing time, parameters $N_X$ and $N_Y$ are
limited to 11, and parameter omit is bounded as well.
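
To give a feeling for how quickly the search space grows, the temporally consistent combinations can be enumerated as below. This is a sketch under a simplifying assumption that is not stated in the text: exactly `k` key-frames are matched in each window. The function names are hypothetical.

```python
from itertools import combinations
from math import comb
from typing import Iterator, List, Tuple


def consistent_combinations(n_x: int, n_y: int,
                            k: int) -> Iterator[List[Tuple[int, int]]]:
    """Yield order-preserving matchings of exactly k key-frames per window.

    Each matching is a list of (index in original window, index in copy window)
    pairs; both index sequences are strictly increasing, so shots are never
    permuted and no key-frame is used twice.
    """
    for xs in combinations(range(n_x), k):
        for ys in combinations(range(n_y), k):
            yield list(zip(xs, ys))


def count_combinations(n_x: int, n_y: int, k: int) -> int:
    """Closed-form count of the matchings produced above."""
    return comb(n_x, k) * comb(n_y, k)
```

For two windows of 11 key-frames with 9 matched pairs, `count_combinations(11, 11, 9)` already gives 3025 candidates per window position, which is why the window lengths have to stay small.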

Figure 5.11: Mean luminance difference of original movie (a) and copy movie (b).
Vertical axis: mean luminance first derivative.

Feature compliant combinations. In addition to the temporal restriction, the
number of possible combinations can be further reduced by imposing a feature
matching criterion. A robust luminance characteristic is extracted for each
detected key-frame. As illustrated in figure 5.11, the sign of the mean luminance
difference of two successive frames is a very robust characteristic. All
combinations which do not fulfill this sign criterion for all key-frame pairs can
be discarded.
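
The sign criterion can be sketched as follows, assuming that the mean luminance of every frame is available as a list of numbers and that each key-frame index is at least 1. The helper names are illustrative only.

```python
from typing import List, Sequence, Tuple


def sign(value: float) -> int:
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def luminance_signs(mean_luma: Sequence[float],
                    key_frames: Sequence[int]) -> List[int]:
    """Sign of the mean-luminance change at each key-frame (t versus t - 1)."""
    return [sign(mean_luma[t] - mean_luma[t - 1]) for t in key_frames]


def is_feature_compliant(pairs: Sequence[Tuple[int, int]],
                         signs_x: Sequence[int],
                         signs_y: Sequence[int]) -> bool:
    """True if every matched pair (i, j) has the same luminance-change sign."""
    return all(signs_x[i] == signs_y[j] for i, j in pairs)
```

A combination is rejected as soon as one of its pairs associates a key-frame where the picture gets brighter with one where it gets darker.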

Selection of the optimal matching combination between $W_X$ and $W_Y$. The
basis of the algorithm is the assumption that the values of $a$ (frame rate
ratio) and $b$ (temporal translation) can be considered constant over the $W_X$
and $W_Y$ observation windows. The algorithm proceeds by considering each
combination which matches $K$ key-frames of $W_X$ with $K$ key-frames of $W_Y$.
As discussed previously, non-compliant combinations are discarded. Let us call
$\tilde{x}_k$ and $\tilde{y}_k$ (with $k = 1, \dots, K$) the selected
key-frames in $W_X$ and $W_Y$ for the considered combination. A correct
combination, that is a combination where the selected key-frames are properly
matched, should verify the following relation:

    $$ \tilde{y}_k \approx a \, \tilde{x}_k + b, \qquad k = 1, \dots, K \tag{5.17} $$

The associated $a$ and $b$ values can be estimated using a least square error
optimization approach. Optimum $a$ and $b$ minimize

    $$ E_s = \sum_{k=1}^{K} \left( \tilde{y}_k - a \, \tilde{x}_k - b \right)^2 \tag{5.18} $$

Let us call $E_s$ the synchronization error of the combination. Corresponding
$a$ and $b$ values are given by

    $$ a = \frac{\sum_{k=1}^{K} (\tilde{x}_k - \bar{x})(\tilde{y}_k - \bar{y})}
                {\sum_{k=1}^{K} (\tilde{x}_k - \bar{x})^2} \tag{5.19} $$

    $$ b = \bar{y} - a \, \bar{x} \tag{5.20} $$

where $\bar{x}$ and $\bar{y}$ are the means of the $\tilde{x}_k$ and of the
$\tilde{y}_k$.
    Finally, one can impose a relative smoothness in the evolution of the $a$
and $b$ factors as the sliding windows progress along the sequences. This
restriction is taken into account by minimizing a smoothness error $E_{sm}$
(5.21), which penalizes deviations of the current estimates from those obtained
at the previous window positions.
    The selection of the best matching combination of key-frames between $W_X$
and $W_Y$ results from the joint minimization of $E_s$ and $E_{sm}$. One has to
consider the possible removal of a whole scene in the copied sequence. This is
achieved by monitoring the evolution of $E_s$ as the windows progressively slide
along the sequences and by detecting a sudden increase of the synchronization
error. In such circumstances, adequate steps can be taken to resynchronize the
observation windows.
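
The least-squares estimation of the first order relation for one candidate combination can be sketched as below. It is an ordinary linear regression written with the standard closed-form solution; the function name `fit_first_order` and the returned tuple layout are assumptions made for the example.

```python
from typing import Sequence, Tuple


def fit_first_order(xs: Sequence[float],
                    ys: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = a * x + b in the least-squares sense.

    xs are key-frame positions in the original, ys the positions matched to
    them in the copy. Returns (a, b, error), where error is the sum of squared
    residuals, used here as the synchronization error of the combination.
    At least two distinct xs are required.
    """
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    error = sum((y - a * x - b) ** 2 for x, y in zip(xs, ys))
    return a, b, error
```

With original key-frames at positions 0, 24, 48 and 96 and copy key-frames at 10, 35, 60 and 110, the fit recovers a ratio of 25/24 and a shift of 10 frames with a negligible error, whereas a wrong pairing of the same key-frames leaves a large residual.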

5.2.3 Results
The analysis of the performance of our registration method consists of two parts.
First, the key-frame detection algorithm is evaluated. Second, the performance
of the key-frame matching algorithm is presented.
     Tests were conducted on two different original video sequences. The first one
is the trailer of "Star Wars: Episode I". In this sequence, shots are short and very
different. The second sequence is a "classical" movie sequence.
     The first sequence, star ori, is extracted from the DVD. Its frame rate is
25 fps and it contains 2400 frames. The copy of the "Star Wars" trailer is called
star copy. The video was projected on a 2 m × 1.5 m screen using a consumer video
projector, and filmed with a fixed camera. Its frame rate is 25 fps. We noticed
that the frame rate changes at the middle of the trailer. This can be explained
by the fact that the computer did not play the movie at a constant rate.
     The second test sequence, film ori, is part of a digital movie. This sequence
is quite dark and contains a lot of motion. Its frame rate is 24 fps. The length
of the sequence is about 10 min (15000 frames). A first copy of this second test
sequence, film fixed copy, was filmed in a real theater room with a
tripod-mounted camera, at a rate of 25 fps. A second copy, film cam copy, was
filmed freehand in a real theater room. The camera is moving and zooming, and
there are also people walking in front of the screen.
    A good key-frame detection algorithm detects identical key-frames in both
sequences. In order to characterize the performance of the method, one can use
the recall and the precision criteria, which are widely used in shot boundary
detection reports. They are defined as:

    $$ \text{Recall} = \frac{N_{correct}}{N_{correct} + N_{missed}} \tag{5.22} $$

    $$ \text{Precision} = \frac{N_{correct}}{N_{correct} + N_{false\,alarm}} \tag{5.23} $$

In our digital cinema application, the "correct" key-frames correspond to
key-frames detected in both sequences. The "missed" key-frames are detected in
the original sequence but not in the copy sequence. The "false alarms"
correspond to key-frames detected in the copy sequence which are not detected
in the original sequence. Table 5.1 compares recall and precision results for
different detection threshold determination methods.
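
As a small illustration of these two criteria, the sketch below computes them from two sets of key-frame identifiers. It assumes that the key-frames of the copy have already been expressed with the same identifiers as those of the original (for example ground-truth shot numbers), which is a convenience of the example and not something the text specifies.

```python
from typing import Iterable, Tuple


def recall_precision(original_keys: Iterable[int],
                     copy_keys: Iterable[int]) -> Tuple[float, float]:
    """Recall and precision of key-frame detection between two sequences."""
    original, copy = set(original_keys), set(copy_keys)
    correct = len(original & copy)      # detected in both sequences
    missed = len(original - copy)       # in the original only
    false_alarm = len(copy - original)  # in the copy only
    recall = correct / (correct + missed) if correct + missed else 0.0
    precision = (correct / (correct + false_alarm)
                 if correct + false_alarm else 0.0)
    return recall, precision
```

For instance, with key-frames {1, 2, 3, 4} in the original and {2, 3, 4, 5, 6} in the copy, three key-frames are correct, one is missed and two are false alarms, giving a recall of 0.75 and a precision of 0.6.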
    The same recall and precision criteria are used to evaluate the key-frame
matching algorithm. "Missed" matches correspond to key-frames that were
detected in both sequences but could not be matched together. "False alarm"
matches correspond to key-frame pairs erroneously matched together. Table 5.2
presents the results of the key-frame alignment for the different test sequences
and the different key-frame detection approaches. One can observe that the
combination of global and pseudo-adaptive threshold determination yields the
best results.
    Frames not considered as key-frames are matched by interpolating between the
successful surrounding key-frame matches. The most important criterion for the
key-frame matching algorithm is thus the precision criterion. Indeed, wrong
matches corrupt the alignment of all surrounding frames.
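
The interpolation of ordinary frames between matched key-frames can be sketched as follows. Linear interpolation is an assumption here (the text only says "interpolating"), as are the function name and the choice to reject frames outside the matched range.

```python
from bisect import bisect_right
from typing import Sequence, Tuple


def map_frame(t: float, matches: Sequence[Tuple[float, float]]) -> float:
    """Map an original frame position t to a (fractional) copy position.

    matches is a list of (original position, copy position) key-frame pairs,
    sorted by original position and containing at least two pairs.
    """
    xs = [x for x, _ in matches]
    if not xs[0] <= t <= xs[-1]:
        raise ValueError("frame lies outside the matched key-frame range")
    # Index of the segment [xs[i], xs[i + 1]] that contains t.
    i = min(max(bisect_right(xs, t) - 1, 0), len(xs) - 2)
    (x0, y0), (x1, y1) = matches[i], matches[i + 1]
    return y0 + (t - x0) * (y1 - y0) / (x1 - x0)
```

A single wrong pair in `matches` shifts every frame of the two segments that touch it, which illustrates why precision matters more than recall for the matching stage.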
    Our registration algorithm has shown reliable behavior for a variety of
situations:

     • Temporal shift, positive (copy film starts before original) and negative
       (copy film starts after original).
     • Different frame rates (24 fps and 25 fps).
     • Temporal cropping combined with a different frame rate.
     • Removing one frame every $n$ frames over the whole film.
     • Removing one frame every $n$ frames, starting the removal at the middle
       of the sequence.
     • Rate change during the movie.
     • Temporal cropping with fewer than 300 dropped frames (12 sec) during the
       movie.

    The matching resolution is smaller than one frame and satisfies watermarking
application requirements. The algorithm was successfully tested to improve the
performance of a watermarking scheme working with the global mean luminance.

    Original sequence   Copy sequence       Recall     Precision
    ------------------------------------------------------------
    star ori            star copy           76.0 %     76.0 %
    film ori            film fixed copy     ...        34.7 %
    film ori            film cam copy       59.8 %     39.3 %

    star ori            star copy           75.0 %     91.7 %
    film ori            film fixed copy     ...        85.9 %
    film ori            film cam copy       82.2 %     84.5 %

    star ori            star copy           84.6 %     97.1 %
    film ori            film fixed copy     80.8 %     86.8 %
    film ori            film cam copy       79.5 %     85.3 %

Table 5.1: Key-frame detection results (one block of rows per threshold
determination method).

    Original sequence   Copy sequence       Recall     Precision
    ------------------------------------------------------------
    star ori            star copy           78.9 %     100 %
    film ori            film fixed copy     8.1 %      16.2 %
    film ori            film cam copy       ...        inf %

    star ori            star copy           63.6 %     100 %
    film ori            film fixed copy     ...        ...
    film ori            film cam copy       65.0 %     100 %

    star ori            star copy           69.7 %     100 %
    film ori            film fixed copy     64.4 %     100 %
    film ori            film cam copy       62.0 %     100 %

Table 5.2: Key-frame matching results (one block of rows per key-frame
detection approach).

5.3 Spatial registration in presence of geometrical deformations
As explained in section 5.1.1, the fingerprint extraction is done with the help of
the unmodified content. Before trying to read the embedded watermark message, we
try to compensate for the deformation applied to the analyzed image or video. We
make the hypothesis that a prior temporal synchronization was successfully
achieved. No assumption is made on the content of the frames that need to be
registered. The compensation method we present in this section proceeds as
follows. We first apply a modified block-matching algorithm in order to find a
set of displacement vectors between the original frame and the geometrically
distorted frame. A transformation model is chosen, either with global parameters
or with parameters varying across regions within the image. We then proceed to
a minimum mean square error estimation of the transformation parameters on the
basis of the estimated displacement vectors. Finally, the estimated
transformation is applied to the deformed frame to produce a restored image as
close to the original as possible.
5.3.1 Block matching
Estimating the deformation between two pictures generally consists in
minimizing a function that expresses given constraints. Matching methods are
based on an explicit search for the best match between two structures, usually
blocks of pixels. We call this technique the Block Matching Algorithm. Since its
introduction by Jain and Jain in 1981, the Block Matching Algorithm (BMA) [99]
has emerged as the motion estimation technique achieving the best compromise
between complexity and quality: a fast estimation procedure yields a
block-based motion field that can be transmitted at low cost. An appropriate
choice of the block size offers a compromise between adaptation to small moving
objects (achieved with small blocks) and robustness against noise (achieved with
large blocks). These properties have led to the inclusion of the BMA in most
video standards, such as H.263 [100], MPEG-1/2 [101] and MPEG-4 [102].
    In our application, the BMA is used to estimate displacement vectors between
the original image and the deformed image. The principle is to apply a
translational motion model to sub-blocks of the image. For every block, the
matching measure is based on the difference between the reference (original)
image and the candidate (distorted) image. At the cost of increased complexity,
the translational motion model can be replaced by a more general affine
deformation model. Fuh and Maragos [103] accordingly consider the BMA, with its
only two free parameters (the translation vector), as a particular case of more
elaborate models. The shape of the "sub-blocks" could also be other than
rectangular (e.g. circular).
    Its implementation in most standards follows this procedure (fig. 5.12):

    1. the analyzed image is divided into a set of sub-blocks of fixed size;
    2. for every sub-block, the origin of the block in the reference image is
       searched for within a bounded search area around the block position,
       according to a given optimization criterion.

    This process provides a field of estimated displacement vectors
corresponding to a set of locations in the image, as illustrated in figure 5.13.
In some cases, the method may give wrong estimates for some locations. Using the
assumption of a smooth deformation field, most of those erroneous vectors can
be discarded.
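
A minimal sketch of the exhaustive block search described above. The sum of absolute differences is used as the optimization criterion, which is an assumption (the text leaves the criterion open), and images are represented as plain lists of rows; the name `match_block` is illustrative.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # rows of luminance values (assumed layout)


def match_block(analyzed: Image, reference: Image, bx: int, by: int,
                size: int, search: int) -> Tuple[int, int]:
    """Find where the analyzed block at (bx, by) originates in the reference.

    Every displacement (dx, dy) with |dx|, |dy| <= search is tried and the one
    with the smallest sum of absolute differences is returned.
    """
    height, width = len(reference), len(reference[0])
    best: Tuple[int, int] = (0, 0)
    best_cost: Optional[int] = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            # Skip candidate positions that fall outside the reference image.
            if x < 0 or y < 0 or x + size > width or y + size > height:
                continue
            cost = sum(abs(analyzed[by + j][bx + i] - reference[y + j][x + i])
                       for j in range(size) for i in range(size))
            if best_cost is None or cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best
```

Calling `match_block` for every sub-block of the analyzed frame yields the displacement field; vectors that disagree strongly with their neighbors can then be dropped under the smoothness assumption.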

Figure 5.12 (labels: search window, block, block under search).

5.3.2 Deformation model: global parameters versus local parameters
In a first approach, one considers that the image has undergone a global
transformation, or in other words that every pixel has been displaced following
the same unique transformation model. The number of degrees of freedom
(number of parameters) is given by the chosen transformation model. One could
also divide the image into many subregions and consider a transformation with
specific parameters in each of those regions. With this approach, the model can
approximate a very wide class of transforms. The number of degrees of freedom
and the total number of parameters are now proportional to the number of
subregions. It would be logical to impose continuity constraints between adjacent
regions and thereby limit the total number of parameters. As the exact global
model of deformation is often complex and difficult to approximate without a
large number of parameters, one can often reach better matching performance
using a partitioned model with a limited number of degrees of freedom. One has to
keep in mind that too large a number of parameters makes the estimation unstable
in the presence of noise and is therefore often not desirable.
     The two approaches are illustrated in fig. 5.14 for a projective transformation
model and for a 12-parameter curved pseudo-projective transformation model.

5.3.3 Minimum mean square error optimization
Given a set of coordinates $(x_i, y_i)$ and a set of corresponding coordinates
$(x'_i, y'_i)$ in the transformed image, one can estimate the transformation
parameters using minimum mean square error optimization. Let us express the
transformation as $\mathcal{T}_{\mathbf{p}}(x, y)$, where $\mathbf{p}$ is the
vector of parameters:

    $$ (x'_i, y'_i) = \mathcal{T}_{\mathbf{p}}(x_i, y_i) \tag{5.24} $$

The function $E$ to be minimized can be written as:

    $$ E = \sum_i w_i \left\| \mathcal{T}_{\mathbf{p}}(x_i, y_i) - (x'_i, y'_i) \right\|^2 \tag{5.26} $$

where $w_i$ is a weighting factor representing the confidence attached to the
$i$th pair of coordinates. When the transformation $\mathcal{T}_{\mathbf{p}}$ is
linearly dependent on the vector of parameters $\mathbf{p}$, as is the case for
most transformation models presented in section 5.1.3, we can express $E$ as
follows:

                    � �                                       �       �         � �
    ��� �                                   �       ���                        �         � �
          �    ���          �                             �       �
                                        �   ���                            �       � �
                                �                                                            �

                        �           �                     �       �       �

                                                                                     �                                   �
                                                                                                         � � �
                                                                                         �                           �
                                                                                             �       �     �    �




    � �                                                           �                                  � �           � �
 �            � �                                                              �        ���                    �
                                                                      �                              � �
�     � �                                                                                                  �        � �
               �                                                          � �      ���

                                                                 �                        ���
� �                                                                       ��                       �p ����
  �       �                                                          ��              ��


                                                         � �                     �
                                   �f        ��                  �                        �
                               ��                 �
                                                   ��                                         (5.27)
                                        f�              � �           ��             ��
                                         �                                   �

The function to be minimized becomes

   ���        �����
          �               ��                 ��             ��                       ���
                               �                                     �
      � ���f               � ��p�             ���f            � ��p�                           (5.28)
   �                  �                                 �

and the solution is given by a linear system:


                                                                                                   � �       �
                                                                                                        �   ��

                     � � f
                      �       � f



�                     � �
�                         �       �

                � �    �
                 �   ��


                     � � f
                      �       � f


            �                                                      � �
             �                                                         �   �

            ���                                                    � ���



�  ��

                     �          �           �
                          � ���
                      � �           �

                                                       � �
    � �                                         f                   �
        �                                            �
    �                                               � � �          �
�                                               �              �           ��

                           ��                              �

                     ��                                            � ��
                                                               � �


                                                          �       ���
        �                                                 �               �
  �                                                           �
                                                                      � ��    ��
  �         �           �                         ��   ��              �
                �               �        � f          �       ��          �
                                    ���       �
        � �         �       �             �
    �                   �

                ���� �
                                          �                                            �
                                                   ��                                          ��

                        �           ��        ��                              ��       ��
                            ���       �           �       ���� ���                 �       �
                            �                                         �


   Expressions 5.27 to 5.29 correspond to a global deformation model �. Similar
expressions can be obtained for a deformation model in which parameters can
take different values in several subregions of the image.
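
For a model that is linear in its parameters, the weighted least-squares fit amounts to accumulating a small symmetric system from the matched coordinate pairs and solving it. The following is a minimal sketch under stated assumptions: the 6-parameter affine basis, the function names, and the plain Gaussian elimination solver are all illustrative choices, not the chapter's implementation.

```python
def basis_affine(x, y):
    """Basis vectors (f_x, f_y) of an assumed 6-parameter affine model,
    so that x' = f_x . p and y' = f_y . p."""
    return [x, y, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, x, y, 1.0]


def solve_linear(a, b):
    """Solve a p = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough independent matches")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    p = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * p[c] for c in range(r + 1, n))
        p[r] = (m[r][n] - s) / m[r][r]
    return p


def fit_weighted(src, dst, weights, basis=basis_affine):
    """Estimate p minimizing sum_i w_i * ||dst_i - T_p(src_i)||^2."""
    n = len(basis(0.0, 0.0)[0])
    a = [[0.0] * n for _ in range(n)]  # sum_i w_i (f_x f_x^T + f_y f_y^T)
    b = [0.0] * n                      # sum_i w_i (x'_i f_x + y'_i f_y)
    for (x, y), (xp, yp), w in zip(src, dst, weights):
        fx, fy = basis(x, y)
        for f, target in ((fx, xp), (fy, yp)):
            for r in range(n):
                b[r] += w * target * f[r]
                for c in range(n):
                    a[r][c] += w * f[r] * f[c]
    return solve_linear(a, b)
```

Passing a different `basis` function (any model linear in its parameters) reuses the same accumulation. Setting a weight to zero removes the corresponding match from the estimate.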
5.3.4 Results
In order to test the efficiency of our scheme, we simulated the acquisition of
a theater projection with a handheld camcorder. We worked on high-resolution
images (1920x1080). The original data was watermarked: 64 bits were embedded
with a previously presented spatial domain algorithm [32]. The image was then
scaled close to the resolution of a digital video (DV) camera (1024x576), and a
curved pseudo-projective transform was applied. Finally, the result was cropped
to fit the DV format, giving an image of size 720x576. This process is
illustrated in figure 5.15.
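
To make the resolution bookkeeping concrete, the sketch below maps a pixel of the original frame to the final cropped frame. It models only the scaling and cropping steps, leaves out the curved pseudo-projective transform, and assumes a horizontally centred crop, which the text does not state. The function and constant names are hypothetical.

```python
SRC_W, SRC_H = 1920, 1080   # original high-resolution frame
MID_W, MID_H = 1024, 576    # size after scaling
DV_W, DV_H = 720, 576       # final cropped size


def original_to_dv(x, y):
    """Map a pixel of the original frame to the cropped frame.

    Only scaling and cropping are modelled; the crop is ASSUMED to be
    centred. Returns None if the pixel falls outside the crop window.
    """
    xs = x * MID_W / SRC_W
    ys = y * MID_H / SRC_H
    x_off = (MID_W - DV_W) / 2   # assumed centred crop: 152 columns per side
    y_off = (MID_H - DV_H) / 2   # 0: the heights already match
    xc, yc = xs - x_off, ys - y_off
    if 0 <= xc < DV_W and 0 <= yc < DV_H:
        return xc, yc
    return None
```

Under these assumptions, about 30% of the scaled image width is lost to cropping, so matches near the left and right borders of the original have no counterpart in the acquired frame.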
    Our geometrical compensation method was then applied and the watermark
payload was extracted. Figure 5.16 shows the bit error rate (BER) as a function
of the distortion amplitude for a trapezoidal transform, that is, when no
curvature is applied. The distortion amplitude is measured by the amplitude of
the vector illustrated in figure 5.15.
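
The BER is the fraction of payload bits that are decoded incorrectly. A minimal sketch, assuming the embedded and extracted payloads are available as equal-length bit sequences (the function name and the example payload are illustrative):

```python
def bit_error_rate(embedded, extracted):
    """Return the fraction of positions where the two bit sequences differ."""
    if len(embedded) != len(extracted):
        raise ValueError("payloads must have the same length")
    if not embedded:
        raise ValueError("payload is empty")
    errors = sum(1 for a, b in zip(embedded, extracted) if a != b)
    return errors / len(embedded)


# Hypothetical 64-bit payload with 4 bits flipped during extraction.
payload = [i % 2 for i in range(64)]
received = payload[:]
for k in (3, 17, 40, 63):
    received[k] ^= 1
ber = bit_error_rate(payload, received)  # 4 / 64 = 0.0625
```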
[Figure: plot with horizontal axis "Distortion Amplitude", ticks from 0 to 60]
