                Spread-Spectrum Audio Watermarking:
            Requirements, Applications, and Limitations
                        Darko Kirovski and Henrique Malvar
                                 Microsoft Research
                    One Microsoft Way, Redmond, WA 98052, USA

        Abstract - Watermarking has recently been adopted as a technology of choice for
    many applications related to e-commerce of audio content. We present a brief summary
    of a set of spread-spectrum watermarking techniques for effective covert communication
    over an audio signal carrier. Watermark robustness is achieved using redundant
    spread-spectrum encoding, which protects against de-synchronization attacks. We improve
    watermark inaudibility by detecting, and not watermarking, blocks of audio where a
    spread-spectrum sequence, if added to the frequency spectrum, would be audible. Finally,
    we overview the security limitations of our technology with respect to parameter selection
    and position it with respect to three main applications of watermarking: (a) content
    screening, (b) tracing unlicensed content distribution, and (c) robust metadata.
         Music watermarking schemes rely on the imperfections of the human auditory
    system (HAS) [1]. Commonly, data hiding techniques exploit the fact that the HAS
    is insensitive to small amplitude changes in the frequency domain [2], [3]. Data
    modulation is usually carried out using spread-spectrum (SS) [2] or quantization
    index modulation (QIM) [4]. Advantages of SS and QIM watermarking include: (i)
    testing for watermarks (WMs) does not require the original and (ii) it is difficult to
    extract the hidden data using statistical analysis under certain conditions [5]. In
    addition, SS WM detection is exceptionally resilient to attacks that can be modeled
    as time- and frequency-axis scaling with fluctuations [6]. In this paper, we overview
    an SS audio watermarking technology that uses several techniques to alleviate the
    main disadvantages of straightforward applications of both SS and QIM: (i) it is
    robust with respect to common signal processing attacks that take advantage of
    knowledge of the marking and detection algorithms, (ii) WM detection is
    self-synchronous, and (iii) it has improved WM imperceptibility mechanisms.
         In this paper, we describe a redundancy coding for SS that prevents desynchronization
    attacks [7]. We show how WM inaudibility can be improved by detecting, and
    not watermarking, blocks of audio where an SS sequence, if added to the frequency
    spectrum, would be audible. Next, we discuss the robustness and reliability of WM
    detection. Finally, we quantify the security limitations of our technology and position
    it with respect to the main applications of watermarking.
         Let us denote as $\vec{x}$ the original signal vector to be watermarked. It represents a
    block of samples from an appropriate invertible transformation on the original signal.
    The corresponding watermarked vector is generated by $\vec{y} = \vec{x} + \vec{w}$, where the
    WM $\vec{w}$ is a sequence of elements $w_i$ (chips) with two equiprobable values, i.e.
    $w_i \in \{-A, +A\}$, generated independently with respect to $\vec{x}$. WMs are generated using
    a pseudo-random bit generator (PRG) initialized using a secret key. WM magnitude
    $A$ is set based on the HAS sensitivity to amplitude changes. A correlation detector
    performs the test for the presence of the WM:

        $c = \vec{y} \cdot \vec{w} = (\vec{x} + \vec{w}) \cdot \vec{w} = \vec{x} \cdot \vec{w} + \vec{w} \cdot \vec{w} = \vec{x} \cdot \vec{w} + N A^2$    (1)

0-7803-7025-2/01/$10.00 ©2001 IEEE.
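where $N$ is the vector cardinality, and the correlation between vectors $\vec{u}$ and $\vec{v}$ is defined
as $\vec{u} \cdot \vec{v} = \sum_i u_i v_i$. Since the original clip $\vec{x}$ can be modeled as a Gaussian random
vector, $x_i \sim \mathcal{N}(0, \sigma_x)$, a normalized correlation test can be presented as:

    $Q = \frac{c}{N A^2} = \rho + \mathcal{N}\left(0, \frac{\sigma_x}{A \sqrt{N}}\right)$    (2)

where $\rho = 1$ if the WM is present and $\rho = 0$ otherwise, and $\mathcal{N}(\mu, \sigma)$ denotes a Gaussian
with mean $\mu$ and standard deviation $\sigma$. The optimal detection rule is to
declare the WM present if $Q > T$. The choice of threshold $T$ controls the trade-off
between false alarm and misdetection probability. The probability that $Q > T$ is equal to:

    $\Pr[Q > T] = \frac{1}{2}\,\mathrm{erfc}\left(\frac{(T - \rho)\, A \sqrt{N}}{\sigma_x \sqrt{2}}\right)$    (3)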
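The marking and detection model described above can be illustrated with a short Python sketch. It is only a toy under stated assumptions, not the reference implementation: the host vector is synthetic Gaussian data, Python's `random.Random` stands in for the keyed PRG, and the function names and the threshold T = 0.5 are hypothetical choices made for this example.

```python
import random


def make_watermark(key, n, amplitude):
    """Return n chips drawn from {-A, +A}, reproducible from the secret key.

    random.Random is only a stand-in here; it is not a cryptographic PRG.
    """
    prg = random.Random(key)
    return [amplitude * prg.choice((-1.0, 1.0)) for _ in range(n)]


def embed(x, w):
    """Additive marking: y = x + w."""
    return [xi + wi for xi, wi in zip(x, w)]


def normalized_correlation(y, w, amplitude):
    """Q = (y . w) / (N * A^2): near 1 if the WM is present, near 0 otherwise."""
    c = sum(yi * wi for yi, wi in zip(y, w))
    return c / (len(w) * amplitude ** 2)


def detect(y, w, amplitude, threshold=0.5):
    """Declare the WM present if Q exceeds the threshold T (assumed 0.5 here)."""
    return normalized_correlation(y, w, amplitude) > threshold


if __name__ == "__main__":
    n, amplitude, sigma_x = 20000, 1.0, 5.0
    host = random.Random(1)
    x = [host.gauss(0.0, sigma_x) for _ in range(n)]  # Gaussian host model
    w = make_watermark("secret key", n, amplitude)
    y = embed(x, w)
    print("Q for marked clip:  ", normalized_correlation(y, w, amplitude))
    print("Q for unmarked clip:", normalized_correlation(x, w, amplitude))
```

Because the detector only needs the chip sequence regenerated from the key, the original clip is not required at detection time.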


     Several techniques for significantly improving the basic SS watermarking have
been presented in [1], [2], [3], [6]. We have selected SS as the backbone of our marking
technology because QIM has a number of deficiencies that are difficult to overcome.
The most important is that the QIM quantizer set needs to be pseudo-randomly
shifted for every marked element. If this is not done, a simple histogram of the
content would clearly reveal the quantizer values. This would enable a low-noise
obfuscation of the marked data that would prevent the detector from identifying the
WMs. Such pseudo-random shifts introduce additional sensitivity of QIM to desynchronization
attacks (e.g., StirMark [7]), which are substantially more difficult to handle
in QIM than in SS (see next Section). One significant deficiency of both SS and
QIM is that by breaking a single player (debugging, reverse engineering, or the
sensitivity attack [8]), one can extract the secret data (the key used to generate
the SS sequence $\vec{w}$, or the hidden quantizers in QIM) and recreate the original (for SS)
or create a new copy that will induce the QIM detector to treat it as unmarked.
     In the developed watermarking system, vector $\vec{x}$ is composed of the dB magnitudes
of several frames of a modulated complex lapped transform (MCLT) [9]. After
addition of the WM, we generate the time-domain marked audio signal by combining
the vector $\vec{y}$ with the original phase of $\vec{x}$, and passing those modified frames to
the inverse MCLT. WM amplitude $A$ is set to 1 dB. We have passed the "golden
ears test" at $A = 1.5$ dB [10]. For the typical 44.1 kHz sampling, we use a length-2048
MCLT. Only the coefficients within ~2-7 kHz (coefficients indexed 200-700) are
marked and only the audible magnitudes in the same subband are considered during
detection. In order to quantify the audibility of a particular frequency magnitude, we
use a simple psycho-acoustic frequency masking model (PAFM) [11].
      The correlation metric from Eqn.2 is reliable only if the majority of detection chips
$w_i$ are aligned with those used in marking. Thus, a malicious attacker can attempt to
desynchronize the correlation by time- or frequency-axis scaling within the loose
bounds of acceptable sound quality. To prevent such attacks, we use a multi-test
methodology that adds redundancy to the WM chip pattern to enable a reliable
correlation metric in the presence of scale modifications. Robustness to
de-synchronization attacks is provided in two steps.
     Step 1. REDUNDANT ENCODING - In the first step, we provide resilience
against fluctuations in playtime and pitch bending (wow-and-flutter) of up to a fixed
parameter wof, which describes the maximum fluctuation magnitude independently
along either of these two dimensions. As common standard values for wow-and-flutter
for modern turntables are significantly below wof = 0.01, we have adopted this
value as our robustness limit.
     We represent an SS sequence as a matrix of chips $w = \{w_{ij}\}$, $i = 1..\phi$, $j = 1..\lambda$,
where $\phi$ is the number of chips per MCLT block and $\lambda$ is the number of blocks of
$\phi$ chips per WM. Within a single MCLT block, each chip $w_{ij}$ is spread over a subband
of $F_i$ consecutive MCLT coefficients. Chips embedded in a single MCLT
block are then replicated along the time axis within $T_j$ consecutive MCLT blocks.
An example of how redundancies are generated is illustrated in Figure 1 (with parameters
$F_i = 3$, $i = 1..\phi$, $T_j = 3$, $j = 1..\lambda$). Widths of the encoding regions
$F_i$, $i = 1..\phi$, are computed using a geometric progression:

where $\delta_F$ is the width of the decoding region (central to the encoding region) along
the frequency axis and $wof' \ge wof$ is the desired robustness to fluctuating pitch scaling.
Similarly, the length of the WM $\lambda_0$, in groups of constant $T_j = T_0$, $j = 1..\lambda_0$,
MCLT blocks watermarked with the same SS chip block, is delimited by
$\lambda_0 T_0 \, wof' < T_0 - \delta_T$, where $\delta_T$ is the width of the decoding region along the time
axis. The lower bound on the replication in the time domain $T_0$ is set to 100 ms for
robustness against cropping or insertion.

Figure 1. Illustration of geometrically progressed redundancies applied to SS chips
within a single frequency-spectrum block (an MCLT block of frequency magnitudes) and
along the time axis. Each region is encoded with the same bit, whereas the detector
integrates only the center locations of each region.

     If a WM length of $\lambda_0 T_0$ MCLT blocks does not produce satisfactory correlation
convergence, additional MCLT blocks ($\lambda > \lambda_0$) are integrated into the WM. Time-axis
replication $T_j$, $j > \lambda_0$, for each group of these blocks is recursively computed
using a geometric progression (corresponding to Eqn.4). Within a region of $F_i T_j$
samples watermarked with the same chip $w_{ij}$, only the center $\delta_F \delta_T$ samples are
integrated in Eqn.2. It is straightforward to prove that such generation of encoding
and decoding regions guarantees that, regardless of an induced fluctuation limited by
$wof \le wof'$, the correlation test is performed in perfect synchronization. Typical
redundancy parameters are: (i) constant replication along the time axis over 8-16 MCLT
blocks and (ii) geometrically progressed replication along the frequency axis such that
typically 50-120 chips are embedded within the target subband 2-7 kHz.
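The idea of encoding a chip over a whole region while decoding only its center can be illustrated with a small Python sketch. It makes deliberate simplifications that are assumptions of this example, not part of the described system: the replication factors F and T are constant (as in the Figure 1 example) rather than geometrically progressed, the decoding region is a single coefficient, and a cyclic shift stands in for the small drift caused by wow-and-flutter. All function names are hypothetical.

```python
import random


def spread_chips(chips, F, T):
    """Replicate chips[i][j] over F frequency bins and T time blocks.

    Returns grid[frequency][time]; constant F and T are a simplification.
    """
    grid = []
    for row in chips:
        expanded = [chip for chip in row for _ in range(T)]
        grid.extend(list(expanded) for _ in range(F))
    return grid


def read_centers(grid, F, T):
    """Detector side: keep only the center coefficient of every F x T region."""
    phi, lam = len(grid) // F, len(grid[0]) // T
    return [[grid[i * F + F // 2][j * T + T // 2] for j in range(lam)]
            for i in range(phi)]


def shift(grid, df, dt):
    """Cyclic shift by df bins and dt blocks, a crude stand-in for small drift."""
    rows, cols = len(grid), len(grid[0])
    return [[grid[(r - df) % rows][(c - dt) % cols] for c in range(cols)]
            for r in range(rows)]


if __name__ == "__main__":
    prg = random.Random("demo key")
    chips = [[prg.choice((-1, 1)) for _ in range(5)] for _ in range(4)]
    grid = spread_chips(chips, F=3, T=3)
    for df in (-1, 0, 1):
        for dt in (-1, 0, 1):
            assert read_centers(shift(grid, df, dt), 3, 3) == chips
    print("center chips survive shifts of up to one bin and one block")
```

With F = T = 3 the center of each region stays inside the same region for any shift of at most one position, which is why the detector still reads the intended chip.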
     Step 2. MULTIPLE CORRELATION TESTS - The adversary can combine
wow-and-flutter with a stronger constant scaling in time and frequency. According
to SDMI's test requirements [10], constant scaling of up to $c_t = 0.1$ along the time
axis and $c_f = 0.05$ along the frequency axis can be performed on an audio clip with
relatively preserved fidelity with respect to the original recording. Resilience to
static time- and pitch-scaling is obtained by performing multiple correlation tests as
follows:
(1) pointer = 0
(2) load a buffer with MCLT coefficients from L(1 + c_t) consecutive MCLT blocks starting from
    pointer (L denotes the WM length in MCLT blocks)
(3) for time_scaling = -c_t to +c_t with step wof'/2
(4)   for frequency_scaling = -c_f to +c_f with step wof'/2
(5)     correlate buffer with WM scaled according to time_scaling and frequency_scaling
(6) if WM found, pointer += L; else (pointer++; goto (2))
   In a typical implementation, for wof' = 0.02, in order to cover $c_t = 0.1$ and
$c_f = 0.05$ the WM detector computes 105 different correlation tests. The search step
along the time axis equals $\delta_T$, which is a parameter typically between 1 and 4
MCLT blocks. Note that the main incentive for providing such a synchronization
mechanism is the fact that the adversary cannot move away from
the selected constant time and frequency scaling by more than wof'/2 within the length
of the WM, as such a change would induce intolerable sound quality. If the attack
is within the assumed attack bounds, the described mechanism enables the detector
to conclude whether there is a WM in the audio clip based on the SS statistics
from Eqn.2, regardless of the presence of the attack.
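The inner search over constant scalings, steps (3) to (5) of the procedure, can be sketched in Python as follows. This is an illustration under assumptions rather than the detector itself: scaling is modeled by nearest-neighbor resampling of a chip pattern indexed as pattern[frequency][time], the sliding pointer of steps (1), (2), and (6) is omitted, and the names `scaling_grid`, `scale_pattern`, `correlate`, and `search` are hypothetical.

```python
import random


def scaling_grid(limit, step):
    """Symmetric grid -limit..+limit built from integer multiples of step."""
    n = int(round(limit / step))
    return [k * step for k in range(-n, n + 1)]


def scale_pattern(pattern, time_scaling, frequency_scaling):
    """Nearest-neighbor resampling of pattern[freq][time] by (1 + scaling)."""
    rows, cols = len(pattern), len(pattern[0])
    out = []
    for r in range(rows):
        src_r = int(r / (1.0 + frequency_scaling))
        row = []
        for c in range(cols):
            src_c = int(c / (1.0 + time_scaling))
            inside = src_r < rows and src_c < cols
            row.append(pattern[src_r][src_c] if inside else 0.0)
        out.append(row)
    return out


def correlate(buffer, pattern):
    """Normalized correlation: sum(b * p) / sum(p * p) over defined chips."""
    total, energy = 0.0, 0.0
    for brow, prow in zip(buffer, pattern):
        for b, p in zip(brow, prow):
            total += b * p
            energy += p * p
    return total / energy if energy else 0.0


def search(buffer, pattern, c_t, c_f, step, threshold=0.5):
    """Try every scaling pair; return (found, best Q, time scaling, freq scaling)."""
    best_q, best_ts, best_fs = float("-inf"), 0.0, 0.0
    for ts in scaling_grid(c_t, step):
        for fs in scaling_grid(c_f, step):
            q = correlate(buffer, scale_pattern(pattern, ts, fs))
            if q > best_q:
                best_q, best_ts, best_fs = q, ts, fs
    return best_q > threshold, best_q, best_ts, best_fs


if __name__ == "__main__":
    prg = random.Random("demo key")
    chips = [[prg.choice((-1.0, 1.0)) for _ in range(20)] for _ in range(12)]
    pattern = [[chips[r // 3][c // 3] for c in range(60)] for r in range(36)]
    # Attack: constant scaling taken from the grid, plus additive noise.
    true_ts = scaling_grid(0.1, 0.01)[16]
    true_fs = scaling_grid(0.05, 0.01)[3]
    noise = random.Random(7)
    attacked = [[v + noise.gauss(0.0, 1.0) for v in row]
                for row in scale_pattern(pattern, true_ts, true_fs)]
    print(search(attacked, pattern, c_t=0.1, c_f=0.05, step=0.01))
```

The chip pattern in the demo is replicated over 3 x 3 regions so that scalings falling between grid points still correlate reasonably well, mirroring the role of the redundant encoding from Step 1.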

Figure 2. An example of audibility of an SS WM when embedded in the frequency domain.
The black plot denotes the time-domain samples of a single MCLT block of the original
recording, while the grey line denotes the corresponding marked recording with audible
noise prior to the signal peak. Horizontal axis: 4096 time-domain samples of a single
MCLT block.

      SS WMs can be audible when embedded in the MCLT domain even at low
magnitudes (e.g. $A = 1$ dB). This can happen in MCLT blocks where a certain part of
the block (up to 10 ms) is quiet whereas the remainder of the MCLT block is rich in
audio energy. Since the SS sequence spreads over the entire MCLT block, it can
cause audible noise in the quiet portion of the MCLT block (see Figure 2).
     To alleviate this problem, we detect MCLT blocks with dynamic content according
to empirically determined criteria and do not embed the WM in
them. Fortunately, such blocks do not occur often in audio content; on a large
benchmark set we identified up to $\xi < 5\%$ of MCLT blocks per WM as a potential
hazard for audibility. By not marking these blocks, the corresponding correlation is
bound to a lower expected value $\rho = 1 - \xi$ (Eqn.2), which has only a minor effect
on the detector's decision. The detection of dynamic content is performed on an MCLT
block of $T = 4096$ samples using the following algorithm:
(1) Compute the energy E(a,b) of the signal y(a,b) in each of the following 15 subintervals:
     y(iT/8, (i+1)T/8), i = 0..7, and y((2i+1)T/16, (2i+3)T/16), i = 0..6.
(2) If there exists a subinterval with $E(a,b) \le x_0$ and $E(0,T) - E(a,b) \ge x_1$, do not WM
     the MCLT block; $x_0$ and $x_1$ are empirically determined parameters.
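The two steps above can be sketched in Python as follows. This is a minimal illustration, assuming that energy is the sum of squared time-domain samples and that the criterion flags a block when some subinterval is quiet (energy at most x0) while the rest of the block is energetic (remaining energy at least x1). The function names and the threshold values used in the demo are hypothetical; the text only states that the parameters are determined empirically.

```python
def energy(samples, a, b):
    """Energy of the samples in [a, b), taken here as the sum of squares."""
    return sum(s * s for s in samples[a:b])


def subintervals(T):
    """The 15 windows: eight of length T/8 plus seven half-overlapping ones."""
    aligned = [(i * T // 8, (i + 1) * T // 8) for i in range(8)]
    shifted = [((2 * i + 1) * T // 16, (2 * i + 3) * T // 16) for i in range(7)]
    return aligned + shifted


def should_skip_block(samples, x0, x1):
    """True if the block has a quiet window next to energetic content."""
    T = len(samples)
    total = energy(samples, 0, T)
    for a, b in subintervals(T):
        e = energy(samples, a, b)
        if e <= x0 and total - e >= x1:
            return True
    return False


if __name__ == "__main__":
    T = 4096
    # Quiet first eighth followed by loud content: a candidate for skipping.
    dynamic_block = [0.0] * (T // 8) + [1.0] * (T - T // 8)
    steady_block = [1.0] * T
    print(should_skip_block(dynamic_block, x0=1e-3, x1=1.0))
    print(should_skip_block(steady_block, x0=1e-3, x1=1.0))
```

The half-overlapping windows catch quiet passages that straddle the boundary between two aligned windows.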

     We have designed an audio marking system using the techniques described in
Section 3. A reference implementation of our technology on an x86 platform requires
32 KB of memory for code and 100 KB for the data buffer. WMs are searched assuming
a maximum wof = 0.02, which results in ~50 tests per search point. Real-time
WM detection under these circumstances requires about 15 MIPS. WM encoding is
an order of magnitude faster, with a smaller memory footprint.
    While image watermarking techniques can be tested with the StirMark tool [7], a
similar benchmark has not been developed to date for audio. Thus, we have tested
our proposed watermarking technology using a composition of functions from
common PC-based sound editing tools (reverb, echo, denoising, filtering, wow-and-flutter,
time and pitch scaling, etc.) and malicious attacks, including all tests defined
by the SDMI [10]. In a benchmark dataset (jazz, classical, rock, instrument solos),
there were no errors, and we estimated the error probability (false alarm and misdetection)
to be below       per clip. The estimations were based on Eqn.3, the resulting
$\sigma_x$ and $N$ for each WM, and the offsets to Eqn.3 caused by not
watermarking certain MCLT blocks. Most importantly, error probabilities decrease
exponentially fast with the increase of WM length (see Eqn.3), so it is easy to design
a system with error probabilities below lo-" per single correlation test.
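The dependence of the error probabilities on WM length can be explored numerically with a short Python sketch. It assumes the Gaussian model of Eqn.2, under which the tail probability of the normalized correlation is a complementary error function; the parameter values in the demo (A = 1, sigma_x = 5, T = 0.5) and the function names are assumptions chosen for illustration, not values reported here.

```python
import math


def _scale(amplitude, sigma_x, n):
    """A * sqrt(N) / (sigma_x * sqrt(2)), the erfc argument per unit of Q."""
    return amplitude * math.sqrt(n) / (sigma_x * math.sqrt(2.0))


def false_alarm(threshold, amplitude, sigma_x, n):
    """Pr[Q > T] when no WM is present (rho = 0)."""
    return 0.5 * math.erfc(threshold * _scale(amplitude, sigma_x, n))


def misdetection(threshold, amplitude, sigma_x, n, rho=1.0):
    """Pr[Q <= T] when the WM is present with expected correlation rho.

    Passing rho = 1 - xi models the offset caused by unmarked blocks.
    """
    return 0.5 * math.erfc((rho - threshold) * _scale(amplitude, sigma_x, n))


if __name__ == "__main__":
    for n in (1000, 10000, 100000):
        fa = false_alarm(0.5, 1.0, 5.0, n)
        md = misdetection(0.5, 1.0, 5.0, n, rho=0.95)
        print(f"N = {n:6d}  false alarm = {fa:.3e}  misdetection = {md:.3e}")
```

Writing the misdetection probability directly as an erfc term, rather than as one minus the detection probability, avoids losing precision when the probabilities become very small.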
     Although the technology is robust with respect to signal processing attacks that
may leverage knowledge of the algorithms, it is still arguable whether it can
be used in all copyright protection scenarios for audio signals. We discuss the limitations
of our technology with respect to the three main applications of watermarking:
CONTENT SCREENING. Under this scenario, the copyright owner marks the
original content $\vec{x}$ with a WM $\vec{w}$ and distributes the marked content
$\vec{y}$ on the Internet without restriction. A client downloads the marked content and tries
to play it on a computing device. The media player first tries to find a WM in the content,
and if it succeeds, the media player verifies whether the user has completed the required
e-commerce transaction. Although such an application would enable a number of business
models for the copyright owners, it turns out that enabling such a protection system is
difficult because of the following requirements:
   All-platform standard. All audio players in the world need to perform content
   screening, because a platform which does not enforce screening would prevail in
   the market as the most appealing to consumers. As a consequence, all operating
   systems and hardware in the world need to prevent unauthorized players from
   talking to soundcard drivers. Global enforcement of content playing rules has already
   been a target of the industry through its involvement with the SDMI [10].
   Secret hiding at the client. For symmetric watermarking (e.g. SS and QIM), the
   secret hidden by the copyright owner must be present in the detector (i.e. client).
   Disclosure of the secret enables an adversary to recreate the original from the
   marked content. After breaking a single client, all other clients are enabled to
   play the content as unprotected. Since tamperproof operating systems and hardware
   are not likely to be built in the near future, this puts a strong bound on the
   applicability of watermarking. Even if such systems are built, the adversary can
   deduce the hidden secret using the inevitable sensitivity attack (of complexity
   O(N)) without breaking the detector [8]. One way of solving this problem is to
   develop asymmetric watermarking mechanisms, where the detection (public) key
   is client-specific and does not reveal any information about the encoding secret
   (private key), yet is able to detect the embedded private secret in marked media.

  Media collusion. The adversary can perform media averaging to extract the encoded
  secret. Successful launch of such an attack assumes that the attacker: (i) has a
  large amount of media marked with the same secret, $\vec{y}_i = \vec{x}_i + \vec{w}$, $i = 1..c$, and (ii)
  knows the exact location of the WM in each instance. The optimal estimate of
  $\vec{w}$ is computed as $\hat{w} = \mathrm{sign}(\sum_{i=1}^{c} \vec{y}_i)$, which yields an estimation error probability:
      $\varepsilon = \frac{1}{2}\,\mathrm{erfc}\left(\frac{A \sqrt{c}}{\sigma_y \sqrt{2}}\right)$

    For the considered averaged MCLT coefficients, $\sigma_y$ typically ranges within [6, 15]. To
    create a 90% estimate of the WM, the adversary needs to collude between 60 and
    340 (depending on $\sigma_y$) different audio clips. One approach to solving this problem
    is to create a unique WM for each media clip. In this case, the PRG used to
    generate the SS sequence would be seeded using a unique ID (hash) for each song
    and all its modifications that stay within the bounds of perceptual similarity.
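The media collusion attack can be illustrated with a short Python sketch. It is a toy under explicit assumptions: host coefficients are independent Gaussians with standard deviation sigma_y, every clip carries the same chips at amplitude A = 1, and the per-chip error after averaging c clips is modeled as 0.5 * erfc(A * sqrt(c) / (sigma_y * sqrt(2))). The function names are hypothetical. Under this assumed model, a 90% estimate at sigma_y = 6 needs about 60 clips, in line with the lower figure quoted above; the model is approximate and is not claimed to reproduce the upper figure.

```python
import math
import random


def estimate_watermark(marked_clips):
    """Sign of the coefficient-wise sum over all colluded clips."""
    n = len(marked_clips[0])
    sums = [sum(clip[k] for clip in marked_clips) for k in range(n)]
    return [1.0 if s >= 0 else -1.0 for s in sums]


def chip_error_probability(c, amplitude, sigma_y):
    """Per-chip estimation error under the Gaussian host assumption."""
    return 0.5 * math.erfc(amplitude * math.sqrt(c) / (sigma_y * math.sqrt(2.0)))


def clips_needed(target_error, amplitude, sigma_y):
    """Smallest number of clips whose modeled chip error is <= target_error."""
    c = 1
    while chip_error_probability(c, amplitude, sigma_y) > target_error:
        c += 1
    return c


if __name__ == "__main__":
    rng = random.Random(11)
    n, c, sigma_y = 500, 60, 6.0
    w = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    clips = [[rng.gauss(0.0, sigma_y) + wk for wk in w] for _ in range(c)]
    estimate = estimate_watermark(clips)
    accuracy = sum(1 for e, wk in zip(estimate, w) if e == wk) / n
    print("fraction of chips recovered:", accuracy)
    print("clips needed for a 90% estimate:", clips_needed(0.1, 1.0, sigma_y))
```

Seeding the PRG with a per-clip identifier, as suggested above, defeats this attack because the chips in different clips no longer add up coherently.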
TRACING UNLICENSED DISTRIBUTION. Secret hiding at the client is not a
requirement if watermarking is used as a copyright verification tool. In this scenario,
the user records an audio clip, marks it with her secret WM, and distributes it
using traditional channels. The user also has a search engine that searches for all
possible broadcasts or posts (on the Internet or radio stations) of the owned recorded
content. In general, the hidden secret can be used as a proof of authorship. For this
application, it is important that the user deploys solutions that would prevent the
media collusion attack. In this case, the root of the security of the system is the ability
of the watermarking scheme to sustain all possible signal processing attacks
which may use the information about the marking and detection algorithms but
which cannot have access to the WM decoder and its output.
ROBUST METADATA. Many applications (not necessarily copyright enforcement)
can benefit from having a method for encoding and decoding robust metadata
in an audio clip. Due to file conversion, compression, and other drastic format
changes, it is important that the metadata remains in the audio clip intact as its integral
part. As the removal of the metadata does not benefit any entity, the WMs in
this case need to be robust only to common sound editing procedures [10].
We thank Dr. Fabien A.P. Petitcolas for suggestions that helped improve this paper.
[1]  Katzenbeisser, S., Petitcolas, F.A.P. (eds.): Information Hiding Techniques for Steganography
     and Digital Watermarking. Artech House, Boston (2000).
[2]  Cox, I.J., Kilian, J., Leighton, T., Shamoon, T.: A secure, robust watermark for multimedia. In:
     Information Hiding Workshop, Cambridge, UK (1996).
[3]  Swanson, M.D., Zhu, B., Tewfik, A.H., Boney, L.: Robust audio watermarking using perceptual
     masking. Signal Processing 66 (1998) 337-355.
[4]  Chen, B., Wornell, G.W.: Digital watermarking and information embedding using dither
     modulation. In: Workshop on Multimedia Signal Processing, IEEE (1998) 273-278.
[5]  Su, J.K., Girod, B.: Power-spectrum condition for energy-efficient watermarking. In: Int. Conf.
     Image Processing, IEEE (1999).
[6]  Kirovski, D., Malvar, H.: Robust Spread-Spectrum Audio Watermarking. In: ICASSP, IEEE (2001).
[7]  Anderson, R.J., Petitcolas, F.A.P.: On the limits of steganography. IEEE J. Selected Areas in
     Communications 16 (1998) 474-481.
[8]  Linnartz, J.P., van Dijk, M.: Analysis of the sensitivity attack against electronic watermarks in
     images. In: Information Hiding Workshop (1998).
[9]  Malvar, H.S.: MCLT and its application to audio processing. In: ICASSP, IEEE (1999).
[10] The Secure Digital Music Initiative. Call for Proposals, Phase I. Website at http://www.sdmi.org.
[11] Malvar, H.S.: Auditory masking in audio compression. In: Greenebaum, K. (ed.): Audio
     Anecdotes. Kluwer, New York (2000).