Audio Coding
   S. R. M. Prasanna

     Dept of ECE,
     IIT Guwahati,

                          Audio Coding – p. 1/4
Acoustics: Study of sounds
Psychoacoustics: Study of perception of sounds
Deals with characterizing human auditory perception
Particularly the time-frequency analysis capabilities of the inner ear
Audio coders achieve significant compression by
exploiting the property that perceptually irrelevant
information cannot be heard
Perceptually irrelevant information is identified by
incorporating several psychoacoustic principles

 Human Speech Perception

Figure 1: Cross Section of Human Ear
      Functions of Human Ear
Mainly three regions - outer ear, middle ear & inner ear
Outer ear - directs speech pressure variations towards
the middle ear
Middle ear - transforms pressure variations into
mechanical motion
Inner ear - converts mechanical vibrations into electrical
firings in the auditory neurons, which lead to the brain
Language decoding and message understanding take place at the
higher centers of learning in the brain, which are less understood

            Inner Ear

Figure 2: Figures Related to Inner Ear
Frequency to Place Transformation
 Sound waves to mechanical vibrations by middle ear
 Mechanical vibrations to traveling waves by inner ear
 along the length of basilar membrane
 Neural receptors are connected along the length of the
 basilar membrane
 Traveling waves generate peak responses at frequency
 specific membrane positions
 Therefore different neural receptors are effectively
 tuned to different frequency bands according to their locations

  Freq. to Place Tfmn. (contd.)
For sinusoidal stimuli, the peak response occurs near
the basilar membrane region with a resonant freq.
equal to input sinusoid freq.
Location of peak is characteristic place for the stimulus
Freq. that best excites a particular place is
characteristic frequency
Thus a frequency to place transformation takes place

  Signal Processing Perspective
Bank of highly overlapping band pass filters
Magnitude responses are asymmetric
Bandwidths increase with frequency

   Sound Pressure Level (SPL)
A std. metric that quantifies the intensity of an
acoustical stimulus
SPL gives the level (intensity) of sound pressure in dBs
relative to an internationally defined ref. level
$L_{SPL} = 20\log_{10}(p/p_0)$ (dB)
where $L_{SPL}$ is the SPL of a stimulus, $p$ is the sound
pressure of the stimulus in pascals and $p_0$ is the std. ref. level of
20 µPa
About 150 dB SPL spans the dynamic range of intensity
for human auditory system
Min value is the limit of detection for low intensity (quiet)
Max value is the threshold of pain for high intensity
(loud) stimuli
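The SPL definition on this slide can be sketched in a few lines. This is only an illustration: the function name `spl_db` and the constant name `P0` are chosen here and do not come from the slides.

```python
import math

P0 = 20e-6  # standard reference pressure p0 = 20 micropascals


def spl_db(p_pascal):
    """Sound pressure level in dB SPL for a sound pressure given in pascals."""
    return 20.0 * math.log10(p_pascal / P0)


# A pressure equal to the reference level sits at 0 dB SPL,
# and every tenfold increase in pressure adds 20 dB.
level_at_reference = spl_db(20e-6)
level_at_one_pascal = spl_db(1.0)
```

With this definition, a pressure of 1 Pa comes out at roughly 94 dB SPL.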
Absolute Threshold for Hearing (ATH)
   Amount of energy needed in a pure tone such that it can
   be detected by a listener in a noiseless environment
   ATH is expressed in dB SPL
   ATH is a frequency-dependent parameter and is given by
   $T_q(f) = 3.64(f/1000)^{-0.8} - 6.5e^{-0.6(f/1000 - 3.3)^2} + 10^{-3}(f/1000)^4$
   (dB SPL)
   In the context of signal compression, Tq (f ) could be
   interpreted naively as a maximum allowable energy
   level for coding distortions introduced in the frequency
   domain (Fig 5.1 from Spanias book)
   Use of ATH to shape the coding distortion spectrum
   represents the first step towards perceptual coding.
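As a rough illustration of the ATH expression, the sketch below evaluates $T_q(f)$ for a frequency in Hz. The function name `ath_db_spl` is hypothetical; the constants are the ones in the formula above.

```python
import math


def ath_db_spl(f_hz):
    """Approximate absolute threshold of hearing (dB SPL) at frequency f_hz (Hz)."""
    khz = f_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)


# The curve is high at low frequencies, dips to its minimum near 3.3 kHz
# and rises again towards the top of the audio band.
thresholds = {f: ath_db_spl(f) for f in (100, 1000, 3300, 10000, 16000)}
```

Under the naive interpretation given on the slide, coding distortion at a frequency $f$ would be kept below `ath_db_spl(f)`.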
          ATH Diagram

Figure 3: Absolute Threshold for Hearing
           Critical Bands (CB)
Critical band is a function of frequency that quantifies
the cochlear filter passbands
CB tends to remain constant (about 100 Hz) up to 500
Hz and increases to approximately 20% of the center
frequency above 500 Hz
For an average listener the critical bandwidth is given
by $BW_c(f) = 25 + 75[1 + 1.4(f/1000)^2]^{0.69}$ (Hz)
The function
$Z_b(f) = 13\tan^{-1}(0.00076f) + 3.5\tan^{-1}((f/7500)^2)$ (Bark)
is often used to convert frequency in Hz to Bark scale
Nonuniform Hz spacing of the filter bank is actually
uniform on a Bark scale
One critical band (CB) comprises one Bark. (Table 5.1
and Fig. 5.4)
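The two expressions on this slide can be tried out with a short sketch. The function names are hypothetical, and the bandwidth formula is evaluated with the frequency expressed in kHz (that is, $f/1000$), which is the form that reproduces the roughly 100 Hz bandwidth at low frequencies.

```python
import math


def critical_bandwidth_hz(f_hz):
    """Critical bandwidth (Hz) around center frequency f_hz for an average listener."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69


def hz_to_bark(f_hz):
    """Map a frequency in Hz to the Bark scale."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))


# Bandwidth stays near 100 Hz at low frequencies and widens with frequency;
# the Bark value grows monotonically with frequency.
bandwidths = [critical_bandwidth_hz(f) for f in (100, 500, 1000, 5000)]
barks = [hz_to_bark(f) for f in (100, 500, 1000, 5000)]
```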
         Critical Bands

Figure 4: Table Showing Critical Bands
  Mapping from Hz to Bark

Figure 5: Mapping from Hz to Bark Scale
       Simultaneous Masking
Masking: One sound is rendered inaudible because of
the presence of another sound
Simultaneous masking: When two or more stimuli are
simultaneously presented to the auditory system
Freq. Domain: Relative shapes of the masker and
maskee magnitude spectra determine to what extent
presence of certain spectral energy will mask the
presence of other spectral energy
Time Domain: Phase relationships between stimuli can
also affect masking outcomes
In simple words, the presence of a strong noise or tone
masker creates an excitation of sufficient strength on
the basilar membrane at the critical band location to
effectively block detection of a weaker (maskee) signal.
Types of Simultaneous Masking
Noise-Masking-Tone (NMT), Tone-Masking-Noise
(TMN) and Noise-Masking-Noise (NMN)
  A NB noise (1 Bark) masks a tone within the same
  CB, provided intensity of masked tone is below a
  predictable threshold
  Signal-to-Mask Ratio (SMR) (dB) is the difference
  between the intensities of masker and maskee
  Min. SMR at the threshold of detection occurs when
  maskee freq is close to center freq of masker and
  will be about 5 dB

            TMN and NMN
  Pure tone at the center of a CB masks noise of any
  subcritical BW, provided noise spectrum is below a
  predictable threshold
  Min. SMR lies between 21 and 28 dB
  A NB noise masks another NB noise
  Min. SMR is about 26 dB

 Masking Schemes

Figure 6: Masking schemes
       Asymmetry of Masking
The NMT and TMN show asymmetry in masking power
between noise masker and tone masker
Although both maskers are at the same dB SPL, the associated
threshold SMRs differ by about 20 dB
Hence the interest in all types of masking
Knowledge of all three is critical to succeed in the task
of shaping coding distortion
For each temporal analysis interval, a codec’s
perceptual model should identify across the freq
spectrum noise-like and tone-like components within
both the audio signal and the coding distortion
Model should then apply appropriate masking
relationships to obtain global masking threshold

           Spread of Masking
Simultaneous masking is not bandlimited to within the
boundaries of a single CB
Interband masking also occurs, i.e., a masker centered
within one critical band has some predictable effect on
detection thresholds in other CBs.
This effect is known as spread of masking
A common model is a triangular spreading function that has slopes of +25
and -10 dB per Bark; a convenient analytical form is
$SF_{dB}(x) = 15.81 + 7.5(x + 0.474) - 17.5\sqrt{1 + (x + 0.474)^2}$
where $x$ is in Barks and $SF_{dB}(x)$ is expressed in dB.
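A small sketch can confirm that the analytical expression behaves like the triangular function described here. It uses the square-root form of the expression, $17.5\sqrt{1 + (x + 0.474)^2}$, and the function name `spreading_db` is hypothetical.

```python
import math


def spreading_db(x_bark):
    """Spreading function value (dB) at a distance of x_bark Barks from the masker."""
    s = x_bark + 0.474
    return 15.81 + 7.5 * s - 17.5 * math.sqrt(1.0 + s * s)


# Far from the masker the curve approaches straight lines:
# about -10 dB per Bark above the masker and about +25 dB per Bark below it.
upper_slope = spreading_db(11.0) - spreading_db(10.0)
lower_slope = spreading_db(-10.0) - spreading_db(-11.0)
peak_value = spreading_db(0.0)
```

The value at $x = 0$ is very close to 0 dB, so the masker's own band is left essentially unattenuated.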

Just Noticeable Distortion (JND)
Global masking threshold comprises an estimate of the
level at which quantization noise becomes just noticeable
Hence global masking threshold is sometimes referred
to as JND

    Nonsimultaneous Masking
Also termed temporal masking
Masking phenomenon extends beyond window of
simultaneous stimulus presentation
Masking occurs both prior to masker onset and also
after masker removal
Forward (post) and backward (pre) masking are the two types


                    Figure 7: Temporal Masking

          Perceptual Entropy
Entropy gives min. no. of bits/sample required to store
or transmit given message block
Johnston combined the notion of psychoacoustic masking
with signal quantization principles to define Perceptual
Entropy (PE).
Perceptual Entropy gives min. no. of bits/sample
required to store or transmit perceptually relevant
information in given audio message block.
While discussing PE, conventional entropy is termed as
statistical entropy.
Statistical entropy employs the statistical properties of
the signal for computing entropy
Perceptual entropy employs both statistical and
perceptual properties of signal for computing entropy.
                Basis for PE
Masking threshold indicates the amount of quantization noise that
can be injected in the freq. domain without perceptually corrupting the signal.
Assume that step size and no. of levels in the quantizer
for each spectral line could be set independently.
Further, the step size is chosen such that the total noise
injected at each frequency corresponds to the masking
threshold, i.e., the min. no. of quantization levels is used.
Then no. of bits required to encode entire transform
represents min. no. of bits necessary to transmit that
block of the signal.
The total number of bits divided by the no. of samples in
the transform represents per-sample rate.
This per-sample bit rate is Perceptual Entropy of signal.

                   PE v/s SE
Statistical entropy (SE) exploits signal statistics
Perceptual entropy (PE) exploits signal statistics and
also psychoacoustic masking
PE uses just enough quantization levels to avoid perceptible
distortion due to quantization, by exploiting masking

    Steps for PE Computation
DFT computation
Finding Masking thresholds
Calculating no. of bits to quantize DFT spectrum

           DFT Computation
Windowing and frequency transformation
2048-sample DFT computed by FFT
The first 1024 spectral lines are considered for further analysis
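The windowing and transform step might look like the sketch below. Two things here are assumptions rather than details from the slides: the Hann window (the slide does not name the window) and the naive $O(N^2)$ DFT, used only to keep the example free of external libraries. A real implementation would use an FFT on 2048-sample frames.

```python
import cmath
import math


def windowed_half_spectrum(frame):
    """Hann-window a frame and return the first N/2 DFT lines (naive DFT)."""
    n = len(frame)
    # Hann window: an assumed choice, not stated on the slide.
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * t / n))
                for t, x in enumerate(frame)]
    half = []
    for k in range(n // 2):
        acc = 0j
        for t, x in enumerate(windowed):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        half.append(acc)
    return half


# Demonstration on a 64-sample frame (instead of 2048) holding a sinusoid
# that falls exactly on DFT bin 8.
frame = [math.sin(2.0 * math.pi * 8 * t / 64) for t in range(64)]
spectrum = windowed_half_spectrum(frame)
strongest_bin = max(range(len(spectrum)), key=lambda k: abs(spectrum[k]))
```

Only half of the lines are kept because the spectrum of a real-valued frame is conjugate symmetric, which is why 1024 of the 2048 lines carry all the information.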

Calculation of Masking Threshold
Critical band analysis
Applying spreading function to critical band spectrum
Calculating Masking Thresholds
Accounting for absolute thresholds
Relating spread masking threshold to critical band
masking threshold

        Critical Band Analysis
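DFT spectrum is complex: $S(\omega) = Re(\omega) + j\,Im(\omega)$
Power Spectrum: $P(\omega) = Re^2(\omega) + Im^2(\omega)$
$P(\omega)$ is partitioned into CBs
Energy in each CB: $B_i = \sum_{\omega=bl_i}^{bh_i} P(\omega)$, where $bl_i$ and $bh_i$ are the lower and upper boundaries of CB $i$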
Bi represents CB spectrum
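The critical band analysis can be sketched as follows. The representation of band boundaries as inclusive `(bl_i, bh_i)` index pairs is an assumption made for this example, and the function names are hypothetical.

```python
def power_spectrum(re, im):
    """P(w) = Re^2(w) + Im^2(w) for each spectral line."""
    return [r * r + i * i for r, i in zip(re, im)]


def band_energies(power, band_edges):
    """Sum P(w) over each critical band.

    band_edges is a list of (bl_i, bh_i) pairs of inclusive bin indices
    (an assumed layout; real boundaries come from a critical band table).
    """
    return [sum(power[lo:hi + 1]) for lo, hi in band_edges]


# Toy example: four spectral lines split into two bands of two lines each.
p = power_spectrum([3.0, 0.0, 1.0, 2.0], [4.0, 1.0, 1.0, 0.0])
b = band_energies(p, [(0, 1), (2, 3)])
```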

      Spreading Function (SF)
CB spectrum threshold is also influenced by adjacent
CBs which is accounted using SF.
SF is used to estimate effects of masking across CBs
SF is calculated for $\mathrm{abs}(j - i) \le 25$, where $i$ is the Bark freq.
of the masked signal and $j$ is the Bark freq. of the masking signal, and placed into
a matrix $S_{ij}$
Spread CB Spectrum: $C_i = S_{ij} * B_i$, i.e., $C_i = \sum_j S_{ij} B_j$
Effect of spreading function is to spread peaks in $B_i$ and
also raise threshold values, especially at higher frequencies
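One way to picture the spreading step is as a matrix-vector product between the spreading matrix and the critical band energies. The sketch below makes several assumptions: bands are one Bark apart, the spreading function is the analytical form with the square root, the distance is taken as masked index minus masker index, and the dB values are converted to power ratios before weighting. All names are hypothetical.

```python
import math


def spreading_db(x_bark):
    """Analytical spreading function in dB (x in Barks, masked minus masker)."""
    s = x_bark + 0.474
    return 15.81 + 7.5 * s - 17.5 * math.sqrt(1.0 + s * s)


def spread_spectrum(band_energy):
    """C_i = sum over j of S_ij * B_j, with S_ij derived from spreading_db(i - j)."""
    n = len(band_energy)
    spread = []
    for i in range(n):          # i: band being masked
        total = 0.0
        for j in range(n):      # j: masking band
            gain = 10.0 ** (spreading_db(i - j) / 10.0)  # dB to power ratio
            total += gain * band_energy[j]
        spread.append(total)
    return spread


# A single energetic band leaks more energy upward than downward.
c = spread_spectrum([0.0, 0.0, 1.0, 0.0, 0.0])
```

With the shallower upper slope, the band just above the masker receives more spread energy than the band just below it.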

         Masking Thresholds
TMN is estimated as $14.5 + i$ dB below $C_i$, where $i$ is the Bark frequency
NMT is estimated as 5.5 dB below $C_i$ uniformly across
the CB spectrum

Tone Like and Noise Like Components
  Spect. Flatness Measure: $SFM = GM/AM$
  GM is the geometric mean of $P(\omega)$ and AM is the arithmetic mean
  of $P(\omega)$
  $SFM_{dB} = 10\log_{10}(GM/AM)$
  Coeff. of tonality: $\alpha = \min(SFM_{dB}/SFM_{dBmax}, 1)$
  $SFM_{dBmax} = -60$ dB is used to estimate tonality
  $SFM_{dB} = 0$ dB indicates a completely noise-like signal ($\alpha = 0$)
  $SFM_{dB} = -30$ dB indicates $\alpha = 0.5$
  $SFM_{dB} = -75$ dB indicates $\alpha = 1.0$
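The tonality estimate can be checked numerically with a short sketch. The function names are hypothetical, and the power spectrum values are assumed to be strictly positive so that the geometric mean is defined.

```python
import math

SFM_DB_MAX = -60.0  # SFM value (dB) treated as fully tonal


def sfm_db(power):
    """Spectral flatness measure in dB: 10*log10(geometric mean / arithmetic mean)."""
    n = len(power)
    geometric_mean = math.exp(sum(math.log(p) for p in power) / n)
    arithmetic_mean = sum(power) / n
    return 10.0 * math.log10(geometric_mean / arithmetic_mean)


def tonality_from_sfm(sfm_db_value):
    """Coefficient of tonality alpha = min(SFM_dB / SFM_dBmax, 1)."""
    return min(sfm_db_value / SFM_DB_MAX, 1.0)


# A flat spectrum is noise-like (alpha near 0); one dominant line is tone-like.
alpha_flat = tonality_from_sfm(sfm_db([2.0] * 8))
alpha_peaky = tonality_from_sfm(sfm_db([1e6] + [1e-6] * 99))
```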

    Offset for Masking Energy
$O_i = \alpha(14.5 + i) + (1 - \alpha)\,5.5$ (dB), in each band $i$
Index $\alpha$ is used to geometrically weight the two threshold offsets
$O_i$ is then subtracted from $C_i$ (in the log domain) to yield the spread threshold
estimate $T_i = 10^{\log_{10}(C_i) - O_i/10}$
Since the spreading function does not have normalized
gain, $T_i$ is normalized by the DC gain for each CB
After normalization, bark thresholds are compared to
absolute thresholds.
Any CB that has bark threshold lower than absolute
threshold is changed to the absolute threshold
This will be the threshold used for computing bit rate.
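The offset and threshold steps on this slide can be sketched as below. The function names are hypothetical, and the normalization by the DC gain is left out because the slides do not give its value.

```python
import math


def masking_offset_db(alpha, band_index):
    """O_i = alpha*(14.5 + i) + (1 - alpha)*5.5, in dB."""
    return alpha * (14.5 + band_index) + (1.0 - alpha) * 5.5


def spread_threshold(c_i, offset_db):
    """T_i = 10^(log10(C_i) - O_i/10), in the power domain."""
    return 10.0 ** (math.log10(c_i) - offset_db / 10.0)


def apply_absolute_threshold(t_i, abs_threshold_i):
    """Raise any band threshold that falls below the absolute threshold."""
    return max(t_i, abs_threshold_i)


# Band 7 with a fully tonal signal (alpha = 1) gets the larger TMN offset.
offset = masking_offset_db(1.0, 7)
threshold = apply_absolute_threshold(spread_threshold(100.0, offset), 0.5)
```

A noise-like band ($\alpha = 0$) always gets the 5.5 dB offset, while a tonal band gets an offset that grows with the band index.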

        Calculation of Bit Rate
No. of quantization levels to follow signal in freq domain
$T_i$ is in the power domain
Quantization energy must be spread across ki spectral
lines in each CB
Assuming the noise spreads equally across the entire
band, the noise energy for a step size $\delta$ will be $\delta^2/12$
Energy at each spectral freq. = $T_i/k_i$
Real and imaginary parts are quantized independently, so the energy per part
= $T_i/(2k_i)$
$\delta^2/12 = T_i/(2k_i) \implies \delta = T_i' = \sqrt{6T_i/k_i}$
$T_i'$ is the step size.

               Computing PE
$N_{Re}(\omega) = \mathrm{abs}(\mathrm{nint}(Re(\omega)/T_i'))$ and
$N_{Im}(\omega) = \mathrm{abs}(\mathrm{nint}(Im(\omega)/T_i'))$ for each $\omega$ within CB $i$.
Let $N_*$ (with $*$ standing for Re or Im) represent the actual (integer) quantized value of
each line
If $N_*(\omega) = 0$, then $N_*'(\omega) = 0$
If $N_*(\omega) \ne 0$, then $N_*'(\omega) = \log_2(2N_*(\omega) + 1)$
This operation assigns a bit rate of zero bits to any
signal with an amplitude that does not need to be
quantized and assigns a bit rate of $\log_2$(no. of levels) to
those that must be quantized.
Total bit rate = $\sum_{\omega=0}^{\pi} (N_{Re}'(\omega) + N_{Im}'(\omega))$
Rate per sample, PE = Total bit rate / 2048
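Putting the step size and the bit count together, a PE estimate might be sketched as follows. The function names and the per-line list of step sizes are assumptions made for this example. Python's `round` stands in for nint and rounds exact halves to the nearest even integer, which may differ from other nint conventions.

```python
import math


def step_size(t_i, k_i):
    """Quantizer step size T_i' = sqrt(6*T_i/k_i) for a band with k_i spectral lines."""
    return math.sqrt(6.0 * t_i / k_i)


def line_bits(value, step):
    """Bits for one real or imaginary part: 0 if it quantizes to zero, else log2(2N + 1)."""
    n = abs(round(value / step))
    return 0.0 if n == 0 else math.log2(2 * n + 1)


def perceptual_entropy(re, im, steps, block_len=2048):
    """Total bits over all lines divided by the block length (bits/sample).

    steps[w] is the step size of the band that line w belongs to.
    """
    total = sum(line_bits(r, s) + line_bits(x, s)
                for r, x, s in zip(re, im, steps))
    return total / block_len


# Toy block of two spectral lines with unit step size.
pe = perceptual_entropy([3.2, 0.4], [0.0, -1.0], [1.0, 1.0], block_len=2)
```

In the toy block, only the parts that quantize to a nonzero integer contribute bits; the small parts cost nothing.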
Example codec perceptual model
ISO/IEC 11172-3 (MPEG-1) Psychoacoustic Model-1
Determines max. allowable quantization noise energy
in each CB such that it remains inaudible.
Blocking i/p audio into frames
High resolution spectral computation for each frame
For each frame tonal and noise maskers estimation
Decimation and reorganization of maskers
Calculation of individual masking thresholds for
components in each CB
Calculation of global masking thresholds for each CB

                             Spectral Analysis
            512 point DFT computation
            Power Spectral Density (PSD) P (k) estimation, where
            k = 1, 2, . . . , 512

Figure: PSD estimate, SPL (dB) versus Frequency (Hz)

Identn. of Tonal and Noise Maskers
 P (k) where k = 1, 2, . . . , 256 are considered
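 Local maxima in the PSD that exceed neighboring components within a
 certain Bark distance by at least 7 dB are classified as tonal
 Tonal set $S_T$ is defined as

   $S_T = \{ P(k) \mid P(k) > P(k \pm 1),\ P(k) > P(k \pm \Delta_k) + 7\ \text{dB} \}$

 where
   $\Delta_k \in 2$ for $2 < k < 63$ (0.17 - 5.5 kHz)
   $\Delta_k \in [2, 3]$ for $63 \le k < 127$ (5.5 - 11 kHz)
   $\Delta_k \in [2, 6]$ for $127 \le k \le 256$ (11 - 20 kHz)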
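The tonal test can be sketched as a simple scan over the PSD. The sketch assumes that `psd_db[k]` holds the PSD in dB at bin `k`, that the ranges $[2, 3]$ and $[2, 6]$ mean all integer offsets in that range, and that peaks too close to the end of the list are skipped. The function names are hypothetical.

```python
def neighbor_offsets(k):
    """Offsets (delta_k values) checked around bin k, by frequency range."""
    if 2 < k < 63:
        return [2]
    if 63 <= k < 127:
        return [2, 3]
    if 127 <= k <= 256:
        return [2, 3, 4, 5, 6]
    return []


def find_tonal_peaks(psd_db):
    """Return the bins k whose PSD value qualifies as tonal."""
    peaks = []
    for k in range(len(psd_db)):
        offsets = neighbor_offsets(k)
        if not offsets or k + max(offsets) >= len(psd_db):
            continue
        # Must be a local maximum with respect to the adjacent bins.
        if not (psd_db[k] > psd_db[k - 1] and psd_db[k] > psd_db[k + 1]):
            continue
        # Must exceed the bins at +/- delta_k by at least 7 dB.
        if all(psd_db[k] > psd_db[k + d] + 7.0 and psd_db[k] > psd_db[k - d] + 7.0
               for d in offsets):
            peaks.append(k)
    return peaks


# An isolated 20 dB spike qualifies; a broad bump with strong neighbors does not.
spike = [0.0] * 40
spike[10] = 20.0
bump = [0.0] * 40
bump[18], bump[20], bump[22] = 15.0, 20.0, 15.0
```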

Tonal and Noise Maskers (contd.)
Tonal maskers PT M (k), are computed from spectral
peaks listed in ST :
         $P_{TM}(k) = 10\log_{10} \sum_{j=-1}^{1} 10^{0.1 P(k+j)}$ (dB)

For each neighborhood max., energy from three adjacent spectral
components centered at the peak is combined to form a single tonal masker
For each CB, a single noise masker $P_{NM}(\bar{k})$ is then computed
from the (remaining) spectral lines not within the $\pm\Delta_k$
neighborhood of a tonal masker using the sum
         $P_{NM}(\bar{k}) = 10\log_{10} \sum_{j} 10^{0.1 P(j)}$ (dB),
             $\forall P(j) \notin \{P_{TM}(k, k \pm 1, k \pm \Delta_k)\}$
 where $\bar{k}$ is the geometric mean spectral line of the CB
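Both masker levels are power sums expressed in dB, which a short sketch makes concrete. It assumes the tonal masker sums the peak and its two immediate neighbors ($j = -1, 0, 1$), in line with the three-component description above. The function names are hypothetical.

```python
import math


def power_sum_db(levels_db):
    """Combine levels given in dB by adding their powers; result in dB."""
    return 10.0 * math.log10(sum(10.0 ** (0.1 * level) for level in levels_db))


def tonal_masker_db(psd_db, k):
    """Tonal masker level at peak bin k from bins k-1, k and k+1."""
    return power_sum_db([psd_db[k + j] for j in (-1, 0, 1)])


def noise_masker_db(psd_db, noise_bins):
    """Noise masker level for one CB from its remaining (non-tonal) bins."""
    return power_sum_db([psd_db[j] for j in noise_bins])


# Three equal 60 dB components combine to about 64.8 dB.
level = tonal_masker_db([60.0, 60.0, 60.0], 1)
```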
       Decimation of Maskers
No. of maskers are reduced using two criteria
First, any tonal or noise maskers below the abs. threshold
are discarded, i.e., only maskers with $P_{TM,NM}(k) \ge T_q(k)$ are retained.
Next, a sliding 0.5 Bark-wide window is used to replace
any pair of maskers occurring within a distance of 0.5
Bark by the stronger of the two.
Masker freq. bins are reorganized using the decimation scheme

              $P_{TM,NM}(i) = P_{TM,NM}(k)$
                        $P_{TM,NM}(k) = 0$

          Decimation (contd.)

$i = k$ for $1 \le k \le 48$
$i = k + (k \bmod 2)$ for $49 \le k \le 96$
$i = k + 3 - ((k - 1) \bmod 4)$ for $97 \le k \le 232$

Net effect is 2 : 1 decimation of masker bins in CBs
4:1 decimation of masker bins in CBs 22-25,
with no loss of masking components.
Decimation reduces total no. of tone and noise masker
freq. bins under consideration from 256 to 106
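The index mapping can be checked with a few lines of code. The function name `decimated_bin` is hypothetical, and the sketch only covers the range $1 \le k \le 232$ given by the mapping.

```python
def decimated_bin(k):
    """Map original masker bin k to its decimated bin i."""
    if 1 <= k <= 48:
        return k                        # no decimation
    if 49 <= k <= 96:
        return k + (k % 2)              # 2:1 decimation
    if 97 <= k <= 232:
        return k + 3 - ((k - 1) % 4)    # 4:1 decimation
    raise ValueError("bin index outside the range covered by the mapping")


# Count how many distinct bins survive the mapping.
surviving_bins = {decimated_bin(k) for k in range(1, 233)}
```

Counting the distinct target bins gives 48 + 24 + 34 = 106, which agrees with the figure quoted on this slide.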

Individual Masking Thresholds
Using the decimated set of tonal and noise maskers,
individual tone and noise masking thresholds are computed
Each individual threshold represents a masking
contribution at freq. bin i due to the tone or noise
masker located at bin j
Tonal Masking Threshold, $T_{TM}(i, j)$, is given by
$T_{TM}(i, j) = P_{TM}(j) - 0.275\,Z_b(j) + SF(i, j) - 6.025$ (dB SPL)
where $P_{TM}(j)$ is the SPL of the tonal masker in freq. bin $j$,
$Z_b(j)$ is the Bark freq. of bin $j$ and $SF(i, j)$ is the spread of
masking from bin $j$ to bin $i$
Noise Masking Threshold, $T_{NM}(i, j)$, is given by
$T_{NM}(i, j) = P_{NM}(j) - 0.175\,Z_b(j) + SF(i, j) - 2.025$ (dB SPL)
where $P_{NM}(j)$ is the SPL of the noise masker in freq. bin $j$
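The two threshold expressions differ only in their constants, as the sketch below shows. The spreading term $SF(i, j)$ is passed in as a precomputed dB value because its definition for this model is not given on the slides. The function names are hypothetical and the tonal coefficient is taken as 0.275.

```python
def tonal_threshold_db(p_tm_db, z_j_bark, sf_db):
    """Threshold at bin i due to a tonal masker at bin j (dB SPL)."""
    return p_tm_db - 0.275 * z_j_bark + sf_db - 6.025


def noise_threshold_db(p_nm_db, z_j_bark, sf_db):
    """Threshold at bin i due to a noise masker at bin j (dB SPL)."""
    return p_nm_db - 0.175 * z_j_bark + sf_db - 2.025


# For the same masker level and position, the noise masker yields the
# higher threshold, consistent with the asymmetry of masking.
t_tonal = tonal_threshold_db(70.0, 10.0, 0.0)
t_noise = noise_threshold_db(70.0, 10.0, 0.0)
```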
   Global Masking Thresholds
Individual masking thresholds are combined to estimate
a global masking threshold for each freq. bin
$T_g(i) = 10\log_{10}\left(10^{0.1 T_q(i)} + \sum_{l=1}^{L} 10^{0.1 T_{TM}(i,l)} + \sum_{m=1}^{M} 10^{0.1 T_{NM}(i,m)}\right)$ (dB SPL)
where $L$ and $M$ are the
number of tonal and noise maskers, respectively.
Bits are allocated based on the global
masking thresholds; this is termed perceptual bit allocation
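The global threshold is a power-domain sum of the absolute threshold and all individual thresholds at a bin, converted back to dB. A minimal sketch, with hypothetical names:

```python
import math


def global_threshold_db(tq_db, tonal_db, noise_db):
    """Global masking threshold at one bin (dB SPL).

    tq_db: absolute threshold at the bin.
    tonal_db, noise_db: individual thresholds at the bin from each
    tonal and noise masker.
    """
    power = 10.0 ** (0.1 * tq_db)
    power += sum(10.0 ** (0.1 * t) for t in tonal_db)
    power += sum(10.0 ** (0.1 * t) for t in noise_db)
    return 10.0 * math.log10(power)


# Two equal 10 dB contributions add up to about 13 dB.
tg = global_threshold_db(10.0, [10.0], [])
```

Because the contributions are added as powers, the global threshold is never lower than the largest individual contribution.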

Expt. 5-AC- Audio Synthesis using MSE
   Problem No. 2.25 (p. 49) of Spanias book on Audio
   Signal Processing

 Expt. 6-AC- Audio Synthesis using Psychoacoustics

Problem No. 5.11 (p. 142) of Spanias book on Audio
Signal Processing
