

									Basics of Audio Signal Processing

             Sudhir K

                     Summary Slide

 Digital Representation of Audio
 Psycho-Acoustic principles
 Lossy Compression of Audio (MP3 and AAC)
 Lossless compression of Audio (general principles with example)

       Digital Representation of Audio
 PCM Data
      Sampling audio input at discrete intervals and quantizing into discrete
       number of evenly spaced levels.
      Sampling Frequency
      Bits per sample
      Number of Channels
      Interleaved and block format
 Audio CD
      44.1 kHz, 16 bits per sample, 2 channels; data rate is about 1.4 Mbit/s
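
The CD data rate follows directly from the three PCM parameters listed above. The sketch below is illustrative only; the function names are made up for this example, and 16 bits per sample is the standard CD word length.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Raw (uncompressed) PCM bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def interleave(left, right):
    """Interleaved stereo layout: L0 R0 L1 R1 ... (block format keeps channels apart)."""
    out = []
    for l, r in zip(left, right):
        out.extend((l, r))
    return out


cd_rate = pcm_bit_rate(44_100, 16, 2)
print(cd_rate)                        # 1411200 bits/s, about 1.4 Mbit/s
print(interleave([1, 2], [10, 20]))   # [1, 10, 2, 20]
```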

             [Figure: ADC -> digital audio data -> DAC -> speakers]

           Psycho-Acoustic Principles
 Sound Pressure Level
 Perceptual and Statistical redundancy
 Absolute Threshold of Hearing
 Critical Bands
 Masking in Time domain
 Masking in Frequency domain
 Perceptual Entropy
 Pre-echo Effect
 Psycho-Acoustic Model 1
 Psycho-Acoustic Model 2
 Filter Banks and Transforms

                       Sound Pressure Level

 Standard metric to quantify intensity of acoustical stimulus
 Measured in decibels (dB) relative to an internationally defined reference level
 $L_{SPL} = 20 \log_{10}(p / p_0)$ dB
 $L_{SPL}$ is the SPL of a stimulus with sound pressure $p$
 $p_0$ is the standard reference level of 20 µPa
 About 150 dB SPL is the dynamic range of the human
  auditory system
 140 dB SPL is typically the threshold of pain
 The human auditory system can hear frequencies
  ranging from 20 Hz to 20 kHz
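
As a quick numeric check of the SPL definition, L_SPL = 20 log10(p / p0) with p0 = 20 µPa, the sketch below converts between pressure and level. The function names are chosen for this example only.

```python
import math

P0_PASCAL = 20e-6  # standard reference pressure, 20 micropascal


def spl_db(pressure_pa):
    """Sound pressure level in dB SPL for an RMS pressure in pascal."""
    return 20.0 * math.log10(pressure_pa / P0_PASCAL)


def pressure_from_spl(level_db):
    """Inverse mapping: RMS pressure in pascal for a level in dB SPL."""
    return P0_PASCAL * 10.0 ** (level_db / 20.0)


print(round(spl_db(1.0), 2))      # 1 Pa is roughly 94 dB SPL
print(pressure_from_spl(140.0))   # pressure at the quoted threshold of pain
```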

        Absolute Threshold of Hearing
 Characterizes the amount of energy needed in a pure tone such that
  it can be detected by a listener in a noiseless environment
 This can be interpreted naively as a maximum allowable energy
  level for coding distortions introduced in frequency domain

 Note that the absolute threshold of hearing is a function of frequency
 Response of the human ear to a pure tone is dependent on the
  frequency of the tone
 Sensation Level (SL): intensity level difference for a stimulus relative to
  the listener’s detection threshold (quantifies the listener’s audibility)
 Components with equal SL can have different SPLs
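
The frequency dependence of the threshold in quiet is often approximated by a closed-form curve. The sketch below uses the widely quoted approximation (the one given in the Painter and Spanias tutorial); treat the constants as that approximation's, not as part of any codec standard.

```python
import math


def absolute_threshold_db_spl(freq_hz):
    """Approximate threshold in quiet (dB SPL) for a pure tone at freq_hz."""
    f = freq_hz / 1000.0  # the formula works in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)


for hz in (100, 1000, 3300, 16000):
    print(hz, round(absolute_threshold_db_spl(hz), 1))
```

The curve dips to its minimum around 3 to 4 kHz, where the ear is most sensitive, and rises steeply at both ends of the audible range.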

                              Human Ear Model
 Frequency-to-place transformation
     Sound wave moves the eardrum and
      attached bones
    The eardrum and the bones transfer
      mechanical vibrations to Cochlea
    Oval window of cochlear membrane
      induces traveling waves along length of
      basilar membrane.
    Traveling waves generate peak
      responses at frequency specific
      membrane positions
    Specific positions of membrane provide
      peak responses for specific frequency
 Cochlea can be considered as a set of highly
   overlapped band-pass filters.

                             Critical Bands
 Cochlea can be considered as a set of
  highly overlapped band-pass filters.
 Critical bandwidth is a function of
  frequency that quantifies the cochlear
  filter passbands
 Loudness (perceived intensity)
  remains the same as long as the noise energy
  is present within a critical band
 One Bark corresponds to a distance of
  one critical band
 Critical bandwidth tends to remain
  constant up to 500 Hz and then
  increases to about 20% of center frequency
  above 500 Hz
 [Figure: plot with horizontal axis Frequency (kHz), 0 to 20]
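
The Hz-to-Bark mapping and the critical bandwidth are commonly approximated with the Zwicker formulas. The sketch below uses those approximations; the function names are illustrative.

```python
import math


def hz_to_bark(freq_hz):
    """Approximate critical-band rate (Bark) for a frequency in Hz."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))


def critical_bandwidth_hz(center_hz):
    """Approximate critical bandwidth (Hz) around a center frequency in Hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (center_hz / 1000.0) ** 2) ** 0.69


for hz in (100, 500, 1000, 5000):
    print(hz, round(hz_to_bark(hz), 2), round(critical_bandwidth_hz(hz), 1))
```

At low frequencies the bandwidth stays near 100 Hz, and at 5 kHz it is close to a fifth of the center frequency, in line with the rule of thumb on the slide.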

                  Simultaneous Masking
 Process where one sound is rendered inaudible by presence of another sound
 Frequency domain masking



   Tone Masking Noise (TMN)
   Noise Masking Tone (NMT)
   Noise Masking Noise (NMN)
   In-band Phenomenon (occurs within same critical band)

                  Simultaneous Masking
 SMR (signal to mask ratio)
      smallest difference between intensity
       of masking signal and the intensity of
       masked signal
      SMR for NMN is 26dB, TMN is 24dB
       and NMT is 5dB
      Noise is a better masker than tone
 Spread of Masking
      Inter-band Masking
      Triangular spreading function

Temporal (Non-simultaneous) masking
   Masking in time-domain
   Pre-Masking : Masking occurs prior to the signal
   Post-Masking: Masking following the occurrence of signal
   Pre-masking is usually short (approx. 1-2 ms)
   Post-masking is of longer duration (50 to 300ms)

      Just Noticeable Difference (JND)
 Also called the global masking threshold
 Global masking threshold is a combination of individual masking
  thresholds (thresholds due to NMT and TMN, and the absolute threshold)
 Quantization noise should be kept below the JND to keep it
  inaudible
 [Figure: masking curve]

                   Perceptual Entropy
 Measure of perceptually relevant information
 Expressed in bits per sample
 Represents a theoretical limit on compressibility of a particular
  signal

                       Pre-Echo Effect
 Pre-echoes occur when a signal with sharp attack begins near end of
  a transform block immediately following a region of low energy
 Inverse quantization spreads the quantization noise evenly throughout
  the reconstructed block, so it becomes audible before the attack
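
The pre-echo effect can be reproduced with a toy transform coder. The sketch below is an illustration under simplifying assumptions: it uses a plain DCT on one 32-sample block (not the MDCT of a real codec) and a coarse uniform quantizer. The block is silent until an attack in its last four samples.

```python
import math


def dct(x):
    """Unnormalized DCT-II of one block."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]


def idct(c):
    """Inverse of dct() above (scaled DCT-III)."""
    n = len(c)
    return [(c[0] / 2.0 + sum(c[k] * math.cos(math.pi * (t + 0.5) * k / n)
                              for k in range(1, n))) * 2.0 / n
            for t in range(n)]


def code_block(block, step):
    """Transform, quantize coefficients uniformly with the given step, invert."""
    quantized = [round(c / step) * step for c in dct(block)]
    return idct(quantized)


ATTACK_AT = 28                                   # silence, then a sharp attack
block = [0.0] * ATTACK_AT + [1.0, -0.8, 0.6, -0.4]
decoded = code_block(block, step=0.5)
pre_echo = max(abs(v) for v in decoded[:ATTACK_AT])
print(pre_echo)   # non-zero: noise now appears before the attack
```

The original block is exactly zero before sample 28, so any non-zero value there in the decoded block is quantization noise that the inverse transform has spread backwards in time.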

                       Pre-Echo control
 Bit-reservoir
      Store surplus bits, which can be used during periods of attack
 Window Switching
      Switch between long and short time-window
      Short window for transients to minimize spread of noise.
      Long window for normal case to increase compression efficiency
 Gain Modification
      Smoothes transient peaks by changing gain of signal prior to
       spectral analysis
 Temporal Noise Shaping (TNS)
      Linear prediction on frequency domain spectrum
      Flattened residual and quantization noise.
      The quantization noise is shaped such that it follows the original
       signal envelope
                          Stereo coding
 MS-Stereo (Middle/Side Stereo)
      One channel to encode information identical between left and right
      One channel to encode differences between left and right channel
      Transmit sum and difference of the original left and right signals
 Intensity Stereo
      Lossy Coding technique
      Replace left and right channel with a single representative signal plus
       directional information
      Usually used only in higher frequencies (since human ear is less
       sensitive to signal phase at these frequencies)
      Used only at low bit-rates
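
The sum/difference idea behind MS-stereo can be sketched as follows. This is a minimal illustration; the 1/2 scaling is one common normalization, chosen here as an assumption, and the function names are made up.

```python
def ms_encode(left, right):
    """Mid carries what the channels share; side carries their difference."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Recover left and right from mid and side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


mid, side = ms_encode([1.0, 0.5], [0.0, 0.25])
print(mid, side)
print(ms_decode(mid, side))
```

When left and right are nearly identical, the side channel is close to zero and needs very few bits.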

               Psycho-Acoustic Model 1
1. Spectral analysis and SPL normalization
      Normalize input samples and segment into blocks
2. Identification of Tonal and Noise maskers
      Energy from 3 adjacent spectral components combined to form single
       tonal masker
      Energy of all other spectral lines not within a range of Δ combined to
       form noise masker
3. Decimation and reorganization of maskers
      Any tonal or noise masker below the absolute threshold is discarded
      Each adjacent pair of maskers is compared and replaced by the stronger
       of the two.
4. Calculation of individual Masking Thresholds
      Calculate thresholds due to tonal and noise maskers

      Psycho-Acoustic Model 1
Threshold due to tonal maskers

Threshold due to noise maskers

               Psycho-Acoustic Model 1
5. Calculation of global masking threshold
      Individual masking thresholds are combined to estimate the global
       masking threshold
      Assumes masking effects are additive
      Sum of absolute threshold of hearing, threshold from tonal masker and
       threshold from noise masker
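
One way to read "additive" is that the thresholds are summed as powers, not as dB values. The sketch below assumes that reading (it is the usual formulation of this model), with all inputs in dB SPL for a single frequency bin; the function name is illustrative.

```python
import math


def global_threshold_db(ath_db, tonal_thresholds_db, noise_thresholds_db):
    """Combine thresholds for one frequency bin by adding them in the power domain."""
    power = 10.0 ** (0.1 * ath_db)
    power += sum(10.0 ** (0.1 * t) for t in tonal_thresholds_db)
    power += sum(10.0 ** (0.1 * t) for t in noise_thresholds_db)
    return 10.0 * math.log10(power)


# Two equal 20 dB contributions add up to about 23 dB, not 40 dB.
print(round(global_threshold_db(20.0, [20.0], []), 2))
```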

            Filter Bank Characteristics
 Lossless (analysis and synthesis should be invertible)
 Aliasing errors should cancel for perfect or near-perfect
  reconstruction
 Low computational complexity
 Bandwidths should approximate the critical bands of the human ear.

QMF Filters

 Cosine Modulation of low-pass prototype filter to implement
  parallel M-channel filter banks with nearly perfect reconstruction
 Overall linear phase and hence constant group delay
 Complexity = one filter + modulation
 Critical sampling
 Analysis and synthesis filters satisfy mirror image conditions to
  eliminate phase distortion
 [Equation: analysis filter]
 [Equation: synthesis filter]

 MPEG-1 uses a 32-channel PQMF bank for spectral decomposition in
  Layer I and Layer II
                            MDCT (TDAC)

   De-correlate signal by mapping onto a set of orthogonal basis functions
   Lapped orthogonal block transform
   Successive transform blocks overlap each other
   Overall linear phase
   Forward MDCT
        50% Overlap between blocks
        Block transform of 2M samples and block advance of M samples
        Basis functions extend across 2 blocks (blocking artifacts elimination)
        Critically sampled M samples output for 2M input samples
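
The properties listed above (2M samples in, M coefficients out, 50% overlap, aliasing cancelled by overlap-add) can be checked with a direct, unoptimized MDCT. This is a sketch assuming a sine window, which satisfies the Princen-Bradley condition; real codecs use fast algorithms and their own window tables.

```python
import math


def sine_window(m):
    """Sine window of length 2M; w[n]^2 + w[n+M]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * m)) for n in range(2 * m)]


def mdct(block, window):
    """2M windowed input samples -> M coefficients."""
    m = len(block) // 2
    return [sum(window[n] * block[n]
                * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                for n in range(2 * m))
            for k in range(m)]


def imdct(coeffs, window):
    """M coefficients -> 2M windowed, time-aliased output samples."""
    m = len(coeffs)
    return [2.0 / m * window[n]
            * sum(coeffs[k] * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                  for k in range(m))
            for n in range(2 * m)]


def analyze_synthesize(signal, m):
    """Run MDCT/IMDCT with block advance M and overlap-add the outputs.

    len(signal) is assumed to be a multiple of m.
    """
    window = sine_window(m)
    padded = [0.0] * m + list(signal) + [0.0] * m  # every sample covered twice
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * m + 1, m):
        frame = imdct(mdct(padded[start:start + 2 * m], window), window)
        for n in range(2 * m):
            out[start + n] += frame[n]
    return out[m:m + len(signal)]


signal = [math.sin(0.3 * n) for n in range(32)]
rebuilt = analyze_synthesize(signal, 8)
print(max(abs(a - b) for a, b in zip(signal, rebuilt)))  # tiny rounding error
```

Each individual IMDCT output is aliased; only the sum of two overlapping blocks reproduces the input, which is the time-domain aliasing cancellation (TDAC) in the slide title.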

     Lossy Audio Compression techniques
 Decoded output is not bit-exact with original input
 Decoded output is perceptually same as original input
 More compression achieved
 Extensive use of psycho-acoustic model to discard perceptually
  irrelevant audio data
 Examples : MP3 and AAC

 [Figure: audio encoder block diagram: time-to-frequency filter bank,
  followed by bit allocation and quantization]
 [Figure: audio decoder block diagram]

 Usually the encoder is complex and the decoder is less complex

                    MPEG Compression
 ISO/IEC 11172-3 (MPEG-1 Audio)
 Mainly specifies the bit-stream and hence leaves the flexibility of Encoder
  design to individual developers
 Lossy and perceptually transparent
 Sampling frequencies of 32 kHz, 44.1 kHz and 48 kHz supported
 Various bit-rates from 32 to 192 kbps per channel supported
 Supports following channel modes
       Mono, Stereo, Dual Mono, Joint Stereo
 Based on complexity 3 independent layers of compression
       Layer 1 (around 192 kbps per channel)
       Layer 2 (around 128 kbps per channel)
       Layer 3 (MP3) (around 64 kbps per channel)
 Complexity increases as we go from Layer 1 to Layer 3
 CRC (optional) for error checking
 Ancillary Data support

MPEG-1 Layer 1 and Layer 2

            MPEG Layer 1 and Layer 2
 Sub-band filtering
      Polyphase filter bank
      Decompose input signal into 32 sub-bands
      Sub-bands are equally spaced (for example, for a signal sampled at
       48 kHz each sub-band is 750 Hz wide)
      Critically sampled (output of each sub-band is down sampled such that
       the number of input and output samples are the same)
      Sub-bands do not reflect the human ear’s critical bands
      Prototype filter chosen such that high side lobe attenuation (96 dB) is
       achieved
      Not perfectly lossless (error is small)
 FFT
      Done for psycho-acoustic analysis and determination of JND thresholds
      Done in parallel with the sub-band filtering
      Layer 1: 512-point and Layer 2: 1024-point
          MPEG 1 Layer 1 and Layer 2
 Block companding
      Sub-band filtering output is block-companded (normalized by a scale
       factor) such that the maximum sample amplitude in each block is unity.
      This operation is done on a block of 12 samples (8 ms at 48 KHz)
 Psycho-Acoustic analysis
      Output of the FFT block is input to the psycho-acoustic block
      This block outputs the masking threshold for each band
 Quantization and bit-allocation
      This procedure is iterative
      Bit-allocation applies JND threshold to select an optimal quantizer
       from a pre-determined set
      Quantization should satisfy both masking and bit-rate requirements
      Scale factors and quantizer selections are also coded and sent in the
       bit-stream
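
The block companding step and the 8 ms figure can be sketched as below. This is a simplification: the real layers pick the scale factor from a fixed table, whereas this example simply uses the block's peak value. The function names are illustrative.

```python
def compand_block(samples):
    """Normalize a block of sub-band samples so its peak magnitude is 1.

    Returns (scale_factor, normalized_samples). An all-zero block is
    returned unchanged with a scale factor of 1.0.
    """
    scale = max(abs(s) for s in samples)
    if scale == 0.0:
        return 1.0, list(samples)
    return scale, [s / scale for s in samples]


def block_duration_s(sample_rate_hz, sub_bands=32, samples_per_band=12):
    """Duration of the input audio covered by one companded block."""
    return sub_bands * samples_per_band / sample_rate_hz


scale, normalized = compand_block([0.5, -0.25, 0.125] * 4)
print(scale, max(abs(v) for v in normalized))   # 0.5 1.0
print(block_duration_s(48_000))                  # 0.008 s, i.e. 8 ms
```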

            MPEG Layer 1 and Layer 2
 Psycho-Acoustic Model
      Separate spectral values into tonal and non-tonal components or
       calculate tonality index
      Apply spreading function
      Set lower bound for threshold values
      Find masking threshold for each sub-band
      Calculate Signal to Mask Ratio and pass it to the bit-allocation block.

          MPEG 1 Layer 1 and Layer 2
 MPEG1 Layer 1
      Frame length of 384 samples
      32 sub-bands of length 12.
      Each group of 12 samples gets a bit-allocation and a scale-factor

 MPEG 1 Layer 2
      Enhancement of Layer 1
      More compact code for representing scale-factors, quantized samples
       and bit-allocation
      Frame length of 1152 samples
      Each sub-band = 3 groups of 12 samples each
      Each sub-band has a bit-allocation and up to 3 scale-factors

         MPEG 1 Layer 1 and Layer 2

 Bitstream

   SCFSI : Scale factor Selection information. Number of scale
   factors for each sub-band.

MPEG 1 Layer 3

 [Figure: Layer 3 encoder block diagram, from the FhG site]
                      MPEG 1 Layer 3
Main blocks
      Filter Bank
      Perceptual acoustic model
      Quantization and Coding
      Encoding of bit-stream
      Mono and stereo support
      Bit-rates up to 320 kbps
      Sampling frequencies => 32 KHz, 44.1 KHz and 48 KHz
      CBR and VBR coding
      MS-stereo and IS-stereo coding

 Enhancements over Layer 1 and Layer 2
 Higher frequency resolution due to MDCT
 Non-uniform quantization
 Uses scale-factor bands, which resemble human ear model (unlike
  sub-bands used in Layer 1 and Layer 2)
 Entropy Coding (Variable length Huffman codes)
 Better Handling of Pre-echo artifacts
 Use of Bit-reservoir

                      Hybrid Filter Bank
     Hybrid filter bank
     Better approximation of critical
      bands of human ear
     Poly-phase filter followed by
      MDCT filter bank
     Poly-phase filter bank
        Compatible with Layer 1 and Layer 2
     MDCT filter bank
        Each poly-phase frequency band is split
         into 18 finer sub-bands
        Higher frequency resolution
        Pre-echo control
        Better Alias reduction
        Block Switching


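                    Sub-band Filtering
 Input 32 new audio samples
 Build input sample vector X
      for i = 511 down to 32 do X[i] = X[i-32]
      for i = 31 down to 0 do X[i] = next input audio sample
 Window by 512 coefficients to produce vector Z
      for i = 0 to 511 do Z[i] = C[i] * X[i]
 Partial calculation
      for i = 0 to 63 do Y[i] = sum over j = 0..7 of Z[i + 64j]
 Calculate 32 samples by matrixing
      for i = 0 to 31 do S[i] = sum over k = 0..63 of M[i][k] * Y[k]
 Output 32 sub-band samples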
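
The data flow of the sub-band filtering flowchart can be sketched in Python as below. Two assumptions are made here: `WINDOW` is a hypothetical placeholder (the standard defines its own table of 512 coefficients C[i], which is not reproduced), and the matrixing coefficients use M[i][k] = cos((2i + 1)(k - 16)π / 64). The sketch shows the structure of the computation, not a conformant filter bank.

```python
import math

NUM_BANDS = 32
FIFO_LEN = 512

# HYPOTHETICAL stand-in for the standard's analysis window table C[i].
WINDOW = [(0.5 - 0.5 * math.cos(2 * math.pi * (i + 0.5) / FIFO_LEN)) / 64
          for i in range(FIFO_LEN)]

# Matrixing coefficients M[i][k].
MATRIX = [[math.cos((2 * i + 1) * (k - 16) * math.pi / 64) for k in range(64)]
          for i in range(NUM_BANDS)]


def analyze_block(fifo, new_samples):
    """Consume 32 new samples, update the 512-sample FIFO in place,
    and return 32 sub-band samples (one per band)."""
    fifo[32:] = fifo[:-32]                    # X[i] = X[i-32], i = 511..32
    for i in range(31, -1, -1):               # newest sample ends up in X[0]
        fifo[i] = new_samples[31 - i]
    z = [WINDOW[i] * fifo[i] for i in range(FIFO_LEN)]
    y = [sum(z[i + 64 * j] for j in range(8)) for i in range(64)]
    return [sum(MATRIX[i][k] * y[k] for k in range(64))
            for i in range(NUM_BANDS)]


fifo = [0.0] * FIFO_LEN
bands = analyze_block(fifo, [float(n) for n in range(32)])
print(len(bands))   # 32 outputs for 32 inputs: critically sampled
```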

                    Window Switching
 Window Switching
      Short and long windows
      Adaptive MDCT block sizes of 6 and 18 points
      Short windows to prevent pre-echo (pre-masking to hide pre-echoes)
      Long window of length 1152 samples
      Short window of length 384 samples

                Quantization and Coding
 Uses Bit-reservoir
      Bits saved from one frame are used for encoding other frame
 Non-linear quantization
           $ix(i) = \mathrm{nint}\left(\left(\frac{|xr(i)|}{2^{(qquant+quantanf)/4}}\right)^{0.75} - 0.0946\right)$

 Huffman encoding
      32 different Huffman code tables available for coding
      Each table caters for different Max value that can be coded and the
       signal statistics
      Different code books for each sub-region
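
The non-linear (power-law) quantizer can be sketched as follows, reading the formula as ix = nint((|xr| / 2^((qquant + quantanf)/4))^0.75 - 0.0946). The sign is assumed to be carried separately, and `dequantize_magnitude` is an approximate inverse added for illustration only.

```python
import math


def quantize_magnitude(xr, qquant, quantanf):
    """Power-law quantizer: small values get finer resolution than large ones."""
    step = 2.0 ** ((qquant + quantanf) / 4.0)
    return int(math.floor((abs(xr) / step) ** 0.75 - 0.0946 + 0.5))


def dequantize_magnitude(ix, qquant, quantanf):
    """Approximate inverse: undo the 3/4 power and the step size."""
    step = 2.0 ** ((qquant + quantanf) / 4.0)
    return ix ** (4.0 / 3.0) * step


print(quantize_magnitude(16.0, 0, 0))   # 8
print(quantize_magnitude(16.0, 4, 0))   # 5, since the step size doubled
```

Raising qquant by 4 doubles the step size, which is how the iteration loops trade accuracy for bits.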

                Quantization and Coding
 Inner iteration loop
       Rate control loop
       Assigns shorter codes to more frequently used values
       Does Huffman coding and quantization
       Keeps increasing global gain till quantization values are small
        enough to be encoded by the available number of bits
 Outer iteration loop
       Noise control loop
       If quantization noise exceeds the masking threshold in any band
        then it increases the scale factor for that band
       Executed till noise is less than the masking threshold
 [Figure: Layer III outer iteration loop flowchart: BEGIN; inner iteration
  loop; calculate the distortion for each critical band; save scaling
  factors of the critical bands; preemphasis; amplify critical bands with
  more than the allowed distortion; all critical bands amplified? (y/n);
  amplification of all bands below upper limit?; at least one band with
  more than the allowed distortion? (n); restore scaling factors]
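
The rate control (inner) loop can be sketched as a search for the smallest quantizer step that fits the bit budget. This is a heavily simplified illustration: `count_bits` is a hypothetical cost model standing in for the real Huffman tables, and the outer noise control loop with per-band scale factors is not shown.

```python
import math


def quantize_magnitude(xr, qquant, quantanf=0):
    """Power-law quantizer; larger qquant means a coarser step."""
    step = 2.0 ** ((qquant + quantanf) / 4.0)
    return int(math.floor((abs(xr) / step) ** 0.75 - 0.0946 + 0.5))


def count_bits(values):
    """HYPOTHETICAL cost model: magnitude bits plus one sign bit per value."""
    return sum(v.bit_length() + 1 for v in values)


def rate_loop(spectrum, bit_budget, max_steps=256):
    """Increase the quantizer step until the coded block fits the budget."""
    for qquant in range(max_steps):
        quantized = [quantize_magnitude(x, qquant) for x in spectrum]
        if count_bits(quantized) <= bit_budget:
            return qquant, quantized
    raise ValueError("bit budget cannot be met")


qquant, quantized = rate_loop([100.0, 50.0, 25.0, 12.5], bit_budget=16)
print(qquant, quantized, count_bits(quantized))
```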

         Bit-reservoir and Back-frames

 Encoder can donate bits to bit-reservoir and can borrow bits from
  the bit-reservoir
 9-bit pointer for pointing to main data begin (starting byte of audio
  data for that frame)
 Theoretically the main data begin cannot be greater than 7680 bits
  (frame length for frame of 320 kbps at 48 KHz)
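
The 7680-bit figure is the frame length arithmetic for 1152 samples per frame. The sketch below reproduces it and also shows how far back a 9-bit pointer can reach, assuming the pointer counts bytes. The function names are illustrative.

```python
def frame_bits(bit_rate_bps, sample_rate_hz, samples_per_frame=1152):
    """Average number of bits available per frame at a given bit rate."""
    return bit_rate_bps * samples_per_frame / sample_rate_hz


def max_pointer_offset_bits(pointer_width_bits=9):
    """Largest backward offset a byte-counting pointer of this width can express."""
    return (2 ** pointer_width_bits - 1) * 8


print(frame_bits(320_000, 48_000))    # 7680.0
print(max_pointer_offset_bits())      # 4088
```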
Advanced Audio Coding (AAC)

                        AAC Features
 Sampling Rate (8 kHz to 96 kHz)
 Bit Rates (8 kbps to 576kbps)
 Mono, Stereo and multi-channel (up to 48 channels)
 Supports both CBR and VBR
 Multiple profiles or Object Types
      Low Complexity (LC)
      Scalable Sampling Rate (SSR)
      HE (High Efficiency)
      HEv2 (High Efficiency with Parametric Stereo)

          AAC-Basic Features and Modules
 High frequency resolution transform coder (1024 lines MDCT with
  50% overlap)
 Non-uniform quantizer
 Noise shaping in scale factor bands
 Huffman Coding
 Temporal Noise Shaping (TNS)
 Perceptual Noise Substitution (PNS)
 Modules
      FilterBank
      Perceptual Model
      Quantization and Coding
      Optional tools like TNS, PNS, prediction etc

                Improvements over MP3
 Higher efficiency and simpler filter bank
       Only MDCT vs hybrid filter bank of MP3
 Higher Frequency Resolution (1024 vs 576 of MP3)
 Improved Huffman Coding table
 Window Shape adaptation (Sine and KBD)
 Enhanced Block Switching
       The window length is dynamically changed between 2048 and 256
        samples (Against 1152 and 384 of MP3). This leads to better coding
        efficiency for long blocks and less pre-echo artifacts for short blocks.
 Use of following tools only in AAC
       Temporal Noise Shaping
       Perceptual Noise Substitution
       Long Term Prediction
 More flexible joint stereo (separate for every scale band)

                          Filter Bank
 MDCT supporting block lengths of 2048 and 256 points
 Dynamic switching between long and short blocks
 50 % overlap between blocks
 Windows are of two types
       Kaiser-Bessel Derived (KBD) window
       Sine shaped Window
 In case of short blocks 8 short transforms are performed in a row to
  maintain synchronicity

            Temporal Noise Shaping (TNS)
 Forward Prediction
      Correlation between subsequent input samples exploited by quantizing
       the prediction error based on unquantized input samples
      Quantization error in the final decoded signal is adapted to PSD (Power
       Spectral Density) of the input signal
      Forward prediction done on spectral data over frequency. The temporal
       shape of the quantization error signal will appear adapted to the
       temporal shape of input signal at output of the decoder.
      With TNS, the temporal shape of the quantization noise of a filter bank
       is adapted to the envelope of the input signal; without TNS, the
       quantization noise is distributed almost uniformly over time.

             Temporal Noise Shaping (TNS)
 Tool for handling transient and pitched input signals
 Duality between time and frequency domains
       A signal with an un-flat spectrum can be coded efficiently either by
        coding spectral values or by applying predictive coding methods to
        the time-domain signal
       Duality: a transient signal (un-flat in time-domain) can be coded
        efficiently either by coding time-domain values or by applying
        predictive methods to the spectral data
 TNS uses a prediction approach in the frequency domain to shape
  the quantization noise over time
 Quantized filter coefficients transmitted
 TNS tool can be dynamically switched on and off in the stream

         Perceptual Noise Substitution (PNS)
 Available only in MPEG-4 and not in MPEG-2
 Based on the fact that the fine structure of a noise signal is of minor
  importance for the subjective perception of signal.
 Instead of transmitting actual spectrum transmit the following
       Information that this frequency region is noise-like.
       Total power in that frequency band
 PNS can be switched on and off on a scale-factor band basis.
 In decoder when a region is coded using PNS, then the decoder
  inserts randomly generated noise.
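
The decoder side of PNS can be sketched as below: generate random values for the band and scale them so the band carries the transmitted total power. The uniform noise source and the function name are assumptions for illustration; a real decoder follows the procedure defined in the standard.

```python
import math
import random


def pns_fill_band(num_lines, band_power, rng):
    """Return num_lines random spectral values whose total power is band_power."""
    noise = [rng.uniform(-1.0, 1.0) for _ in range(num_lines)]
    energy = sum(v * v for v in noise)
    if energy == 0.0:
        return [0.0] * num_lines
    gain = math.sqrt(band_power / energy)
    return [gain * v for v in noise]


band = pns_fill_band(16, 4.0, random.Random(0))
print(sum(v * v for v in band))   # about 4.0, whatever the random values are
```

Only the noise flag and the band power are transmitted, so the fine structure of the decoded band differs from the original while its energy matches.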

      Spectral Band Replication (SBR)

 Recreate High-frequencies from decoded base-band signal.
 Enhancement Technology (needs a base audio codec)
 Base codec operates at half the sampling frequency of SBR
 Bit-stream consists of the base encoder bit-stream plus SBR control parameters

SBR Decoder
  1. Decoded low-band signal analyzed using a QMF filter bank
  2. High frequency reconstruction from the lower frequency bands
  3. Reconstructed signal adaptively filtered to
     ensure appropriate spectral characteristics of each sub-band
  4. Envelope adjustment
  5. Addition of low-band signals with envelope
     adjusted high-band signals

                Parametric Stereo (PS)
 Mono Signal is encoded along with stereo Parameters as side
  information in the encoded bit-stream
 3 types of parameters are employed in parametric stereo
      Inter-Channel Intensity Difference (IID)
      Inter-Channel Cross Coherence (ICC)
      Inter-Channel Phase Difference (IPD)

Lossless Audio Compression

            Sudhir K
       Multimedia Codecs

                        Main Features

 No Loss in Quality
 Perfect Reconstruction
 Less Compression
 No Psycho-Acoustic Model required
 Applications
      High-end Audio
      Home-Theatre
      DVD Audio
 Examples
      MLP, WMA Lossless, OptimFROG, Real Lossless, Monkey’s Audio,
       FLAC, LTAC, Apple Lossless, TTA Lossless audio, MPEG-4 Audio Lossless
       Coding (ALS)
               Types of Lossless Coding

 Time domain lossless Coding
      Audio data in time-domain
      Most of the current lossless compression techniques are of this type
 Frequency domain lossless Coding
      Operate on audio data in Frequency domain
      Very few schemes like LTAC

    Time Domain Lossless compression

 [Figure: Block Decomposition -> Inter-Channel Decorrelation ->
  Signal Modelling -> Entropy Coding]

 Block Decomposition
 Inter-Channel Decorrelation
 Signal Modelling
 Entropy Coding

                Inter-Channel Coding

 Redundancy between various channels
 Various Techniques
      Difference Channel Coding
      Mid-Side Stereo Coding
      Intensity Stereo Coding
      Inter-Channel Matrixing
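
For lossless coding, the mid/side transform has to be exactly invertible on integers. The sketch below shows one common integer form; it is an illustration, not the method of any particular codec in the list above. The bit dropped by the shift in `mid` is recovered from the parity of `side`.

```python
def ms_forward(left, right):
    """Integer mid/side: side keeps the full difference, mid drops one bit."""
    side = left - right
    mid = (left + right) >> 1
    return mid, side


def ms_inverse(mid, side):
    """Exact inverse: left + right has the same parity as left - right."""
    total = (mid << 1) | (side & 1)
    left = (total + side) >> 1
    right = (total - side) >> 1
    return left, right


print(ms_forward(5, 2))                  # (3, 3)
print(ms_inverse(*ms_forward(5, 2)))     # (5, 2)
```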

        Signal Modeling and Prediction
 Model input audio signal
 Difference between original and predicted audio signal should be minimal
 Model parameters and prediction error (residual) transmitted
 Computationally most complex block
 Various Techniques
      Linear Prediction
      LMS Filter or Adaptive filter
      Polynomial Curve fitting techniques
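
A fixed polynomial predictor is the simplest instance of this block. The sketch below assumes a second-order predictor, pred[n] = 2 x[n-1] - x[n-2], with missing history treated as zero; the function names are illustrative. Because everything is integer arithmetic, the decoder rebuilds the samples exactly.

```python
def encode_residual(samples):
    """Replace each sample by its error against a straight-line extrapolation."""
    residual = []
    for n, x in enumerate(samples):
        p1 = samples[n - 1] if n >= 1 else 0
        p2 = samples[n - 2] if n >= 2 else 0
        residual.append(x - (2 * p1 - p2))
    return residual


def decode_residual(residual):
    """Rebuild the samples by running the same predictor on decoded output."""
    samples = []
    for n, e in enumerate(residual):
        p1 = samples[n - 1] if n >= 1 else 0
        p2 = samples[n - 2] if n >= 2 else 0
        samples.append(e + 2 * p1 - p2)
    return samples


pcm = [10, 12, 14, 16, 18, 21]
res = encode_residual(pcm)
print(res)                     # [10, -8, 0, 0, 0, 1]
print(decode_residual(res))    # [10, 12, 14, 16, 18, 21]
```

Once the predictor has history, a smooth signal leaves a residual of small values, which is what the entropy coder then compresses.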

                     Entropy Coding
 Remove redundancy between bits in the bit-stream
 To compress residue or error signal further
 Many schemes
      Huffman coding
      Run length Coding
      Golomb Rice coding
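
Golomb-Rice coding suits prediction residuals because small magnitudes get short codes. The sketch below is a minimal version: signed values are first mapped to non-negative ones (zigzag mapping, an assumption of this example), the quotient is sent in unary and the remainder in k plain bits. Bit strings are used instead of packed bytes for readability.

```python
def zigzag(value):
    """Map signed to unsigned: 0, -1, 1, -2, 2 ... -> 0, 1, 2, 3, 4 ..."""
    return 2 * value if value >= 0 else -2 * value - 1


def rice_encode(value, k):
    """Unary quotient ('1' * q + '0') followed by a k-bit remainder."""
    u = zigzag(value)
    q, r = u >> k, u & ((1 << k) - 1)
    bits = "1" * q + "0"
    if k:
        bits += format(r, "0{}b".format(k))
    return bits


def rice_decode(bits, k):
    """Decode one value; returns (value, number_of_bits_consumed)."""
    q = bits.index("0")
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0
    u = (q << k) | r
    value = u >> 1 if u % 2 == 0 else -((u + 1) >> 1)
    return value, q + 1 + k


print(rice_encode(5, 2))      # 11010
print(rice_encode(-3, 2))     # 1001
print(rice_decode("11010", 2))
```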


Deleted Slides

                             Filter Banks

 Time-frequency analysis block
 Parallel bank of bandpass filters covering entire spectrum
 Divide signal spectrum into frequency sub-bands

 [Figure: M-channel analysis/synthesis filter bank: band-pass analysis
  output, decimation by factor M, upsampling in decoder; output is
  identical to input with delay]
 Critically sampled or maximally decimated
                     Parametric Stereo
 [Figure: encoder and decoder block diagrams]

  $c = 10^{IID/20}$
  $\alpha = \frac{1}{2}\arccos(ICC)$

