EE5359 MULTIMEDIA PROCESSING
STUDY AND COMPARISON OF AC-3, AAC AND HE-AAC AUDIO
CODECS
Under the guidance of Dr.K.R.Rao
Submitted by,
Dhatchaini Rajendran
M.S.E.E
ID # 1000636681
December 8, 2010
Acknowledgement
I would sincerely like to thank Dr. K.R.Rao for his constant guidance, support and motivation
which led to the successful completion of this project.
I would also like to thank all my friends for their support and encouragement.
Abstract:
The spectral band replication (SBR) technology is an advancement in the field of low bit rate
audio coding that enhances the performance of traditional audio coders. Coding Technologies [6],
an international company in the audio coding field, developed and marketed SBR. MPEG-AAC,
which belongs to the ISO-MPEG standard, has shown a tremendous improvement with SBR [1].
The coding efficiency of traditional audio coders increases by at least 30% with SBR [7]. SBR
is a bandwidth extension technique which exploits the strong correlation between the low and
high frequency contents of an audio signal. In this project, a performance analysis of the
MPEG-AAC audio coder and of the advanced audio coding (AAC) coder with SBR is carried
out, which includes a comparison of their coding efficiency.
Table of Contents
Abstract
List of acronyms
List of figures
List of tables
1. Overview of Perceptual Audio Coding
1.1 Psychoacoustic parameters
2. Overview of AC-3 Audio Codec
2.1 AC-3 encoder
2.2 AC-3 decoder
3. Overview of Advanced Audio Coding
3.1 Basic Profiles in AAC codec
3.2 AAC encoder and decoder
3.3 AAC Bit stream Multiplexing
4. Overview of HE-AAC
4.1 Spectral Band Replication
5. Performance analysis of the audio codecs
5.1 MUSHRA test
5.2 AAC codec
5.3 HE-AAC codec
5.4 AC-3 codec
References
List of acronyms
AAC - Advanced audio coding
AC-3 - Audio codec 3
AES - Audio Engineering Society
ADIF - Audio data interchange format.
ADTS - Audio data transport stream.
ATSC - Advanced television systems committee
CT - Coding technologies
HE-AAC - High efficiency advanced audio coding
IMDCT - Inverse modified discrete cosine transform
ISO - International organization for standardization
ITU - International telecommunication union
JAES - Journal of the Audio Engineering Society
KBD - Kaiser-Bessel derived
LC - Low complexity
LFE - Low frequencies enhancement
LTP - Long term prediction
MDCT - Modified discrete cosine transform
MPEG - Moving pictures experts group
MUSHRA - Multiple stimuli with hidden reference and anchor
PCM - Pulse code modulation
PNS - Perceptual noise substitution
SBR - Spectral band replication
SSR - Scalable sample rate
TNS - Temporal noise shaping
List of figures
Figure 1a: Block diagram of perceptual encoding/decoding scheme
Figure 1b: Graph illustrating the triangular spreading function
Figure 2a: Six channels in AC-3 codec
Figure 2b: Block diagram of AC-3 encoder
Figure 2c: Frame structure and window function of AC-3
Figure 2d: Flow diagram of the AC-3 encoding process
Figure 2e: Flow diagram of the AC-3 decoding process
Figure 3a: Block diagram of AAC encoder
Figure 3b: Block Switching and the window function
Figure 3c: Block diagram of AAC decoder
Figure 4a: AAC audio codec family
Figure 4b: original audio signal
Figure 4c: High band reconstruction through SBR
Figure 4d: AAC codec with SBR technology
List of tables:
Table 3a: ADTS header format
Table 3b: ADTS profile bits in header
Table 5a: Performance of AAC audio codec
Table 5b: Performance of HE-AAC audio codec
Table 5c: Performance of AC-3 audio codec
1. Overview of Perceptual Audio Coding
Audio coding algorithms aim to represent the audio signal with the minimum number of bits
while reproducing the signal with minimum error.
Perceptual audio coding algorithms exploit facts such as the insensitivity of the human ear to
frequencies above about 20 kHz and the redundancy in audio signals to achieve maximum
compression of the audio signal. The irrelevant information in the signal is identified by using
several psychoacoustic parameters such as absolute hearing thresholds, simultaneous masking,
critical band frequency analysis, temporal masking and the spread of masking along the basilar
membrane.
[Block diagram: digital audio input, analysis filter bank, perceptual model, quantization and coding, encoding of bitstream]
Figure 1a: Block diagram of perceptual encoding/decoding scheme [1]
The blocks in Fig. 1a are explained below:
The analysis filter bank decomposes the digital input signal into its subsampled spectral
components in the time or frequency domain.
The perceptual model uses the time domain input signal, or more commonly the output of the
analysis filter bank, together with the psychoacoustic rules to calculate the actual
masking threshold.
The spectral components are quantized and coded so that the noise introduced by
quantization is kept below the masking threshold. There are several ways of accomplishing
this step, from simple block companding to analysis-by-synthesis systems using additional
noiseless compression.
A bitstream formatter assembles the encoded bitstream, which is made up of the quantized
and coded spectral coefficients and some side information such as the bit allocation
information.
1.1 Psychoacoustic parameters:
Absolute threshold of hearing is the amount of energy required by a pure tone so that it
can be heard by the listener under noiseless conditions. It is the maximum allowable
energy level for coding distortions in the frequency domain, and this information is used
to keep the noise introduced during quantization below the threshold. It is a non-linear
function of frequency.
Critical band is a parameter used in the spectral analysis of a tone. It arises from the
fact that the human ear behaves as a set of band pass filters covering the entire 20 kHz
range. The cochlea in the inner ear acts as the spectrum analyzer, as it contains the
frequency sensitive portions. The cochlea moves on the reception of a tone and continues
until it resonates. The cochlear filter bands are quantified by the critical bandwidth,
which is a function of frequency; a distance of one critical band is referred to as one Bark.
Masking, as the name suggests, is a process where one sound makes another sound
inaudible. There is simultaneous masking in the frequency domain and non-simultaneous
masking in the time domain. There are also three different cases of masking:
Noise masking tone
Tone masking noise
Noise masking noise
Spread of masking is the predictable effect that a masker centered within one critical band
has on the detection thresholds in other critical bands. It is usually modeled in coding
applications by an approximately triangular spreading function, shown in Fig. 1b.
Figure 1b: Graph illustrating the triangular spreading function [18]
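The threshold in quiet and the critical band scale described above can be illustrated numerically. The sketch below uses two closed-form approximations that are common in the perceptual coding literature (a Terhardt-style fit for the threshold in quiet and a Zwicker-style mapping from Hz to Bark). Neither formula appears in this report, so both are assumptions made purely for illustration, and the function names are hypothetical.

```python
import math

def absolute_threshold_db(freq_hz):
    """Approximate threshold in quiet (dB SPL) for a pure tone.

    Assumed closed-form fit; not taken from this report.
    """
    f = freq_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

def hz_to_bark(freq_hz):
    """Approximate critical band rate (Bark) for a frequency in Hz.

    Assumed closed-form mapping; not taken from this report.
    """
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

if __name__ == "__main__":
    for f in (100, 1000, 3300, 10000):
        print(f, "Hz:",
              round(absolute_threshold_db(f), 2), "dB SPL,",
              round(hz_to_bark(f), 2), "Bark")
```

Under these assumed fits the threshold is lowest in the region of 3 to 4 kHz, where the ear is most sensitive, and rises steeply towards both ends of the audible range.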
2. Overview of AC-3 Audio Codec
AC-3 is an audio codec developed by Dolby Laboratories. The Dolby AC-3 audio compression
algorithm is an Advanced Television Systems Committee (ATSC) standard for digital audio
compression [2]. It is a lossy audio compression format that supports multi-channel audio and
is used in a variety of applications including digital television and DVD.
Figure 2a: Six channels in AC-3 codec [2]
There are 5 full range channels (3 Hz - 20,000 Hz). Three of them are in the front (left, right
and centre) and the other two are surround channels, as depicted in Fig. 2a. The sixth channel
ranges from 3 Hz to 120 Hz and is known as the low frequencies enhancement (LFE) channel.
This set of channels is known as “5.1” channels.
Figure 2b: Block diagram of AC-3 encoder [2]
The working of the AC-3 encoder blocks in Fig. 2b is explained here [2]. The first step in the
encoding process is to transform the representation of audio from a sequence of PCM time
samples into a sequence of blocks of frequency coefficients. This is accomplished by the
analysis filter bank. Overlapping blocks of 512 time samples are multiplied by a time window
and transformed into the frequency domain. As the blocks overlap, each PCM input sample
is represented in two sequential transformed blocks, as shown in Fig. 2c. The frequency
domain representation can therefore be decimated by a factor of two, so that each block
contains 256 frequency coefficients. Each frequency coefficient is represented by a binary
exponent and a mantissa. The set of exponents is encoded into a coarse representation of the
signal spectrum which is referred to as the spectral envelope. The core bit allocation routine
determines the number of bits used to encode each individual mantissa. The mantissa is then
quantized according to the bit allocation information. The spectral envelope and the coarsely
quantized mantissas for 6 audio blocks (1536 audio samples) are formatted into an AC-3 frame.
The AC-3 bit stream (from 32 to 640 kbps) is a sequence of AC-3 frames. The AC-3 decoder
performs the inverse of the encoder operations.
Figure 2c: Frame structure and window function of AC-3[17]
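The block and frame sizes quoted above fix the duration and the size of an AC-3 frame once a sample rate and a bit rate are chosen. The short sketch below works through that arithmetic. The 48 kHz sample rate and the 384 kbps bit rate are example values assumed here (the bit rate lies inside the 32 to 640 kbps range stated above), and the function names are hypothetical.

```python
BLOCK_LENGTH = 512                      # time samples per overlapping block
COEFFS_PER_BLOCK = BLOCK_LENGTH // 2    # 256 coefficients after decimation by two
BLOCKS_PER_FRAME = 6
SAMPLES_PER_FRAME = COEFFS_PER_BLOCK * BLOCKS_PER_FRAME  # 1536 samples per channel

def frame_duration_ms(sample_rate_hz):
    """Duration of one AC-3 frame in milliseconds."""
    return 1000.0 * SAMPLES_PER_FRAME / sample_rate_hz

def frame_size_bytes(bit_rate_bps, sample_rate_hz):
    """Number of bytes available in one frame at a constant bit rate."""
    return bit_rate_bps * SAMPLES_PER_FRAME / sample_rate_hz / 8

if __name__ == "__main__":
    # Assumed example operating point: 48 kHz, 384 kbps.
    print(frame_duration_ms(48000))         # frame duration in ms
    print(frame_size_bytes(384000, 48000))  # frame size in bytes
```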
2.1 AC-3 encoder[2]
A detailed description of the AC-3 encoder is given in this section. A flow diagram of the
encoding process is shown in fig. 2d.
Input PCM: The AC-3 encoder accepts audio signals in the form of PCM words with lengths up
to 24 bits. The output bit rate and the input sample rate are locked in order for the AC-3 sync
frame to contain 1536 samples of audio per channel. The individual input channels are high pass
filtered to remove DC components for efficient coding. The LFE channel is low pass
filtered.
Figure 2d: Flow diagram of the AC-3 encoding process [2]
Transient Detection: Transients are detected in the full-bandwidth channels in order to decide
when to switch to short length audio blocks, which improves the pre-echo performance. Block
switching to shorter blocks is employed if a transient is detected. The transient detector
operates on 512 samples for every audio block and processes 256 samples in each pass. The four
steps of transient detection are:
High pass filtering is done with a cascaded biquad direct form IIR filter with a cutoff of 8
kHz [2].
Block segmentation represents the block of 256 samples on three levels, where level 1
is the single block of length 256, level 2 is two segments of length 128 and level 3 is
four segments of length 64.
Peak detection identifies the largest magnitude in every segment on every level of the
hierarchy.
Threshold comparison checks whether there is a significant signal level in the current
block.
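The segmentation and peak detection steps listed above can be sketched as follows. This is a simplified illustration and not the detector of the AC-3 standard: it assumes the 256-sample block has already been high pass filtered, it only compares neighbouring segments inside one block, and the silence floor and peak ratios are hypothetical values chosen for the example.

```python
def segment_peaks(block):
    """Peak magnitude of every segment on levels 1, 2 and 3 (1, 2 and 4 segments)."""
    peaks = {}
    for level, num_segments in ((1, 1), (2, 2), (3, 4)):
        seg_len = len(block) // num_segments
        peaks[level] = [
            max(abs(s) for s in block[i * seg_len:(i + 1) * seg_len])
            for i in range(num_segments)
        ]
    return peaks

def has_transient(block, floor=100.0 / 32768.0, ratios=None):
    """Return True if a later segment is much louder than the one before it.

    `floor` and `ratios` are hypothetical tuning values, not standard constants.
    """
    if ratios is None:
        ratios = {2: 0.075, 3: 0.05}
    peaks = segment_peaks(block)
    if peaks[1][0] < floor:          # no significant signal level in the block
        return False
    for level, ratio in ratios.items():
        level_peaks = peaks[level]
        for k in range(1, len(level_peaks)):
            if level_peaks[k] * ratio > level_peaks[k - 1]:
                return True          # sharp rise between adjacent segments
    return False

if __name__ == "__main__":
    attack = [0.001] * 128 + [0.5] * 128   # quiet first half, loud second half
    steady = [0.5] * 256
    print(has_transient(attack), has_transient(steady))
```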
Forward transform: Windowing is done to reduce transform boundary effects and improve
frequency selectivity in the filter bank. A 512 point symmetrical window is formed from 256
coefficients used back-to-back. The time to frequency transformation is done either by one long
N = 512 point transform or by two short N = 256 point transforms.
Coupling strategy: A basic encoder uses a static coupling strategy in which the coupling
coordinates for all channels are transmitted for every other block. Advanced encoders use
dynamically variable coupling parameters. The coupling frequencies can be varied based on the
psychoacoustic model and the bit demand. Channels with a rapidly time varying power level are
removed from coupling, whereas slowly varying channels may have their coupling coordinates
sent less often.
Form coupling channel: In a basic encoder the coupling channel can be formed by adding the
individual channel coefficients and dividing by 8 so that the coupling channel does not exceed
the value of 1. Coupling coordinates are obtained by taking magnitude ratios within each
coupling band. The coordinates are then converted to floating point format and quantized.
Rematrixing: Power measurements are made on L, R, L+R, L-R signals within each rematrixing
band. The rematrix flag is set whenever the maximum power is found in L+R or L-R signal.
When the flag is set L+R and L-R are encoded.
Extract exponents: The number of leading zeros in the binary representation of the frequency
coefficient becomes the initial exponent value. The exponent sets are used in determining the
appropriate strategies.
Exponent strategy: The variation in exponents over frequency and time for each channel is
analyzed. It is necessary to trade off time versus frequency resolution while operating at low bit
rates. In general there is a tradeoff between the fine frequency resolution, fine time resolution
and number of bits involved in sending the exponents.
Dither Strategy: The coefficients that are quantized to zero bits will be reproduced with dither
and this will be controlled by the encoder. The purpose is to maintain the same energy in the
reproduced spectrum.
Encode exponents: The exponents of each set are preprocessed and then encoded for
transmission in the bit stream. In this step another set of exponents is generated which is
identical to the set that the decoder will use.
Normalize mantissas: Each frequency coefficient is normalized by left shifting it by the number
of bit positions given by the corresponding exponent. These mantissas are then quantized.
Core bit allocation: This routine is used by a basic encoder. The parameters involved are sent
only during block 0 as the bit allocation parameters are static. The core bit allocation is done and
the SNR is tuned till all the bits in the frame are used up.
Quantize mantissas: Each normalized mantissa is quantized by rounding it to the number of
bits indicated by the bit allocation pointer (bap) for that coefficient, which gives the
quantized mantissa.
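The exponent extraction, mantissa normalization and mantissa quantization steps described above can be illustrated with a small floating point sketch. It assumes fractional coefficients with magnitude below 1, an exponent limit of 24 and a plain uniform rounding quantizer. These choices and the function names are assumptions for illustration only, and the quantizers of an actual AC-3 encoder differ in detail.

```python
import math

MAX_EXPONENT = 24  # assumed upper limit for an exponent

def extract_exponent(coeff):
    """Number of leading zeros of |coeff| (< 1) after the binary point."""
    magnitude = abs(coeff)
    if magnitude == 0.0:
        return MAX_EXPONENT
    _, e = math.frexp(magnitude)       # magnitude = m * 2**e with 0.5 <= m < 1
    return min(max(-e, 0), MAX_EXPONENT)

def normalize_mantissa(coeff, exponent):
    """Left shift the coefficient by `exponent` bit positions."""
    return coeff * (2 ** exponent)

def quantize_mantissa(mantissa, bits):
    """Round the mantissa to `bits` bits (sign included); a simplified uniform quantizer."""
    levels = 2 ** (bits - 1)
    q = round(mantissa * levels)
    q = max(-levels, min(levels - 1, q))   # keep the value representable
    return q / levels

if __name__ == "__main__":
    coeff = 0.1                        # binary 0.000110011..., three leading zeros
    exponent = extract_exponent(coeff)
    mantissa = normalize_mantissa(coeff, exponent)
    quantized = quantize_mantissa(mantissa, bits=4)
    print(exponent, mantissa, quantized, quantized * 2.0 ** -exponent)
```

The last printed value is what a decoder would reconstruct from the exponent and the 4-bit mantissa in this simplified model.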
Pack AC-3 frame: This is the encoded AC-3 frame with all the data. This frame can be output
as a burst or in serial format.
2.2 AC-3 decoder[2]
The various steps involved in the AC-3 decoder are shown in Fig. 2e. The input is either
continuous or in burst format. The bitstream is bit or word aligned, which makes the decoder
simpler. Rapid synchronization is possible with the AC-3 bitstream. The next step is to extract
the information in the bitstream, which can be copied to a specific memory location or input
buffer. Then the exponents are decoded. This requires the number of exponents and the strategy
used to be known. Bit allocation, processing mantissas, decoupling and rematrixing, dynamic
range compression and inverse transform are the reverse processes of the steps in the encoder.
The adjacent blocks are overlapped and added together to reconstruct the final continuous time
output PCM audio signal. Downmixing is then done if the number of channels encoded in the
bitstream is greater than the number of output channels of the decoder. The PCM samples are
held in a buffer before being output. These samples can be sent to a digital to analog converter.
Figure 2e: Flow diagram of the AC-3 decoding process [2]
3. Overview of Advanced Audio Coding
The advanced audio coding (AAC) scheme was developed jointly by Dolby, Fraunhofer, AT&T,
Sony and Nokia and standardized by ISO (International Organization for Standardization) and
IEC (International Electrotechnical Commission) as a part of the MPEG-2 and MPEG-4
specifications [9]. It is a digital audio compression scheme for medium to high bit rates which
is not backward compatible with the earlier moving pictures experts group (MPEG) audio
standards. It is a wide band audio coding algorithm that supersedes its predecessor MP3 (MPEG
Layer 3 audio) by providing better audio quality at the same bit rates as the previous standards
or the same audio quality at lower bit rates. The main features of this standard are [7]:
Sample frequencies from 8 kHz to 96 kHz (MP3: 16 kHz to 48 kHz) and support for up to
48 channels.
Higher efficiency and simpler filter banks (MDCT - modified discrete cosine transform).
Better handling of frequencies above 16 kHz, superior performance at bit rates > 64 kbps
and bit rates reaching as low as 16 kbps.
AAC meets the requirements for stereo quality sound at 128 kbps and 5.1 channel audio at
320 kbps.
3.1 Basic Profiles in AAC codec [13]:
The AAC encoding follows a modular approach and the standard defines four profiles which can
be chosen based on factors like the complexity of the bitstream to be encoded, the desired
performance and the output.
Low-complexity profile is the most widely used; it omits the prediction tool and
reduces the complexity of the temporal noise-shaping tool.
Main profile uses all tools except the gain control module and provides the highest
quality for applications where the amount of random access memory (RAM) needed is
not a constraint.
Scalable sample rate (SSR) profile adds the gain control tool to the low complexity
profile and allows the least complex decoder.
Long term prediction (LTP) profile was newly introduced in MPEG-4 and reduces the
redundancy of a signal between successive coding frames.
3.2 AAC encoder and decoder [16]:
A generic block diagram of an AAC encoder is shown in fig. 3a. [3]
Figure 3a: Block diagram of AAC encoder [4]
Filterbank and block switching: The MDCT (modified discrete cosine transform) is the standard
transform used to convert the incoming audio signal from the time domain to the frequency
domain. The MDCT is a lapped, Fourier-related transform based on the type-IV DCT. Since it is
a lapped transform, the number of outputs is half the number of inputs. This transform is very
useful in signal compression applications and is used in the AAC and AC-3 audio codecs. The
MDCT is computed using the equation below [11].
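$$X_k = \sum_{n=0}^{2N-1} x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2} + \frac{N}{2}\right)\left(k + \frac{1}{2}\right)\right] \qquad (3.1)$$
k = 0, 1, ..., N-1
where X_k is the MDCT coefficient in the frequency domain and
x_n is the sample in the time domain.
The original signal is retrieved by computing the inverse MDCT (IMDCT) of each block and
adding the consecutive overlapping blocks, which cancels the errors. The formula used to
compute the IMDCT is given below [11].
$$y_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2} + \frac{N}{2}\right)\left(k + \frac{1}{2}\right)\right] \qquad (3.2)$$
n = 0, 1, ..., 2N-1
where X_k is the MDCT coefficient in the frequency domain and
y_n is the sample in the time domain.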
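The cancellation of errors by overlap-add can be checked numerically. The sketch below is a direct O(N^2) implementation of an MDCT that maps 2N windowed samples to N coefficients and of an IMDCT scaled by 1/N, followed by overlap-add of consecutive blocks. The sine window and the extra factor of 2 applied at synthesis are assumptions chosen so that the analysis and synthesis windows together satisfy the perfect reconstruction condition. The function names are hypothetical, and practical codecs use fast algorithms instead of this direct summation.

```python
import math

def mdct(block):
    """MDCT of 2N input samples, returning N coefficients."""
    n = len(block) // 2
    return [
        sum(block[i] * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
            for i in range(2 * n))
        for k in range(n)
    ]

def imdct(coeffs):
    """IMDCT of N coefficients, returning 2N time-aliased samples."""
    n = len(coeffs)
    return [
        sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
            for k in range(n)) / n
        for i in range(2 * n)
    ]

def sine_window(length):
    """Sine window; w[i]**2 + w[i + length/2]**2 == 1."""
    return [math.sin(math.pi * (i + 0.5) / length) for i in range(length)]

def analysis_synthesis(signal, n):
    """Window, transform, inverse transform and overlap-add with hop size n."""
    window = sine_window(2 * n)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n + 1, n):
        block = [signal[start + i] * window[i] for i in range(2 * n)]
        aliased = imdct(mdct(block))
        for i in range(2 * n):
            # Factor 2 compensates for applying the window twice (assumed convention).
            out[start + i] += 2.0 * window[i] * aliased[i]
    return out

if __name__ == "__main__":
    n = 4
    signal = [math.sin(0.3 * i) + 0.1 * i for i in range(8 * n)]
    rebuilt = analysis_synthesis(signal, n)
    # Only samples covered by two overlapping blocks are fully reconstructed.
    print(max(abs(a - b) for a, b in zip(signal[n:-n], rebuilt[n:-n])))
```

The first and last N samples are covered by only one block and therefore remain aliased, which is why the comparison skips them.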
The audio signal is first broken into segments called blocks. The data in each block are
modified to provide a smooth transition between blocks by applying a time domain weighting
function called a window [10]. The MDCT is then applied to the windowed blocks. One of the
challenges faced by audio coders is the selection of the optimal block size.
Figure 3b: Block Switching and the window function [19]
Intermediate transition windows between the long and short windows smooth the window
switching, as shown in Figure 3b. AAC handles the difficulty associated with coding audio
material that vacillates between steady-state and transient signals by dynamically switching
between two block lengths, 2048 samples and 256 samples, referred to as long blocks and
short blocks respectively [10]. The long blocks offer improved coding efficiency for stationary
signals and the short blocks provide optimized coding capabilities for transient signals. AAC
also switches between two window shapes for long blocks, sine-function and Kaiser-Bessel
derived (KBD), according to the complexity of the signal. The far-off rejection of the KBD
window is higher than that of the sine shaped window.
This signal adaptive selection of the transform length is an important feature and is controlled by
analyzing the short time variance of the incoming audio signal. The block synchronicity between
two channels with different block length sequences is ensured by performing eight short
transforms in a row with 50% overlap and the transition windows are used at the start and end of
a short sequence. Thus the spacing between two consecutive blocks is maintained at a constant
level of 2048 input samples.
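The trade-off between the two block lengths can be made concrete with a little arithmetic. The sketch below assumes a 48 kHz sampling frequency as an example value and a 50% overlap for both block lengths, so that each transform advances by half its length. The function names are hypothetical.

```python
LONG_BLOCK = 2048    # samples in a long block
SHORT_BLOCK = 256    # samples in a short block

def block_duration_ms(block_length, sample_rate_hz):
    """Time span covered by one block, in milliseconds."""
    return 1000.0 * block_length / sample_rate_hz

def new_samples_per_block(block_length):
    """Samples by which a 50% overlapped transform advances."""
    return block_length // 2

if __name__ == "__main__":
    fs = 48000  # assumed example sampling frequency
    print(round(block_duration_ms(LONG_BLOCK, fs), 2), "ms per long block")
    print(round(block_duration_ms(SHORT_BLOCK, fs), 2), "ms per short block")
    # Eight short transforms advance as far as one long transform,
    # which keeps channels with different block sequences in step.
    print(8 * new_samples_per_block(SHORT_BLOCK) == new_samples_per_block(LONG_BLOCK))
```

At 48 kHz a short block spans about 5.3 ms, so the quantization noise of a transient is confined to a much shorter interval than the roughly 42.7 ms of a long block.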
Filterbank and gain control: A gain control module and a processing block containing a
uniformly spaced 4-band polyphase quadrature filter (PQF) precede the MDCT. The gain
control block attenuates or amplifies the output of each PQF band to reduce pre-echo
effects. After gain control, an MDCT is applied to each PQF band; its length is one quarter
of that of the original MDCT.
Temporal noise shaping (TNS): Speech signals that vary with time are often a challenge to
conventional transform schemes owing to the fact that quantization noise is controlled over
frequency but is constant in a transform block. The TNS technique was introduced into MPEG-2
AAC to overcome this limitation. It is like a post processing step of the MDCT transform which
is used to create a continuous filter bank instead of a switched filter bank. This scheme provides
enhanced control of the location of quantization noise within a filter bank window in the time
domain. It uses the principle of duality of time and frequency domain. A prediction approach is
used in the frequency domain to shape the quantization noise over time. This is done by filtering
the original spectrum and then quantizing and the quantized filter coefficients are transmitted in a
bitstream. This is used at the decoder end to undo the filtering resulting in a temporally shaped
distribution of quantization noise in the decoded audio signal.
TNS handles signals that are between steady state and transient in nature. When a transient
signal lies at one end of a long block, quantization noise is spread throughout the audio block.
TNS allows a greater amount of information to be used to describe the non-transient locations
in the block. This results in an increase in the quantization noise of the transient, where
masking renders the noise inaudible, and a decrease of the quantization noise in the
steady-state region of the audio block [10].
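The idea of predicting along the frequency axis can be sketched with a small linear prediction example. The code below estimates predictor coefficients from the autocorrelation of a block of spectral coefficients, filters the spectrum to obtain a prediction residual (the encoder side) and undoes the filtering (the decoder side). The predictor order, the use of the Levinson-Durbin recursion and the function names are assumptions for illustration. Quantization of the residual and of the filter coefficients is omitted, so this is not the TNS tool of the standard.

```python
import math

def autocorrelation(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag."""
    return [sum(x[i] * x[i - lag] for i in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] with x[k] ~ sum(a[i] * x[k - i])."""
    a = [0.0] * (order + 1)   # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break
        k = (r[m] - sum(a[i] * r[m - i] for i in range(1, m))) / err
        prev = a[:]
        a[m] = k
        for i in range(1, m):
            a[i] = prev[i] - k * prev[m - i]
        err *= 1.0 - k * k
    return a[1:]

def tns_analysis(spectrum, coeffs):
    """Encoder side: prediction residual across frequency."""
    out = []
    for k, x in enumerate(spectrum):
        pred = sum(c * spectrum[k - i - 1]
                   for i, c in enumerate(coeffs) if k - i - 1 >= 0)
        out.append(x - pred)
    return out

def tns_synthesis(residual, coeffs):
    """Decoder side: undo the prediction filtering."""
    out = []
    for k, e in enumerate(residual):
        pred = sum(c * out[k - i - 1]
                   for i, c in enumerate(coeffs) if k - i - 1 >= 0)
        out.append(e + pred)
    return out

if __name__ == "__main__":
    # The spectrum of a transient tends to oscillate smoothly over frequency.
    spectrum = [math.cos(0.4 * k) for k in range(64)]
    coeffs = levinson_durbin(autocorrelation(spectrum, 2), 2)
    residual = tns_analysis(spectrum, coeffs)
    rebuilt = tns_synthesis(residual, coeffs)
    print(sum(e * e for e in residual) / sum(x * x for x in spectrum))
    print(max(abs(a - b) for a, b in zip(spectrum, rebuilt)))
```

The first printed value is the ratio of residual energy to spectrum energy, and the second is the reconstruction error after the decoder undoes the filter.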
Long term prediction (LTP): Redundancy reduction of stationary signal segments can be
improved by frequency domain prediction. Because stationary signals are coded with long
transform blocks, prediction is supported in long blocks and not in short blocks. The predictor
can be implemented by a second order backward adaptive lattice structure which is calculated
independently for every frequency line. The use of predicted values is controlled on a
scalefactor band basis and depends on the prediction gain in the band. A cyclic reset mechanism
which is synchronized between the encoder and decoder is used to improve the stability. A
drawback of the backward adaptive structure of the filter is that the bitstreams are sensitive
to transmission errors.
LTP is a very effective prediction tool, especially for signals which have a clear pitch
property. It reduces the redundancy of the signal between successive coding frames. The LTP
implementation is simpler and it uses a forward adaptive predictor, making it less sensitive to
round-off numerical errors in the decoder or bit errors in the transmitted spectral coefficients.
Intensity stereo: Intensity stereo coding is based on an analysis of high-frequency audio
perception, specifically its dependence on the energy-time envelope of this region of the audio
spectrum. This allows a stereo channel pair to share a single set of spectral values for the
high-frequency components while preserving the sound quality. This is achieved by maintaining
the unique envelope of each channel by means of a scaling operation so that each channel is
reproduced at its original level after decoding [10]. In this method, the left and right signals
are replaced by a single signal plus directional information, thus reducing the bit rate. It is
a lossy coding method used primarily at low bit rates.
Prediction: The prediction module is used to represent stationary or semi-stationary parts of
an audio signal. Information that repeats over sequential windows can be represented by a
repeat instruction, thus reducing the redundancy of the signal. Short blocks are used for
non-stationary or rapidly varying signals, so prediction is used only with long blocks. The
prediction process is based on a second-order backward adaptive model in which the spectral
component values of the two preceding blocks are used in conjunction with each predictor. The
prediction parameters are adapted on a block-by-block basis [10].
Mid/Side (M/S) stereo coding: M/S stereo coding is another data reduction module based on
channel pair coding and is used to increase coding efficiency. In this case channel pair elements
are analyzed as left/right and sum/difference signals on a block-by-block basis. In cases where
the M/S channel pair can be represented by fewer bits, the sum/difference spectral coefficients
are coded and a bit is set to note that the block has used M/S stereo coding. M/S stereo
achieves a significant saving in bit rate when the signal is concentrated in the middle of the
stereo image. During decoding, the decoded channel pair is de-matrixed back to its original
left/right state [10]. This scheme is used for coding at higher bitrates.
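The matrixing and the per-block decision described above can be sketched as follows. The sum/difference signals are formed as (L+R)/2 and (L-R)/2 so that the decoder recovers left and right by a simple addition and subtraction. The function bit_cost is a hypothetical stand-in for the real bit count of the noiseless coder, used here only to decide which representation looks cheaper.

```python
import math

def ms_encode(left, right):
    """Matrix a left/right pair into mid (sum) and side (difference) signals."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """De-matrix mid/side back to left/right."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

def bit_cost(coeffs):
    """Hypothetical proxy for the bits needed to code rounded coefficients."""
    return sum(math.log2(1.0 + abs(round(c))) for c in coeffs)

def use_ms(left, right):
    """Choose M/S for this block only if it looks cheaper than left/right."""
    mid, side = ms_encode(left, right)
    return bit_cost(mid) + bit_cost(side) < bit_cost(left) + bit_cost(right)

if __name__ == "__main__":
    centred = [4.0, -2.0, 6.0, 0.0]
    print(use_ms(centred, centred))                         # signal in the middle of the image
    print(use_ms([4.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]))  # signal in one channel only
```

When the two channels are identical the side signal vanishes, so the proxy favours M/S; when only one channel carries signal, the matrixing spreads it over both mid and side and left/right coding looks cheaper.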
Scalefactors: The inherent noise shaping in the non-linear quantizer is not sufficient to achieve
acceptable audio quality. To improve the audio quality the noise is shaped using scalefactors.
The scalefactors increase the SNR (signal to noise ratio) in certain bands by amplifying the
signal in those spectral regions. The bit allocation over frequency is modified as more bits are
used to code the higher spectral values. The scalefactors are transmitted within the bitstream
so that the decoder can reconstruct the original spectral values. Huffman coding is used to
reduce the redundancy within the scalefactor data.
Quantization and coding: The majority of the data reduction occurs in the quantization
phase, after the data have already achieved a certain level of compression in the previous
modules. In the AAC quantization module, the spectral data are quantized under the control of
the psychoacoustic model. The number of bits used must be below a limit determined by the
desired bit rate. Huffman coding is also applied in the form of twelve codebooks. In order to
increase the coding gain, scalefactors for bands whose spectral coefficients are all zero are not
transmitted [10]. Adaptive quantization is the primary source of bit rate reduction, and the key
components in the process are the quantization function and noise shaping. Non-linear
quantization is used as it has implicit noise shaping when compared to the conventional linear
quantizer.
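The interplay of the non-linear quantizer and the scalefactors can be illustrated with a simplified power-law quantizer. The 3/4 exponent, the rounding offset of 0.4054 and the step of 2^(1/4) per scalefactor unit are assumptions modelled on commonly described AAC encoders. They are not taken from this report, and the sign convention of the scalefactor here is chosen for readability only.

```python
ROUNDING_OFFSET = 0.4054  # assumed rounding constant

def quantize(value, scalefactor):
    """Power-law quantization; a larger scalefactor gives a coarser step."""
    scaled = abs(value) * 2.0 ** (-scalefactor / 4.0)
    q = int(scaled ** 0.75 + ROUNDING_OFFSET)
    return q if value >= 0 else -q

def dequantize(q, scalefactor):
    """Inverse of quantize(), up to the quantization error."""
    magnitude = abs(q) ** (4.0 / 3.0) * 2.0 ** (scalefactor / 4.0)
    return magnitude if q >= 0 else -magnitude

if __name__ == "__main__":
    for sf in (0, 8):
        q = quantize(100.0, sf)
        print(sf, q, round(dequantize(q, sf), 2))
```

Because of the 3/4 power, large coefficients are quantized with a larger absolute step than small ones, which is the implicit noise shaping mentioned above; the scalefactor then moves the noise level of a whole band up or down.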
Noiseless Coding: This block is used to optimize the redundancy reduction. It is nested inside
the quantization and coding module. Noiseless dynamic range compression can be applied prior
to Huffman coding. A value of +1/- 1 is placed in the quantized coefficient array to carry sign,
while magnitude and an offset from base, to mark frequency location, are transmitted as side
information. This process is only used when there is a reduction in the number of bits [10]. An
efficient grouping algorithm is used to find an optimum tradeoff between the optimum table for
each scalefactor band and minimizing the number of data elements to be transmitted.
The AAC decoder is shown in Fig. 3c.
The coding efficiency is enhanced by the following tools, which help attain higher quality at
lower bit rates [3]:
Higher frequency resolution: the number of frequency lines is increased to 1024 from
576.
Improved joint stereo coding: the flexibility of the mid/side coding and intensity coding
allows the bit rate to be reduced more frequently.
Huffman coding is applied to the coder partitions.
The following tools are used to improve the audio quality:
Enhanced block switching: A switched MDCT filterbank with an impulse response of 5.3
ms at 48 kHz sampling frequency is used. This helps in the reduction of pre-echo
artifacts [3].
TNS: An open loop prediction in the frequency domain shapes the quantization noise in
the time domain. This technique enhances the quality of speech at low bit rates.
Figure 3c: Block diagram of AAC decoder[15]
3.3 AAC Bit stream Multiplexing [8]:
AAC has a very flexible bit stream syntax. A single transport is not ideally suited to all
applications, and AAC supports two basic bit stream formats: audio data interchange format
(ADIF) and audio data transport stream (ADTS).
ADIF (audio data interchange format): This format has only one header at the
beginning of the AAC file and the rest of the data are consecutive raw data blocks. This
file format is used for simple local storage, where breaking up the audio data is
not necessary.
ADTS (audio data transport stream): This format has one header for each frame
followed by raw data. An ADTS header precedes each AAC raw data block, or each
group of 2 to 4 raw data blocks in a frame, to ensure better error robustness in
streaming environments. For this study the ADTS bit stream format is adopted. The details
of the ADTS header are given in Tables 3a and 3b.
Table 3a: ADTS header format[14]
Field name                        Field size in bits   Comment
ADTS fixed header: these do not change from frame to frame
Syncword                          12                   always: '111111111111'
ID                                1                    0: MPEG-4, 1: MPEG-2
Layer                             2                    always: '00'
protection_absent                 1
Profile                           2
Sampling_frequency_index          4
private_bit                       1
channel_configuration             3
original/copy                     1
Home                              1
ADTS variable header: this can change from frame to frame
Copyright_identification_bit      1
Copyright_identification_start    1
aac_frame_length                  13                   length of the frame including header (in bytes)
ADTS_buffer_fullness              11                   0x7FF indicates VBR
No_raw_data_blocks_in_frame       2
ADTS error check
crc_check                         16                   only if protection_absent == 0
Raw block of data                 Variable
Table 3b: ADTS profile bits in header[14]
Profile bits ID 1 (MPEG-2 profile)
00 (0) Main profile
01 (1) Low complexity profile (LC)
10 (2) Scalable sample rate profile (SSR)
11 (3) (reserved)
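The field widths in Table 3a add up to 56 bits, so the fixed and variable ADTS headers together occupy 7 bytes. The sketch below packs and parses such a header by reading the fields from the most significant bit downwards. The field names follow the table, while the function names and the example field values are hypothetical. The optional crc_check field is not handled.

```python
# (name, width in bits) in transmission order, following Table 3a.
ADTS_FIELDS = [
    ("syncword", 12), ("id", 1), ("layer", 2), ("protection_absent", 1),
    ("profile", 2), ("sampling_frequency_index", 4), ("private_bit", 1),
    ("channel_configuration", 3), ("original_copy", 1), ("home", 1),
    ("copyright_identification_bit", 1), ("copyright_identification_start", 1),
    ("aac_frame_length", 13), ("adts_buffer_fullness", 11),
    ("no_raw_data_blocks_in_frame", 2),
]
HEADER_BITS = sum(width for _, width in ADTS_FIELDS)  # 56 bits = 7 bytes

def pack_adts_header(fields):
    """Pack a dict of field values into the 7 header bytes."""
    value = 0
    for name, width in ADTS_FIELDS:
        value = (value << width) | (fields[name] & ((1 << width) - 1))
    return value.to_bytes(HEADER_BITS // 8, "big")

def parse_adts_header(data):
    """Parse the first 7 bytes of an ADTS frame into a dict of field values."""
    if len(data) < HEADER_BITS // 8:
        raise ValueError("need at least 7 bytes")
    value = int.from_bytes(data[:HEADER_BITS // 8], "big")
    remaining = HEADER_BITS
    fields = {}
    for name, width in ADTS_FIELDS:
        remaining -= width
        fields[name] = (value >> remaining) & ((1 << width) - 1)
    if fields["syncword"] != 0xFFF:
        raise ValueError("syncword not found")
    return fields

if __name__ == "__main__":
    example = {name: 0 for name, _ in ADTS_FIELDS}
    example.update(syncword=0xFFF, protection_absent=1, profile=1,
                   sampling_frequency_index=4, channel_configuration=2,
                   aac_frame_length=371, adts_buffer_fullness=0x7FF)
    header = pack_adts_header(example)
    print(header.hex())
    print(parse_adts_header(header)["aac_frame_length"])
```

In the example, profile bits 01 select the low complexity profile of Table 3b and a buffer fullness of 0x7FF marks a variable bit rate stream, as noted in Table 3a.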
4. Overview of HE-AAC
High efficiency advanced audio coding (HE-AAC) is a low bit rate audio codec defined as an
MPEG-4 audio profile belonging to the AAC family [4]. It is a combination of AAC with SBR,
where AAC is the core audio codec and SBR is a bandwidth extension technique which increases
the coding gain. The family of advanced audio codecs is shown in Fig. 4a.
HE-AAC = AAC + SBR
HE-AAC v2 = AAC + SBR + PS
Figure 4a. AAC audio codec family [20]
4.1 Spectral Band Replication (SBR):
SBR is one of the important advances in the field of audio coding. It is a bandwidth
extension technique based on the correlation between the energy components in the
high and low frequency bands of the audio signal.
SBR is an add-on to the audio coder. It is a pre-process on the encoder side and a post-
process on the decoder side. The SBR data accounts for only a small fraction of the data rate
of the combined system. The audio encoder codes the lower band of frequencies up to a certain
cutoff frequency. The higher frequencies above the cutoff are recreated from the lower band.
This reconstructed band, along with the low band, forms the full decoded audio signal. The
encoder operates at half the sampling rate of the SBR, thus increasing the frequency
resolution of the filter bank. The parameters that guide the reconstruction of the high band
are referred to as SBR data. The original and the high band reconstructed audio signals are
shown in Figures 4b and 4c respectively.
Figure 4b: original audio signal [21].
Figure 4c: High band reconstruction through SBR [21].
SBR has enabled high-quality stereo sound at bit rates as low as 48 kbps. Parametric stereo
(PS) coding is a technique that efficiently codes a stereo audio signal as a monaural signal
plus a small amount of stereo parameters. Combining this technique with SBR and AAC gives the
HE-AAC version 2 codec, also known as the enhanced aacPlus codec.
In general, a signal composed of a strong harmonic series up to a cutoff frequency has the same
harmonic series in its higher band of frequencies. This property is the principle behind SBR.
For signals that do not follow this property, tools like inverse filtering, adaptive noise
addition and sinusoidal regeneration are used to improve the reconstruction. SBR also exploits
the fact that the psychoacoustic parameters of the high band are relatively less important,
and uses the transposition technique to predict energies in the high band from knowledge of
the low band. A block diagram of the audio codec with SBR is shown in Figure 4d [4].
Figure 4d: AAC codec with SBR technology [4]
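The idea of recreating the high band from the low band can be illustrated with a toy example. The Python sketch below is a strong simplification under stated assumptions and is not the SBR algorithm of the standard: it works on a plain list of spectral magnitudes rather than on filter bank samples, the function names sbr_encode_envelope and sbr_decode are hypothetical, and tools such as inverse filtering, noise addition and sinusoidal regeneration are left out. The encoder keeps only the energy of each high-frequency band (standing in for the SBR data); the decoder copies the low band upwards (transposition) and rescales each band to the transmitted energy.

```python
import math


def sbr_encode_envelope(spectrum, cutoff, band_width):
    """Toy encoder side: energy of each high-band group of bins above the cutoff."""
    envelope = []
    for start in range(cutoff, len(spectrum), band_width):
        band = spectrum[start:start + band_width]
        envelope.append(sum(x * x for x in band))
    return envelope


def sbr_decode(low_band, envelope, band_width, total_len):
    """Toy decoder side: transpose the low band upwards, then match band energies."""
    cutoff = len(low_band)
    out = list(low_band)
    # Transposition: fill the high band with copies of the low-band bins.
    for k in range(cutoff, total_len):
        out.append(low_band[(k - cutoff) % cutoff])
    # Envelope adjustment: scale each high band to the transmitted energy.
    for i, target in enumerate(envelope):
        start = cutoff + i * band_width
        stop = min(start + band_width, total_len)
        energy = sum(x * x for x in out[start:stop])
        gain = math.sqrt(target / energy) if energy > 0 else 0.0
        for k in range(start, stop):
            out[k] *= gain
    return out


# Hypothetical 8-bin magnitude spectrum; bins 0-3 are coded, bins 4-7 are replicated.
full = [8.0, 6.0, 4.0, 3.0, 2.0, 1.5, 1.0, 0.5]
side_info = sbr_encode_envelope(full, cutoff=4, band_width=2)
decoded = sbr_decode(full[:4], side_info, band_width=2, total_len=8)
print(side_info, decoded)
```

In this sketch only two energy values are sent for the four high-band bins, which is the sense in which the side information is a small fraction of the total data.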
5. Performance analysis of the audio codecs
5.1 MUSHRA test [22]
This test is done to assess the quality of an audio compression algorithm. Multiple stimuli
with hidden reference and anchor (MUSHRA), defined by the International Telecommunication
Union (ITU), is a methodology for the subjective evaluation of audio quality. It is used
to evaluate the perceived quality of the output of lossy audio compression algorithms. The
MUSHRA methodology is recommended for assessing "intermediate audio quality". This
method requires fewer participants to obtain statistically significant results because
all codecs are presented at the same time, on the same samples, so that a paired t-test can
be used for statistical analysis. In MUSHRA, the listener is presented with the reference
(labeled as such), a certain number of test samples, a hidden version of the reference and one
or more anchors. The recommendation specifies that one anchor must be a 3.5 kHz low-pass
version of the reference. The purpose of the anchor(s) is to bring the scale closer to an
"absolute scale", making sure that minor artifacts are not rated as having very bad quality.
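The paired t-test mentioned above works on the per-listener (or per-sample) differences between the scores of two codecs. A minimal Python sketch is given below; the function name paired_t_statistic is hypothetical and the scores are made-up illustrative numbers, not results from this study.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired scores of two codecs."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical MUSHRA scores (0-100) given by four listeners to two codecs.
codec_a = [80, 75, 90, 85]
codec_b = [70, 70, 80, 80]
t, dof = paired_t_statistic(codec_a, codec_b)
print(round(t, 3), dof)
```

The resulting t value is compared against the t distribution with the returned degrees of freedom to decide whether the difference between the two codecs is statistically significant.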
5.2 AAC codec
An analysis of AAC at constant bandwidth is done for different file formats.
Length of audio sequence = 45 seconds.
Bit rate before encoding = 1536 kbps
Table 5a. Performance of AAC audio codec
File format   Bit rate after     Encoding time   Decoding time   Original    Compressed   Compression
              encoding (kbps)    (seconds)       (seconds)       size (MB)   size (kB)    ratio
ADTS          64                 8.7             3.09            8.23        353          12:1
ADIF          64                 8.7             3.51            8.23        353          12:1
AAC           64                 8.7             3.07            8.23        353          12:1
Snapshots of the encoded and decoded audio sequences are shown below.
5.3 HE-AAC codec
An analysis of HE-AAC at constant bandwidth is done and the results are tabulated below.
Length of audio sequence = 45 seconds.
Bit rate before encoding = 1536 kbps
Table 5b. Performance of HE-AAC audio codec
Bit rate after     Encoding time   Decoding time   Original    Compressed   Compression
encoding (kbps)    (seconds)       (seconds)       size (MB)   size (kB)    ratio
48                 3.0             2.0             8.23        272          30:1
32                 3.0             2.0             8.23        184          45:1
24                 3.0             2.0             8.23        140          59:1
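The compression ratios in Table 5b follow directly from the tabulated file sizes. The short Python sketch below reproduces them; the function names are hypothetical, and the conversion 1 MB = 1000 kB is an assumption chosen because it matches the tabulated ratios. It also computes the nominal payload size implied by the bit rate and the 45 second duration, which comes out slightly below the measured sizes (270, 180 and 135 kB against 272, 184 and 140 kB), presumably because of header overhead.

```python
def nominal_size_kb(bit_rate_kbps, duration_s):
    """Payload size in kB implied by a constant bit rate (8 bits per byte)."""
    return bit_rate_kbps * duration_s / 8


def compression_ratio(original_mb, compressed_kb, kb_per_mb=1000):
    """Original size divided by compressed size; 1 MB = 1000 kB is an assumption."""
    return original_mb * kb_per_mb / compressed_kb


# Sizes taken from Table 5b: (bit rate in kbps, compressed size in kB).
for bit_rate, compressed in [(48, 272), (32, 184), (24, 140)]:
    ratio = compression_ratio(8.23, compressed)
    print(bit_rate, nominal_size_kb(bit_rate, 45), f"{round(ratio)}:1")
```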
A snapshot of the encoded and decoded audio sequences is shown below:
5.4 AC-3 codec
An analysis of AC-3 at constant bandwidth is done and the results are tabulated below.
Length of audio sequence = 45 seconds.
Bit rate before encoding = 1536 kbps
Table 5c. Performance of AC-3 audio codec
Bit rate after     Encoding time   Original    Compressed   Compression
encoding (kbps)    (seconds)       size (MB)   size (kB)    ratio
32                 0.53            8.23        175          47:1
48                 0.41            8.23        263          31:1
The snapshot of the encoding and decoding process is shown below:
5.5 Conclusions:
In this project, three audio codecs (AC-3, AAC and HE-AAC) have been studied. A wave
file was encoded and decoded using these codecs, and the codecs were compared based on their
compression ratios and encoding times. HE-AAC is seen to have a better compression ratio than
AAC owing to the SBR technology it uses. AC-3 and HE-AAC have similar compression
ratios, although they are based on different standards. The results are tabulated in Tables
5a, 5b and 5c.
References:
[1] K. Brandenburg and M. Bosi, “Overview of MPEG audio: current and future standards
for low-bit-rate audio coding,” JAES, vol.45, pp.4-21, Jan./Feb. 1997.
[2] A/52 B ATSC Digital Audio Compression Standard:
http://www.atsc.org/cms/standards/a_52b.pdf
[3] D. Meares, K. Watanabe and E. Scheirer, “Report on the MPEG-2 AAC stereo verification
tests”, ISO/IEC JTC1/SC29/WG11, Feb. 1998.
[4] M. Dietz, L. Liljeryd and K. Kjörling, “Spectral band replication, a novel approach in
audio coding,” in 112th AES Convention, Munich, May 2002.
[5] F. Henn, R. Böhm and S. Meltzer, “Spectral band replication technology and its
application in broadcasting”, International Broadcasting Convention, 2003.
[6] M. Dietz and S. Meltzer, “CT-aacplus – a state of the art audio coding scheme”, Coding
Technologies, EBU Technical review, July 2002.
[7] P. Ekstrand, “Bandwidth extension of audio signals by spectral band replication”, IEEE
Benelux Workshop on Model based Processing and Coding of Audio (MPCA-2002), Nov. 15,
2002.
[8] International Standard ISO/IEC 11172-3:1993, “Information technology – Coding of
moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s – Part
3: Audio,” ISO/IEC, 1993.
[9] ISO/IEC IS 13818-7, “Information technology – Generic coding of moving pictures and
associated audio information Part 7: advanced audio coding (AAC)”, 1997.
[10] M. Bosi and R.E. Goldberg, “Introduction to digital audio coding standards”, Norwell.
MA: Kluwer, 2003.
[11] H.S. Malvar, “Signal processing with lapped transforms”, Artech House: Norwood MA,
1992.
[12] T.Ogunfunmi and M.Narasimha, “Principles of speech coding”, Boca Raton, FL: CRC
Press, 2010.
[13] X. Hu et al., “An efficient low complexity encoder for MPEG advanced coding”, ICACT
2006, pp. 1501-1505, Feb. 20-22, 2006.
[14] H. Kalva et al., “Implementing multiplexing, streaming and server interaction for
MPEG-4”, IEEE Transactions on circuits and systems for video technology, vol. 9, No. 8,
pp. 1299-1311, Dec. 1999.
[15] MPEG–2 Advanced audio coding, AAC. International Standard IS 13818–7, ISO/IEC
JTC1/SC29 WG11, 1997.
[16] H. Murugan, “Multiplexing H264 video bit-stream with AAC audio bit-stream,
demultiplexing and achieving lip sync during playback”, M.S.E.E Thesis, University of Texas
at Arlington, TX, May 2007.
[17] Dr. O. Yamada, “Technologies and services on digital broadcasting – Source coding of
audio”, CORONA publishing co., Ltd., 2002
[18] GB/T 20090.1, “Information technology - advanced coding of audio and video – Part 1:
system, Chinese AVS standard”.
[19] H.G. Ranjani and A. Kalagi, “Algorithmic delay and synchronization in MPEG audio
codecs”, Ittiam Systems Pvt. Ltd., May 2010.
[20] “MPEG-4 HE-AAC v2 – audio coding for today's digital media world”, article in the
EBU technical review (01/2006). Link: http://tech.ebu.ch/docs/techreview/trev_305-moser.pdf
[21] M. Modi, “Audio compression gets better and more complex”, link:
http://www.eetimes.com/discussion/other/4025543/Audio-compression-gets-better-and-more-complex
[22] Recommendation ITU-R BS.1534: “Method for the subjective assessment of intermediate
quality levels of coding systems”. Link: http://www.itu.int/rec/R-REC-BS.1534/en
[23] C.C. Todd, G.A. Davidson, M.F. Davis et al., “AC-3: Flexible perceptual coding for
audio transmission and storage”, Dolby Laboratories.
http://www.dolby.com/uploadedFiles/English_(US)/Professional/Technical_Library/Technologies/Dolby_Digital_(AC-3)/37_ac3-flex.pdf
Reference Web Sites:
[24] Audio coding website www.audiocoding.com