Design of MPEG-4 AAC Encoder
Authors: Chi-Min Liu, Wen-Chieh Lee, Chung-Han Yang, KangYan Peng, Ting Chiou, Tzu-Wen Chang, Yu-Hua Hsiao, Hen-Wen Hue and Chu-Ting Chien
Outline
Introduction
Psychoacoustic Model
M/S Coding
Window Switch
Temporal Noise Shaping
Experiments & Demonstration
Conclusion
Introduction–
NCTU-AAC Encoder
Audio in
W-Switch
Psychoacoustic Model
Filterbank
Bit-Stream Packing
Bit Reservoir
TNS
M/S
Bit Allocation
Quantization
VLC
1. Introduction
Modules
Psychoacoustic Model
M/S Coding
Window Switch
Temporal Noise Shaping
Objective
Theoretical Frameworks
Quality
Complexity
2. Psychoacoustic Model
Approach
MDCT-based instead of FFT-based.
New masking models.
Detection of tonal attack bands.
Detection of tone-rich signals.
2. Psychoacoustic Model (c.1)
MDCT and FFT
The MDCT and FFT spectra are similar.
The MDCT spectrum is more chaotic because of aliasing.
Using the MDCT gives one consistent spectrum for both the analysis and the encoding process.
2. Psychoacoustic Model (c.2)
DCT Spectrum
Q-bands instead of lines or P-bands.
Tone/noise information based on band flatness instead of frame predictivity:
\[ flatness_b = \frac{GM_b}{AM_b}, \qquad GM_b = \left( \prod_{i=0}^{N-1} x_i \right)^{1/N}, \qquad AM_b = \frac{1}{N} \sum_{i=0}^{N-1} x_i \]
For a tone-rich signal in band b, flatness_b approaches 0.
For a noise-rich signal in band b, flatness_b approaches 1.
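The band flatness measure can be illustrated with a minimal sketch. It assumes x_i is the magnitude of the i-th spectral line in the band; the function name `band_flatness` and the small `eps` guard against log(0) are illustrative choices, not part of the original design.

```python
import math

def band_flatness(magnitudes, eps=1e-12):
    """Return GM/AM for one band: near 0 for tonal bands, near 1 for noise-like bands."""
    n = len(magnitudes)
    if n == 0:
        raise ValueError("band must contain at least one spectral line")
    # eps is an assumption of this sketch; it keeps the logarithm finite for zero lines.
    values = [abs(x) + eps for x in magnitudes]
    # Geometric mean computed in the log domain to avoid overflow or underflow.
    geometric_mean = math.exp(sum(math.log(v) for v in values) / n)
    arithmetic_mean = sum(values) / n
    return geometric_mean / arithmetic_mean

tonal_band = [0.001, 0.001, 10.0, 0.001]   # one dominant line
noisy_band = [1.0, 1.1, 0.9, 1.05]         # roughly equal lines

print(band_flatness(tonal_band))
print(band_flatness(noisy_band))
```

A single dominant line pulls the geometric mean far below the arithmetic mean, which is why the tonal band scores close to 0.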
2. Psychoacoustic Model-Adaptive TMN and NMT offset
Utilizing Human Perception
Hearing is less sensitive at high frequencies.
The masking effect at high frequencies is stronger than at low frequencies.
[Figure: offset versus band index (1 to 49); vertical axis from 0 to 4]
2. Psychoacoustic Model–
Tone/Harmonic
Tonal Attack and Tone-Rich Signals
Tonal attack.
Tone-rich signals.
Solution
Masking adjustment.
Disable window switch.
[Figure: original spectrum and reconstructed spectrum]
2. Psychoacoustic Model–
Concluding Remarks
New Models
Filterbank instead of FFT. SFM instead of unpredictivity. Detection of tonal attack bands. Detection of tonal-rich signals. Noise masking effect alone.
Results
Speedup of 70% for AAC and 65% for MP3.
Quality improves by 0.2 for AAC and 0.1 for MP3.
3. M/S Coding
Issues & Approach
Band-Level Switching Decision
Viterbi algorithm: from O(2^49) to O(49)
M/S Psychoacoustic Model
Conservative masking threshold
Bit Allocated to M/S Channels
Allocation entropy
Joint Design with Window Switch
Coupling
3. M/S Coding-- Viterbi Algorithm
Find the Optimal Solution
S_LR(i) and S_MS(i) represent the optimal accumulated cost found at the i-th band.
α_LR,LR, α_LR,MS, α_MS,LR and α_MS,MS represent the transition costs.
[Figure: two-state trellis (L/R and M/S) over scale factor bands 0 to 48, with node costs n_LR(i) and n_MS(i), accumulated costs S_LR(i) and S_MS(i), and the four transition costs on the edges]
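The trellis search can be sketched as a standard two-state Viterbi recursion. This is an illustrative sketch only: the slides do not say how the node costs n_LR(i), n_MS(i) or the transition costs are computed, so they are plain inputs here, and the function name `viterbi_ms_lr` is hypothetical.

```python
def viterbi_ms_lr(node_cost_lr, node_cost_ms, transition_cost):
    """Return (best_total_cost, modes), where modes[i] is 'LR' or 'MS' for band i."""
    states = ("LR", "MS")
    node = {"LR": node_cost_lr, "MS": node_cost_ms}
    n_bands = len(node_cost_lr)
    if n_bands == 0 or len(node_cost_ms) != n_bands:
        raise ValueError("both cost lists must be non-empty and of equal length")

    # Accumulated cost S_state(0) is just the node cost of band 0.
    acc = {s: node[s][0] for s in states}
    back = []  # back[i-1][state] = best predecessor state for band i
    for i in range(1, n_bands):
        new_acc, pointers = {}, {}
        for cur in states:
            best_prev = min(states, key=lambda prev: acc[prev] + transition_cost[(prev, cur)])
            new_acc[cur] = acc[best_prev] + transition_cost[(best_prev, cur)] + node[cur][i]
            pointers[cur] = best_prev
        acc = new_acc
        back.append(pointers)

    # Trace back from the cheaper final state.
    last = min(states, key=lambda s: acc[s])
    modes = [last]
    for pointers in reversed(back):
        modes.append(pointers[modes[-1]])
    modes.reverse()
    return acc[last], modes

# Toy costs for four bands (hypothetical numbers).
alpha = {("LR", "LR"): 0, ("MS", "MS"): 0, ("LR", "MS"): 1, ("MS", "LR"): 1}
cost, modes = viterbi_ms_lr([1, 5, 5, 1], [4, 1, 1, 4], alpha)
print(cost, modes)
```

Each band only needs the two accumulated costs of the previous band, so the work grows linearly with the number of bands instead of enumerating every L/R and M/S combination.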
3. M/S Coding– Frame-Level Switching
Compare the AE of M/S and L/R, where C1 is a constant factor.
If AE_MS < C1 * AE_LR, use an M/S frame.
Otherwise, use an L/R frame.
3. M/S Coding–
M/S Psychoacoustic Model
Noise of Reconstructed Signal
\[ L'_i[k] = M'_i[k] + S'_i[k], \qquad R'_i[k] = M'_i[k] - S'_i[k] \]
\[ L'_i[k] = L_i[k] + N_{L_i}[k] = M_i[k] + S_i[k] + N_{M_i}[k] + N_{S_i}[k] \]
\[ R'_i[k] = R_i[k] + N_{R_i}[k] = M_i[k] - S_i[k] + N_{M_i}[k] - N_{S_i}[k] \]
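A small numeric check may help show how noise added in the M and S channels reaches the reconstructed L and R channels. The forward transform M = (L + R) / 2, S = (L - R) / 2 is an assumption of this sketch (the slides only give the inverse relation), and all sample values are made up.

```python
def ms_encode(left, right):
    """Assumed forward transform: M = (L + R) / 2, S = (L - R) / 2."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Inverse transform: L' = M' + S', R' = M' - S'."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

# Made-up spectral lines and quantization noise (dyadic values keep the arithmetic exact).
left = [1.0, 0.5, -0.25]
right = [0.5, 0.5, 0.75]
noise_m = [0.125, -0.125, 0.0]
noise_s = [0.0, 0.25, -0.125]

mid, side = ms_encode(left, right)
noisy_mid = [m + n for m, n in zip(mid, noise_m)]
noisy_side = [s + n for s, n in zip(side, noise_s)]
left_rec, right_rec = ms_decode(noisy_mid, noisy_side)

# Noise seen in each reconstructed channel.
noise_left = [rec - orig for rec, orig in zip(left_rec, left)]
noise_right = [rec - orig for rec, orig in zip(right_rec, right)]
print(noise_left)   # equals noise_m + noise_s line by line
print(noise_right)  # equals noise_m - noise_s line by line
```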
3. M/S Coding– M/S Psychoacoustic Model
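Variance of Noise
\[ \sigma^2_{N_{L_i}} = \sigma^2_{N_{M_i}} + \sigma^2_{N_{S_i}} \le T_{L_i} \]
\[ \sigma^2_{N_{R_i}} = \sigma^2_{N_{M_i}} + \sigma^2_{N_{S_i}} \le T_{R_i} \]
\[ \sigma^2_{N_{M_i}} \le 0.5 \, \mathrm{Min}(T_{L_i}, T_{R_i}) \]
\[ \sigma^2_{N_{S_i}} \le 0.5 \, \mathrm{Min}(T_{L_i}, T_{R_i}) \]
T_X is the masking threshold of channel X.
\sigma^2_X is the variance of X.
Threshold of M/S Channels
\[ T_{M_i} = 0.5 \, \mathrm{Min}(T_{L_i}, T_{R_i}) \]
\[ T_{S_i} = 0.5 \, \mathrm{Min}(T_{L_i}, T_{R_i}) \]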
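The conservative M/S thresholds can be computed per band as in the following sketch. The function name `ms_thresholds` and the example thresholds are hypothetical; the sketch also assumes the M and S noises are uncorrelated, so that their variances add in each reconstructed channel.

```python
def ms_thresholds(threshold_left, threshold_right):
    """Return (T_M, T_S) per band, each equal to 0.5 * min(T_L, T_R)."""
    t_mid = [0.5 * min(tl, tr) for tl, tr in zip(threshold_left, threshold_right)]
    t_side = list(t_mid)  # the same bound is used for the S channel
    return t_mid, t_side

# Hypothetical L/R masking thresholds for two bands.
t_left = [4.0, 1.0]
t_right = [2.0, 3.0]
t_mid, t_side = ms_thresholds(t_left, t_right)
print(t_mid, t_side)

# If both channels sit exactly at their thresholds, the summed noise
# still stays within both the L and the R threshold of every band.
worst_case = [tm + ts for tm, ts in zip(t_mid, t_side)]
print(worst_case)
```

Halving the smaller of the two thresholds is what makes the bound conservative: even with both channels at their limit, neither L nor R exceeds its own masking threshold.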
3. M/S Coding– Allocation Entropy
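\[ SMR_{Channel_i} = \begin{cases} \dfrac{E_i}{T_i B_i} & \text{if } E_i > T_i B_i \\ 0 & \text{if } E_i \le T_i B_i \end{cases} \]
\[ AE_{Channel_i} = W_i \log(SMR_{Channel_i} + 1) \]
E_i is the energy of the i-th quantization band.
B_i is the effective bandwidth of the i-th quantization band.
W_i is the bandwidth of the i-th quantization band.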
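A minimal sketch of the allocation entropy of one band follows. The slides do not state the base of the logarithm, so base 2 is an assumption here, and the function names and input values are illustrative.

```python
import math

def band_smr(energy, threshold, effective_bandwidth):
    """SMR of one band: E / (T * B) when the energy exceeds T * B, else 0."""
    masked_level = threshold * effective_bandwidth
    return energy / masked_level if energy > masked_level else 0.0

def band_allocation_entropy(energy, threshold, effective_bandwidth, bandwidth):
    """AE of one band: W * log(SMR + 1); log base 2 is an assumption of this sketch."""
    return bandwidth * math.log2(band_smr(energy, threshold, effective_bandwidth) + 1.0)

# Audible band: E = 8, T = 1, B = 2, so SMR = 4.
print(band_allocation_entropy(8.0, 1.0, 2.0, 3.0))
# Fully masked band: E <= T * B, so SMR = 0 and the band needs no bits.
print(band_allocation_entropy(1.0, 1.0, 2.0, 3.0))
```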
3. M/S Coding–
Available Bits in the M/S Channels
For each band i < 49:
If band i is an L/R band: AE_M = AE_M + L_AE[i], AE_S = AE_S + R_AE[i]
Otherwise: AE_M = AE_M + M_AE[i], AE_S = AE_S + S_AE[i]
Channel Allocation Bits
\[ Bit_M = B \cdot \frac{AE_M}{AE_M + AE_S}, \qquad Bit_S = B \cdot \frac{AE_S}{AE_M + AE_S} \]
B is the number of bits allocated to the current frame.
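The accumulation loop and the proportional split can be written compactly as below. The function name `split_frame_bits`, the even split used when both entropies are zero, and the example numbers are assumptions of this sketch rather than details taken from the slides.

```python
def split_frame_bits(frame_bits, is_lr_band, l_ae, r_ae, m_ae, s_ae):
    """Split frame_bits between the two coded channels in proportion to AE_M and AE_S."""
    ae_m = 0.0
    ae_s = 0.0
    for i, lr_band in enumerate(is_lr_band):
        if lr_band:
            # Band coded as L/R: the first channel carries L, the second carries R.
            ae_m += l_ae[i]
            ae_s += r_ae[i]
        else:
            # Band coded as M/S.
            ae_m += m_ae[i]
            ae_s += s_ae[i]
    total = ae_m + ae_s
    if total == 0.0:
        # Assumption: with no demand in either channel, split evenly.
        return frame_bits / 2, frame_bits / 2
    bit_m = frame_bits * ae_m / total
    return bit_m, frame_bits - bit_m

# Two bands: band 0 is L/R coded, band 1 is M/S coded (hypothetical entropies).
bit_m, bit_s = split_frame_bits(
    1000, [True, False],
    l_ae=[2.0, 9.0], r_ae=[2.0, 9.0],
    m_ae=[9.0, 5.0], s_ae=[9.0, 1.0],
)
print(bit_m, bit_s)
```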
4. Window Switch
Design Issues
Window Decision
Psychoacoustic Model
Window Grouping
Joint Design with Other AAC Modules
4. Window Switch– Window Decision
Global Energy Ratio
Zero-Crossing Ratio
Tonal Attack
4. Window Switch–
Psychoacoustic Model
Models based on Long Window
Calculate SMRs for Short Windows from SMRs for Long Windows
[Figure: band SMRs for the long window mapped to band SMRs for the short window]
4. Window Switch–
Window Grouping
Calculate the Scale Factor
The bit allocation module calculates the scale factor for each band.
Error of Scale Factors
\[ E_g = \sum_{b} \sum_{w \in g} \left| sf_{b,w} - sharedsf_{g,b} \right| \cdot bandwidth_b \]
Criterion
Minimize the number of groups.
E_g in each group should be smaller than a threshold M.
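One way to apply the grouping criterion is a greedy pass over consecutive short windows, sketched below. Several points are assumptions of this sketch and not stated in the slides: the error uses the absolute scale factor difference, the shared scale factor of a group is taken as the maximum over its windows, groups are contiguous, and a greedy pass is not guaranteed to reach the minimum number of groups.

```python
def group_error(scale_factors, windows, bandwidths):
    """E_g for one candidate group; scale_factors[w][b] is the factor of window w, band b."""
    error = 0.0
    for b, width in enumerate(bandwidths):
        # Assumption: the shared scale factor is the largest one in the group.
        shared = max(scale_factors[w][b] for w in windows)
        error += sum(abs(scale_factors[w][b] - shared) for w in windows) * width
    return error

def group_windows(scale_factors, bandwidths, threshold):
    """Greedily merge consecutive windows while the group error stays below threshold."""
    if not scale_factors:
        return []
    groups = []
    current = [0]
    for w in range(1, len(scale_factors)):
        candidate = current + [w]
        if group_error(scale_factors, candidate, bandwidths) < threshold:
            current = candidate
        else:
            groups.append(current)
            current = [w]
    groups.append(current)
    return groups

# Four short windows, two bands (hypothetical scale factors and bandwidths).
sf = [[10, 10], [10, 11], [20, 20], [20, 20]]
groups = group_windows(sf, bandwidths=[4, 4], threshold=8)
print(groups)
```

Windows 0 and 1 differ by one step in a single band and merge; the jump at window 2 would push the error past the threshold, so a new group starts there.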
5. Temporal Noise Shaping
5. TNS
Three Artifacts
Error amplification at attack periods
Time-aliasing
TNS order vs. error
Design Issues?
Detection mechanism
TNS design
5. TNS
Remarks
Pre-aliasing leads to a tradeoff with pre-echo.
Post-aliasing may be masked by post-masking.
5. TNS-- Ease Aliasing Artifacts
Combining with Window Switch
Long Start and Long Stop window
6. Experiments
Psychoacoustic Model
M/S Coding
Window Switch
TNS
Overall
6. Experiments-- Test Samples
Track  Name  Time  Signal description          Category
1      es01  10    Vocal (Suzanne Vega)        Speech signal
2      es02  8     German speech               Speech signal
3      es03  7     English speech              Speech signal
4      sc01  10    Trumpet solo and orchestra  Complex sound mixtures
5      sc02  12    Orchestral piece            Complex sound mixtures
6      sc03  11    Contemporary pop music      Complex sound mixtures
7      si01  7     Harpsichord                 Single instruments
8      si02  7     Castanets                   Single instruments
9      si03  27    Pitch pipe                  Single instruments
10     sm01  11    Bagpipes                    Simple sound mixtures
11     sm02  10    Glockenspiel                Simple sound mixtures
12     sm03  13    Plucked strings             Simple sound mixtures
6. Experiments–
Psychoacoustic Model
Intel vTune 7.0
Psychoacoustic Models
P1: Psychoacoustic Model II
P4: MDCT Psychoacoustic Model
Speed up 72.58% over P1

Run          P1     P4
1            30.24  8.57
2            29.66  8.94
3            29.75  8.00
4            29.96  7.31
5            27.75  7.59
Average      29.47  8.08
Speedup (%)         72.58
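The reported speedup can be reproduced from the five measurements. The sketch below assumes the percentage is defined as the relative reduction (P1 - P4) / P1 of the average, which matches the reported figure; the helper names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)

def speedup_percent(baseline, improved):
    """Relative reduction of the improved measurement against the baseline, in percent."""
    return 100.0 * (baseline - improved) / baseline

# Measurements of the five runs, copied from the table.
p1_runs = [30.24, 29.66, 29.75, 29.96, 27.75]
p4_runs = [8.57, 8.94, 8.00, 7.31, 7.59]

p1_avg = mean(p1_runs)
p4_avg = mean(p4_runs)
speedup = speedup_percent(p1_avg, p4_avg)
print(round(p1_avg, 2), round(p4_avg, 2), round(speedup, 2))
```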
6. Experiments-
Psychoacoustic Model
Speed up 14.59% over P1

Tracks   Length  P1    P4    Percentage (%)
es01     02:51   26    19    26.92
es02     02:17   19    14    26.32
es03     04:03   36    27    25.00
sc01     02:55   22    18    18.18
sc02     03:23   28    23    17.86
sc03     03:04   27    23    14.81
si01     04:47   39    36    7.69
si02     03:05   30    26    13.33
si03     05:34   49    45    8.16
sm01     04:27   38    35    7.89
sm02     02:01   18    16    11.11
sm03     04:11   38    34    10.53
Average          30.8  26.3  14.59
6. Experiments-
Psychoacoustic Model
Category Result
P4 gets better quality than P1 for the speech signals, single instruments and simple sound mixtures.
For the complex sound mixtures, only sc02 is worse than P1.
[Figure: per-track scores of P1 and P4 for es01 to sm03; vertical axis from 0 to -4]
6. Experiments– M/S Coding
Environment
Bit reservoir, window switch and TNS are disabled.
Uses P4.
Average ODG improves by 0.39.

Track    L/R      New M/S
es01     -1.57    -0.82
es02     -2.03    -0.55
es03     -2.21    -0.84
sc01     -0.74    -0.54
sc02     -1.11    -0.83
sc03     -0.7     -0.52
si01     -1.16    -1.05
si02     -3.24    -3.01
si03     -1.29    -1.21
sm01     -0.9     -0.93
sm02     -1.54    -1.4
sm03     -1.37    -1.5
Average  -1.4883  -1.1
6. Experiments– Window Switch
Coupling Method
The average ODGs with and without the coupling method are −0.7025 and −0.8483, respectively.
Bit Rate=128Kbps, Sample Rate=44.1kHz, with Short Window and M/S
[Figure: ODG per track (es01 to sm03, plus average) for NCTU_AAC without Coupling Method and NCTU_AAC with Coupling Method; vertical axis from 0 to -1.6]
6. Experiments– TNS
Easing Aliasing Method
Improves quality for all tracks except sm01, especially for si02.
6. Experiments– Overall
Commercial Encoders
Nero 6.3
QuickTime 6.3

Track    Nero 6.3  QuickTime 6.3  NCTU-AAC
es01     -0.6      -0.32          -0.27
es02     -0.45     -0.11          -0.15
es03     -0.51     0.02           -0.23
sc01     -0.88     -0.22          -0.45
sc02     -1.38     -0.84          -0.66
sc03     -0.84     -0.64          -0.4
si01     -1.32     -0.71          -0.62
si02     -0.82     -0.72          -0.54
si03     -1.59     -0.78          -0.98
sm01     -1.36     -0.75          -0.61
sm02     -0.72     -0.37          -0.53
sm03     -1.29     -0.73          -0.62
Average  -0.98     -0.51417       -0.505

Result
NCTU-AAC has better quality than Nero 6.3 in all tracks.
NCTU-AAC has better quality than QuickTime 6.3 in 7 tracks.
NCTU-AAC performs better than these two encoders on average.
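The averages and track counts in the result can be recomputed from the per-track ODG values. The sketch below copies the values from the comparison table and treats a higher (less negative) ODG as better; the variable and function names are illustrative.

```python
# Per-track ODG values for es01..sm03, copied from the comparison table.
nero = [-0.6, -0.45, -0.51, -0.88, -1.38, -0.84, -1.32, -0.82, -1.59, -1.36, -0.72, -1.29]
quicktime = [-0.32, -0.11, 0.02, -0.22, -0.84, -0.64, -0.71, -0.72, -0.78, -0.75, -0.37, -0.73]
nctu_aac = [-0.27, -0.15, -0.23, -0.45, -0.66, -0.4, -0.62, -0.54, -0.98, -0.61, -0.53, -0.62]

def mean(values):
    return sum(values) / len(values)

def tracks_won(candidate, reference):
    """Count tracks where the candidate has the higher (better) ODG."""
    return sum(1 for c, r in zip(candidate, reference) if c > r)

avg_nero = mean(nero)
avg_quicktime = mean(quicktime)
avg_nctu = mean(nctu_aac)
wins_over_nero = tracks_won(nctu_aac, nero)
wins_over_quicktime = tracks_won(nctu_aac, quicktime)

print(round(avg_nero, 2), round(avg_quicktime, 5), round(avg_nctu, 3))
print(wins_over_nero, wins_over_quicktime)
```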
Encoders with Audio Patch Method
Track    Nero 6.3  Nero6.3 +APM  QuickTime 6.3  QT6.3 +APM  NCTU-AAC  NCTUAAC +APM
es01     -0.6      -0.38         -0.32          -0.26       -0.27     -0.28
es02     -0.45     -0.44         -0.11          -0.18       -0.15     -0.14
es03     -0.51     -0.43         0.02           -0.02       -0.23     -0.24
sc01     -0.88     -0.73         -0.22          -0.21       -0.45     -0.43
sc02     -1.38     -0.70         -0.84          -0.43       -0.66     -0.51
sc03     -0.84     -0.40         -0.64          -0.32       -0.4      -0.37
si01     -1.32     -0.52         -0.71          -0.47       -0.62     -0.43
si02     -0.82     -0.63         -0.72          -0.55       -0.54     -0.53
si03     -1.59     -0.64         -0.78          -0.43       -0.98     -0.51
sm01     -1.36     -0.83         -0.75          -0.53       -0.61     -0.46
sm02     -0.72     -0.73         -0.37          -0.38       -0.53     -0.54
sm03     -1.29     -0.55         -0.73          -0.35       -0.62     -0.42
Average  -0.98     -0.5817       -0.51417       -0.34417    -0.505    -0.4050
QuickTime 6.3 with APM gets the best quality on average.
Conclusion
Quality and Efficiency
Efficient Psychoacoustic Model
DCT-based approach. Tonal attack bands and tone-rich signals.
M/S Coding
Efficient decision method. Psychoacoustic model for M/S channels. Viterbi algorithm.
Window Switch
Switch detection. New grouping method. Psychoacoustic model for short window.
TNS
Window detection. New window switch policy.
Bit Allocation
Single loop approach.
Bit Reservoir
Two-step approach.
Filter bank
Fast DCT method.
Audio Patch Method
Zero band and high frequency extension.
5. NCTU-AAC CODEC
Audio in
W-Switch
Psychoacoustic Model
Filterbank
Bit-Stream Packing
PatchEnable Decoder
Bit Reservoir
TNS
M/S
Bit Allocation
Quantization
Effect
VLC
SC03 Original
SC03: ODG of each encoder

Encoder          ODG
QT6.3            -0.64
Nero 6.3         -0.84
Lame 3.88        -1.16
NCTU-AAC         -0.4
NCTU-MP3         -0.91
QT6.3 +APM       -0.32
Nero 6.3 +APM    -0.4
NCTU-AAC +APM    -0.37
NCTU-MP3 +APM    -0.38
Lame 3.88 +APM   -0.41
Questions