R. C. Rose and S. Parthasarathy, ICSLP02
ASR for Wireless Mobile Devices
Tutorial on ASR for Wireless Mobile Devices
ICSLP 2002, Denver, CO
Monday, September 16
R. C. Rose and S. Parthasarathy
AT&T Labs – Research, Florham Park, NJ
(rose,sps)@research.att.com
ASR for Wireless Mobile Devices
Part I: Survey of Existing Applications, Architectures, and Supporting Technology
Part II: Algorithms for Robust, Efficient ASR Applications
Part 1 - Survey of Applications, Architectures, and Technology
• Introduction
  – Mobile Applications, Devices, and Architectures
• Survey
  – Mobile Applications and Devices
  – Communications Channels
  – Supporting Technology - Microphones
• ASR Architectures for Mobile Applications
  – Distributed Functionality
  – System Requirements
Mobile Applications
• Multimodal Dialog
  – MATCH: Multimodal access to city help [Johnston et al, 2002]
  – Fujitsu Tablet PC
• Voice Form Filling
  – Multimodal Directory Retrieval [Rose and Parthasarathy, 2001]
  – Compaq Ipaq PDA
• Command and Control
  – Name/Digit Dialing
  – Motorola M70 Digital Cellular Phone
Mobile Applications
• Emphasis On:
  – Voice enabled services that are applicable to the widest variety of mobile devices
  – Robust ASR to facilitate reliable performance in the widest variety of acoustic environments
• Not Emphasized:
  – Applications requiring specialized multimedia hardware
  – Human-Human ASR Applications like language translation
[Figure: Xybernaut Poma Multi-media System; S. Captain, PCWorld.com, March 2002]
Mobile Devices
• Tablet PC: Fujitsu Stylistic LT C-500
• PDA: Sony CLIE PEG-N710
• PDA / Mobile Phone: Handspring Treo 270
• Mobile Phone: Motorola M70
Inherent Trade-offs In Designing Voice Enabled Applications
• Computing Power
• Affordability
• Display Size/Resolution
• Portability/Convenience
• Input Modalities
• Battery Life
Mobile Dialog Application Architecture
[Figure: block diagram with Application Database, Application Manager, ASR, Audio Client, GUI Client, and GUI Manager, divided into Client and Network]
Distribution Of Functionality
• ASR is only one component of multi-modal dialog architecture
• Fully Embedded Implementation [Viikki, 2001]
  – “Value added” technology enriching feature set of device
  – Enables easy access to device UI functions
  – No need for network connectivity
• Network Only Implementation
  – Voice enables services residing on a variety of platforms
  – Access to large application specific databases
  – Ease of portability across languages and applications
Automatic Speech Recognition Components
[Figure: block diagram of ASR components: Audio Client, Speech Detector, Feature Extractor, ASR Decoder, Acoustic Model Processor, Language Model Processor, Result Processor, Application Manager]
• Architectures: Designing and distributing functional blocks over different platforms
• Robustness: Efficient feature space and model space transformations for better speaker, device, or environment representations
• Efficiency: Reducing decoding time and model storage requirements
• Task Constraints: Applied to ASR results (lattices) and LMs to improve ASR performance
Mobile Applications
• Different mobile applications have different impact on:
  – Memory Requirements
  – Search Speed
  – ASR Accuracy
  – Reconfiguration Requirements
  – Display and Input Modality Requirements
• To illustrate this, look at the requirements for existing tasks:
  – Email Dictation – Very large vocabulary ASR (NAB) [Ljolje et al, 1995]
  – Travel Information Task – Spontaneous database queries for air travel between 50 cities (ATIS) [Bocchieri et al, 1995]
  – Directory Retrieval Task – Menu driven isolated utterance database queries (PDA-PQ, 3900 name employee directory) [Rose et al, 2001]
  – Command and Control – Digit Dialing (Connected digit recognizer)
Mobile Applications: Resources Requirements
Application                      Acoustic Model (densities)   Lexicon (words)   Language Model (perplexity)   Network Size (Mbytes)
1. Dictation (NAB)               330,900 (~35 MB)             65,000            237                           220
2. Travel Information (ATIS)     25,000 (~3.0 MB)             1,530             18                            0.9
3. Directory Retrieval (PDAPQ)   17,000 (~2.0 MB)             2,980             2,980                         0.3
4. Command and Control (DD)      880 (~0.1 MB)                11                11                            0.01

• Dictation:
  – High performing, unrestricted domain dictation tasks are largely restricted to resource rich machines
  – Efficient prototype embedded implementations of email dictation systems have been reported [Kumagai, IEEE Spectrum, 2002]
  – Philips Taiwan: Model Storage – 2 Mbytes, Code Size 200 Kbytes
• Directed Dialog:
  – ATIS and PDAPQ requirements are within limitations of most devices
  – Comparison of ATIS and PDAPQ Dialog Tasks
    • ATIS – Continuous speech, limited 50 city domain
    • PDAPQ – Voice form filling discrete utterances, 3900 names
    • Similar acoustic model and network storage requirements
• Command and Control:
  – Many very efficient embedded DD implementations [Yadid et al, 2002]
  – Total engine code size: 50KB, decoding memory size: < 6KB
Current Generation Mobile Device Capabilities
• Does the current generation of mobile devices facilitate voice enabled services?
• Look at Examples of Current Generation Devices
  – Cellular phones
  – Combined PDA / Cell phones: Handspring Treo
  – Personal digital assistants: Compaq Ipaq
  – Tablet PCs: Fujitsu Stylistic
  – Voice Enabled Watch Phone …
[Figure: Samsung Watch Phone]
• Current Generation Device Resource Capabilities
  – Computational Resources – ARM + DSP Processors
  – Memory Resources – From 4 to 256MB of static RAM
  – Varied Display Capabilities and Input Modalities
Computational Resources of Current Mobile Devices
Device           Processor                       Memory   Operating System   Battery Life (talk / standby)
Cellular Phone   25 MHz ARM7, TI TMS320C45 DSP   4MB      Symbian            5 hours / 6 days
Phone/PDA        33 MHz Dragonball               16MB     Palm OS            3 hours / 5 days
PDA              200 MHz StrongARM               32MB     Pocket PC/Linux    < 12 hours
TabletPC         600 MHz Pentium III             256MB    Windows            < 5 hours

• All devices have processor / memory resources to …
  – Support fully embedded multimodal dialog applications
• Other issues may determine whether implementations are practical [Viikki, 2001][Yadid et al, 2002]
  – Application Programming Interfaces and access to OS
  – Competition with other functionality on the device
  – Power limitations and battery life conservation
U.I. Capabilities of Current Mobile Devices
Device           Display            Input             Microphone         Weight    Cost
Cellular Phone   6 line LCD         keypad            Handset/ Headset   3 oz      $150
Phone/PDA        2.25” x 3.0” LCD   Pen/ keypad       Handset/ Headset   5.0 oz    $500
PDA              2.25” x 3.0” LCD   pen               Device             6.3 oz    $500
TabletPC         8.4” LCD           Pen/ handwriting  Device/ Headset    2.7 lbs   $3500

• Most devices have sufficient GUI resources to …
  – Serve as a client in distributed or network based applications
• Available devices represent a trade-off between display/input modalities and size/cost of device
  – Cell Phones: Limited U.I., and big motivation to voice enable control functions
  – Phone/PDA’s: Voice enabled services can bring functionality that exists on desktop workstations to low cost portable devices
  – Tablet PC’s: Powerful U.I. facilitates new classes of multimodal services
Typical Next Generation Wireless Devices
• Next generation devices are likely to be based on Open Standards Hardware and Software Reference Platforms
  – TI’s Open Multi-media Application Platform (OMAP)
  – Intel’s Personal Internet Client Architecture (PCA)
• Example Software Architecture (PCA):
  – Specifies operating system, middle-ware, application interfaces, and development environment
• Example Hardware Architecture (OMAP):
  – Single Chip with 200 MHz ARM925 Processor and 200 MHz TMS320C55 DSP cores
  – Flexible user interfaces facilitated by wireless interfaces (Bluetooth) to displays and transducers
Wireless Connectivity
• Wireless Voice Channels - Mobile Radio Standards
  – Effect on ASR Performance
• Wireless Voice Channels - Mobile Speech Coding Standards
  – Effect on ASR Performance
• Wireless Data Connectivity
  – 2.0, 2.5, and 3.0 Generation Standards
Wireless Voice Channels – Mobile Radio Standards
• Existing Standards:
  – GSM – Europe
  – TDMA (IS-54) – U.S.
  – CDMA (IS-95) – U.S.
  – PDC – Japan
• Fading radio channel is often modeled as a combination of
  – Background white Gaussian noise
  – Sequence of Gaussian noise bursts with Poisson spacing
• Measuring radio channel effect on ASR performance
  – Use standardized error patterns to corrupt clean speech
    • GSM Error Patterns: EP1 (good quality), EP2 (medium quality), EP3 (low quality)
  – Collect speech data transmitted over cellular network
Mobile Radio Channels – Effect on ASR Performance
• Speech Corrupted by Standard Error Patterns [Pearce, 2000]:
  – Medium quality GSM voice channel (EP2) more than doubles the word error rate relative to clean (from 2.55% to 7.0%)
  – Degradation reduced by sending features over error protected data channel

GSM Channel Error Condition   GSM Enhanced Full Rate Codec (WAC)   Aurora Distributed Speech Recognition Front-End (WAC)
Clean                         97.45%                               97.45%
EP1 (good quality)            96.0%                                97.44%
EP2 (medium quality)          93.0%                                97.39%
EP3 (low quality)             78.1%                                92.25%
* Word Accuracy (WAC) on Aurora 2 connected digit corpus (Pearce, 2000)

• Speech Data Transmitted Over GSM Channel [Sukar et al, 2002]:
  – GSM channel is less harmful to WAC than noisy mobile environment

Wireless Standard   Environment   Landline HMM Models (WAC)   Multi-Condition HMM Models (WAC)
Landline            Home/Office   96%                         95%
GSM                 Home/Office   91%                         94%
GSM                 Traffic       70%                         89%
* Word Accuracy (WAC) on SpeechDat Car connected digit corpus (Sukar et al, 2002)
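The comparisons on this slide are easier to read as error rates than as accuracies. The sketch below assumes word error rate is simply 100 minus word accuracy (insertions are not broken out in the slide), and the function names are illustrative only. The figures are copied from the error-pattern table.

```python
def word_error_rate(word_accuracy_pct):
    """Treat word error rate as 100 minus word accuracy (both in percent)."""
    return 100.0 - word_accuracy_pct


def relative_wer_increase(baseline_wac_pct, degraded_wac_pct):
    """Increase in word error rate as a fraction of the baseline error rate."""
    base = word_error_rate(baseline_wac_pct)
    degraded = word_error_rate(degraded_wac_pct)
    return (degraded - base) / base


# Word accuracies copied from the GSM error-pattern table.
gsm_efr = {"Clean": 97.45, "EP1": 96.0, "EP2": 93.0, "EP3": 78.1}
aurora_dsr = {"Clean": 97.45, "EP1": 97.44, "EP2": 97.39, "EP3": 92.25}

for name, table in (("GSM EFR", gsm_efr), ("Aurora DSR", aurora_dsr)):
    for cond in ("EP1", "EP2", "EP3"):
        ratio = word_error_rate(table[cond]) / word_error_rate(table["Clean"])
        print(f"{name} {cond}: error rate is {ratio:.2f}x the clean error rate")
```

Expressed this way, the benefit of the error protected data channel shows up as a much smaller multiple of the clean error rate for the same error pattern.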
Wireless Voice Channels – Vocoder Standards
• Speech coding standards have been evaluated and adopted by …
  – International Telecommunication Union (ITU) (standard names G.***)
  – European Telecommunications Standards Institute (ETSI)
  – Telecommunications Industry Association (TIA) (standard names IS-***)
  – Third Generation Partnership Project (3GPP)
  – Wireless organizations (GSM, PDC,…)
• Speech coding standards in wireless networks [H. G. Kang, 2001]:
  – GSM: Fixed rate - RPE-LTP 13 Kbps
  – TDMA: Fixed rate - IS641 ACELP 7.4 Kbps
  – CDMA: Variable rate coders 8.0 – 0.8 Kbps
    • IS-96 QCELP Variable Rate
    • IS-127 Enhanced Variable Rate Coder (EVRC)
• Speech coders proposed for 3G wireless networks [H. G. Kang, 2001]:
  – Narrowband:
    • Adaptive Multi-rate (AMR) 4.75 – 12.2 Kbps
  – Wideband:
    • WB-AMR 14.4 – 22.8 Kbps
    • G.722.1 16 – 24 Kbps
Wireless Voice Connectivity
• Effect of multi-rate coding standard on ASR performance [H.K. Kim, 2002]:
  – Mismatched conditions – ASR models are trained on un-coded speech
  – Narrow-band AMR coder can increase ASR word error rate by 90%
  – ASR model compensation techniques can significantly reduce degradation
  – Effects do not include channel induced degradation

Coding Standard               Baseline* (WAC)   With Model Compensation** (WAC)
Landline 64 Kbps              96                96
AMR – High Rate 12.2 Kbps     95                95.4
IS641 ACELP 7.4 Kbps          94.1              95.1
AMR – Low Rate 4.75 Kbps      92.5              94.2
Measured on Aurora 2 connected digit corpus (Pearce, 2000)
* H.K. Kim et al, 2002
** H.K. Kim et al, 2001
Wireless Data Connectivity
• Second Generation Services
  – Voice Band Data 9.6 – 14.4 Kbps
  – U.S. Cellular Digital Packet Data (CDPD)
  – European High Speed Circuit Switched Data Services (HSCSD)
• 2.5 Generation Services
  – Packet Overlays to Voice Networks: 28.8 - 64 Kbps
  – General Packet Radio Service (GPRS) – Packet Overlay to GSM
  – PDC-P – Packet Overlay to Japan’s PDC Voice Network
• Third Generation Services
  – Enhanced Data Rates for GSM Evolution (EDGE)
  – 144 Kbps (mobile), 384 Kbps (stationary)
• Other Wireless Services
  – IEEE 802.11b Wireless Local Area Networks (WiFi)
  – 11 Mbps operating in unlicensed 2.4 GHz frequency spectrum
Microphones for Mobile Applications
• Microphone Technologies
• Connectivity With Mobile Devices
• Exploratory Technologies
Microphone Technologies
• Two Major Design Issues: Transducer technology (typically electret or dynamic) and directionality (pressure gradient, noise canceling, steerable arrays)
• Pressure Gradient: Off axis sound source suppressed using the gradient of pressure arriving at different microphone surfaces
• Electronically Steerable Arrays:
  – Far Field Arrays: Table mounted, visor mounted, device mounted
  – Near Field: Close Talking Microphone Arrays (CTMA)
    • Pair of elements separated by ~10mm positioned ~30mm from mouth
[Figure: source at distance r from a pair of elements (Element 1, Element 2) separated by d; far field: r >> d, near field: r ≈ d]
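The near field / far field distinction can be made concrete with a small calculation. The sketch below assumes simple 1/r spherical spreading and a source on the array axis, neither of which is stated on the slide. The 1 m noise distance is a hypothetical value chosen for illustration.

```python
import math


def level_difference_db(r_mm, d_mm):
    """Level difference between two in-line elements for a point source.

    Assumes 1/r spherical spreading and a source on the array axis, so the
    front element is r_mm from the source and the rear element r_mm + d_mm.
    """
    return 20.0 * math.log10((r_mm + d_mm) / r_mm)


near = level_difference_db(30.0, 10.0)    # talker ~30 mm from a 10 mm pair
far = level_difference_db(1000.0, 10.0)   # hypothetical noise source 1 m away
print(f"near field: {near:.2f} dB, far field: {far:.2f} dB")
```

Under these assumptions the talker produces a clearly different level at the two elements while a distant source produces almost none. That difference is what a close talking array can exploit to favor the talker.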
Microphone Connectivity with Mobile Devices
• Analog Audio Integrated in Mobile Device:
  – Connectors, audio specifications, and microphone quality vary among devices
  – Poor SNR can result from poor electrical isolation from digital electronics
• Universal Serial Bus (USB):
  – USB pods allow microphone connectivity to any device with USB port
  – All amplifier and data conversion electronics contained in pod
• Bluetooth:
  – Personal Area Network:
    • 720 Kbps over 10 meter distance in 2.4 GHz frequency spectrum
    • Targeted price point of $10 per device
    • Integrated into fifth generation Motorola Dragonball Processors
  – Headsets contain microphone, amplifier, data conversion, and BT interface
    • Ericsson HBH-10, Toshiba
  – Some models include ASR for command and control functions
    • Toshiba
Non-Acoustic Sensors [Fisher, 1999]
• Ear Mounted Microphones
  – Both bone conduction and acoustic transducers
  – Used in very low signal to noise ratio environments or when mouth is obstructed by oxygen mask
  – Jabra, Temco, Genesys
• Camera – Video of Lips and Tongue
  – Visual features extracted from video of lip and tongue motion
  – Input to combined audio/visual ASR system
• Glottal Electro-Magnetic Sensors (GEMS):
  – Very low power radar-like sensors [Burnett et al, 1999]
  – Positioned Near Glottis: Measures motion of rear tracheal wall
  – Developed at Lawrence Livermore and commercialized by Aliph
ASR Architectures for Mobile Applications
Outline
• Functional description of a speech recognizer
• ASR over existing telecom infrastructure
  – Landline voice network
  – 2G and 3G wireless networks
  – Aurora project
• ASR over data networks
  – Distributed Speech Recognition (DSR): distribution of ASR functionality over the network
  – Sample scenarios
  – Form-filling: useful class of applications; discussion of ASR and user interface issues
Automatic Speech Recognition Components
[Figure: block diagram of ASR components: Audio Client, Speech Detection, Feature Extraction, ASR Decoder, Acoustic Model Processor, Language Model Processor, Result Processor, Application Manager]
ASR Over Voice Network
[Figure: speech passes through the Cellular / Landline Network to the Speech Recognizer]
• Features
  – ASR service is available from any telephone
  – ASR resides entirely on the server
    • Computing resources are not too constrained; leverage Moore’s law
    • ASR can be tightly coupled with application servers and databases
• Issues
  – Speech signal is subject to distortions that degrade ASR performance [Junqua, 1996]
    • Channel – especially cellular
    • Background noise at the source and its effect on speech codecs
    • Source coding – data rates higher than 8 Kbps do not degrade ASR performance
  – User interface is constrained
    • Speech only input
    • Speech only output – serial, transient
ASR Over Wireless Networks
1. Bit-stream based feature extraction [Kim, 2000][Raj, 2001]
[Figure: speech → Codec → coded parameters (bit-stream) → Voice Network → coded parameters, which feed both a Codec (reconstructed speech) and a Feature Extractor → ASR Features → Speech Recognizer]
• Features
  – Not susceptible to reconstruction losses
  – Codec parameters can be combined to improve ASR performance
  – No changes required in the handsets or transmission protocols
• Issues
  – The bit-stream has to be made available to the recognizer
    • What if there are multiple carriers / transcoders?
  – The coding scheme chosen for telephony is not optimized for ASR
ASR Over Wireless Networks
2. Distributed ASR [Pearce, 2000]
[Figure: speech feeds a Codec (coded parameters over the Voice Channel, decoded to speech by a Codec) and a Feature Extractor (ASR Features over the Data Channel to the Speech Recognizer)]
• Features
  – Improved speech recognition performance
    • Minimizes impact of channel errors: error protected data channel
    • Eliminates mismatch due to different speech codecs in use [Besacier, 2001]
  – Enables integration of speech input and data output on devices with displays
  – Potential for improved feature extraction from wideband speech or from multiple microphones
• Issues
  – Handsets have to incorporate feature extraction
  – Protocols need to be established to select between modes: speech codec or ASR features
  – Devices and networks are evolving rapidly; standards have to keep up
Distributed Speech Recognition - Aurora
• Project [ETSI]
  – Working Group within the European Telecommunications Standards Institute (ETSI)
  – Goals:
    • A standard frontend for ASR to ensure compatibility between terminal and remote recognizer
    • Applications and protocols to implement client-server ASR
• Advanced Frontend Group
  – MFCC frontend standard; includes feature compression, bit-stream formatting and error protection; contributions from a number of research groups
  – Noise-robust frontend proposal
  – A frontend for tonal languages
• Applications and Protocols Group
  – Speech and multimodal applications
    • Architecture to support speech and multimodal applications; based on SIP and RTP protocols
    • Codec switching using SIP and SDP; packet voice or ASR features; simultaneous voice and data
    • Protocols for client and server communication: draft RTP payload specification for DSR in the IETF Audio/Visual Transport working group
  – Liaison with other standards activities in IETF, 3GPP, W3C, ITU
Distributed ASR Over Data Channels
Motivation: Smart programmable devices and high-speed data networks facilitate distribution of ASR functionality.
Considerations:
• Client resources
  – New application processors that combine features of general purpose processors and DSPs (TI OMAP platform, ARM9E) provide
    • Significant computing power and programmability for implementing ASR functionality
    • Increased memory
    • Low power consumption for improved battery life
• Network resources
  – Wireless data networks: evolving more slowly than previously projected; 3G and 802.11 networks are being deployed
  – Quality of service: wide range of data rates and latency; needs for ASR not as stringent as for voice communication
• Standards
  – Prerequisite for distributing functionality (Aurora front-end)
  – Programmable devices, such as Java-enabled phones, allow some customization
• Other
  – Databases: Security and consistency need to be maintained
  – Future: ASR likely to reside on both client and in the network
    • Terminal functions using device-resident ASR
    • Information access from large databases (airline reservations, stock quotes, etc.) on network ASR
Scenarios - Outline
1. Multimodal thin client
  • Client resources limited
  • Reliable high speed data network
  • All ASR functions on the server
2. Moderate client resources, limited bandwidth network
  • Migrate some ASR functionality to the device
3. Fat client
  • ASR resides on client and server
4. Form-filling applications
  • Useful in a wide range of services
Multimodal Thin Client
[Figure: CLIENT (Audio Client, GUI Client; speech and stylus input, display) connected over a Data Network to SERVER (Endpointer, Feature Extractor, ASR Decoder with Acoustic Model and Language Model, Dialog, GUI/Speech, Text-To-Speech, Application Server, Database)]
• Client is simply an input-output device
• ASR functions are tightly coupled; changes in functional blocks are transparent to the client
• Network delays will degrade the user interface unless carefully designed
• Can be implemented
  – on currently available PDA’s such as the Compaq iPaq
  – within local area 802.11b networks
Moderate Client Resources, Limited Bandwidth
[Figure: CLIENT (Audio Client, Feature Extractor, Feature Quantizer, GUI Client, GUI Manager; speech and stylus input) sends Quantized Features over the Data Network to SERVER (Endpointer, ASR Decoder with Acoustic Model and Language Model, Dialog, GUI/Speech, Application Server, Database), which returns Display Information]
• Migrating the frontend to the client allows
  – transmission of ASR features at a low bit-rate: 4.8 Kbps in the case of Aurora
  – potential for signal processing: noise-cancellation; multiple microphones; wideband features
  – programmable features: parameterized and controlled by server; downloadable code
• A suitable client-server protocol allows information for display to be transmitted in compressed form over the network and rendered for display by the GUI manager
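To give a feel for what a 4.8 Kbps feature stream implies, the sketch below works out the bit budget per frame and shows a generic nearest-codeword vector quantizer. It assumes the 10 ms frame shift used in the feature extraction slide of this tutorial; the codebook and function names are hypothetical and are not taken from the Aurora standard.

```python
def bits_per_frame(bitrate_bps, frame_shift_ms):
    """Bits available per feature frame at a given channel rate."""
    frames_per_second = 1000.0 / frame_shift_ms
    return bitrate_bps / frames_per_second


def quantize(vector, codebook):
    """Index of the codeword closest to vector (squared Euclidean distance)."""
    def dist(codeword):
        return sum((v - c) ** 2 for v, c in zip(vector, codeword))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


# Hypothetical 2-bit codebook for one pair of feature components.
codebook = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]

index = quantize((0.8, -0.3), codebook)  # the client transmits only this index
decoded = codebook[index]                # the server looks the codeword up
budget = bits_per_frame(4800, 10)        # total bits per frame, all overhead included
```

The point of the design is that the client sends a few small indices per frame rather than coded speech, and the server reconstructs approximate features by table lookup.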
Form-Filling
• Desktop: web browser, typed input
• Mobile device: voice-enabled browser, voice input, field selection by stylus/pen
• Duplicate desktop experience on a device with a small screen and no keyboard
• Thin client model allows deployment on wide range of devices
• GUI manager on the client allows tight control and synchronization of speech and display events [Pieraccini et al, 2002]
• Standards based solution allows portability: W3C’s Multimodal Interaction Activity [W3CMMI, 2002]
Part II: Algorithms for Robust, Efficient ASR Applications
Part 2 – Algorithms for Robust, Efficient, and Flexible ASR Applications
• Fundamentals of ASR
• Robust Modeling for Mobile Domains
• Robust Acoustic Modeling Algorithms
• Efficient ASR Implementations
• Exploiting Task Constraints
• Survey of ASR Services for Mobile Devices
Part 2 – Algorithms for Robust, Efficient, and Flexible ASR Applications
• Fundamentals of ASR [Rabiner and Juang, 1993][Huang, Acero, and Hon, 2001]
  – Feature Extraction
  – Acoustic HMM Models
  – Language Models
  – Viterbi Search
• Robust Modeling for Mobile Domains
• Robust Acoustic Modeling Algorithms
• Efficient ASR Implementations
• Exploiting Task Constraints
• Survey of ASR Services for Mobile Devices
Automatic Speech Recognition Components
[Figure: block diagram of ASR components: Audio Client, Speech Detection, Feature Extraction, Viterbi Search, Acoustic HMM Models, Language Models, Result Processor, Application Manager]
Feature Extraction
• Every 10 ms compute a windowed FFT on 20 ms of speech $x_t()$
[Figure: processing chain from $x_t()$ through FFT, $|\cdot|^2$, the Mel-scale filter-bank $F_1[], \ldots, F_Q[]$ (frequency axis marked 100, 1000, 4000 Hz), and the Cosine Transform, with signals $X_t[]$, $Y_t[]$, and $O_t[]$]
• Mel-Scale Filter-bank [Davis et al, 1980]:
\[ Y_t[l] = \log\left( \frac{1}{A_l} \sum_{k=L_l}^{U_l} F_l[k]\, X_t[k] \right), \quad l = 1, \ldots, 24 \]
• Mel scale is motivated by perceptual masking phenomena [Flanagan, 1972]
• Cepstrum: Inverse cosine transform:
\[ O_t[n] = \frac{1}{Q} \sum_{l=1}^{Q} Y_t[l] \cos\left[ n \left( l - \frac{1}{2} \right) \frac{\pi}{Q} \right], \quad n = 1, \ldots, 12 \]
• Inverse cosine transform results in elements of cepstrum $O_t$ being approximately independent [Zahorian, 1979]
• Cepstrum distances shown to be equivalent to spectral slope distortion [Hanson et al, 1980]
• Augment $O_t[n]$ with local temporal information $dO_t[n]/dt$ and $d^2O_t[n]/dt^2$
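The cosine transform step and the temporal derivatives can be sketched in a few lines. The code below follows the cepstrum formula on this slide for Q log filter-bank energies. The `deltas` function uses a simple two-frame central difference with replicated edges, which is an assumption made for illustration (the slide does not say how the derivatives are estimated).

```python
import math


def cepstrum(log_energies, num_coeffs=12):
    """Cosine transform of Q log filter-bank energies Y_t[1..Q] into O_t[1..num_coeffs]."""
    q = len(log_energies)
    coeffs = []
    for n in range(1, num_coeffs + 1):
        total = 0.0
        for l, y in enumerate(log_energies, start=1):
            total += y * math.cos(n * (l - 0.5) * math.pi / q)
        coeffs.append(total / q)
    return coeffs


def deltas(frames):
    """Central difference (O_{t+1} - O_{t-1}) / 2 per coefficient, edges replicated."""
    padded = [frames[0]] + list(frames) + [frames[-1]]
    return [
        [(nxt - prv) / 2.0 for prv, nxt in zip(padded[t], padded[t + 2])]
        for t in range(len(frames))
    ]


flat = cepstrum([3.0] * 24)  # a flat log spectrum carries no cepstral shape
```

A flat log spectrum yields cepstral coefficients that are numerically zero for n = 1, …, 12, because only the n = 0 term (not computed here) captures the overall level.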
Acoustic Hidden Markov Models
• Discrete first order Markov Chain: Generates state sequence $S = s_1, \ldots, s_T$ characterized by state transition matrix, $A$, where $a_{i,j} = P(s_t = q_i \mid s_{t-1} = q_j)$ and $q_i$, $i = 1, \ldots, M$ are state indices.
• Observation Densities: Give rise to observation sequence $O_1, \ldots, O_T$ where each state has density $b_i(O_t) = p(O_t \mid s_t = q_i)$
  – Discrete: $O_t$ is a discrete index and $b_i(O_t) = \{b_{i,1}, \ldots, b_{i,N}\}$
  – Continuous: $O_t$ is a continuous random vector modeled by a mixture of Gaussians:
\[ b_i(O_t) = \sum_{j=1}^{M} c_{ij} f_{ij}(O_t), \quad \text{where } f_{ij} : N(\mu_{ij}, \sigma_{ij}) \]
• Typical sub-word phonetic HMM [Rabiner and Juang, 1993][K.F. Lee, 1989]:
  – Left-to-Right topology:
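The continuous observation density above can be evaluated as follows. This is a minimal sketch that assumes diagonal covariance Gaussians and works in the log domain for numerical stability; the function names are illustrative.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture_density(x, weights, means, variances):
    """log b_i(O_t) = log sum_j c_ij f_ij(O_t), computed with log-sum-exp."""
    terms = [
        math.log(c) + log_gaussian(x, m, v)
        for c, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# One state with two hypothetical mixture components in two dimensions.
score = log_mixture_density(
    [0.2, -0.1],
    weights=[0.6, 0.4],
    means=[[0.0, 0.0], [1.0, 1.0]],
    variances=[[1.0, 1.0], [0.5, 0.5]],
)
```

In a decoder, scores like this are computed for every active state at every frame, which is why the number of densities in the resource table matters for both memory and computation.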
Statistical Language Models
• ASR task is to identify optimum word string [Jelinek, 1999]:
\[ \hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} P(O \mid W)\, P(W) \]
  where $W = w_1, \ldots, w_N$ and $w_i \in \nu$ (the vocabulary)
• $P(W)$ must assign probability to every possible word string:
\[ P(w_1, \ldots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \ldots, w_{i-1}) \]
• Trigram:
\[ P(W) = \prod_{i=1}^{N} P(w_i \mid w_{i-1}, w_{i-2}) \]
• Issues:
  – Unseen Events: Smoothing of tri-gram probabilities
  – Complexity Measure: Perplexity (PP) measured on an independent test set
\[ LP = \lim_{N \to \infty} -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1, \ldots, w_{i-1}), \qquad PP = 2^{LP} \]
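On a finite test set the perplexity definition reduces to an average of log probabilities. The sketch below assumes the per-word conditional probabilities have already been obtained from some language model. The uniform 11-word grammar is an illustrative assumption.

```python
import math


def perplexity(word_probs):
    """PP = 2 ** LP, with LP the average negative log2 probability per word."""
    lp = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2.0 ** lp


# A grammar that allows any of 11 words with equal probability at every
# position (an illustrative assumption).
uniform_pp = perplexity([1.0 / 11] * 20)
print(uniform_pp)
```

For a uniform model the perplexity equals the number of equally likely words, which is why perplexity is often read as an average branching factor.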
Search: The Viterbi Algorithm
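Network Expansion
• Expand word network into a lexical sub-word network and into a HMM state network
[Figure: Language Model, Lexical Expansion, and HMM Expansion stages, illustrated with the words GO (G OW) and HOME (H OW M)]
Trellis Network
• Find optimum word and state sequences using the Viterbi algorithm [Rabiner and Juang, 93]:
\[ \nu_i(t) = \max_{1 \le j \le M} \left[ \nu_j(t-1)\, a_{i,j} \right] b_i(O_t) \]
  where $a_{i,j} = P(s_t = q_i \mid s_{t-1} = q_j)$ is the probability of moving from state $q_j$ to state $q_i$ and $b_i(O_t)$ is the observation density of the destination state
• Propagate paths through trellis network by recursively applying Viterbi algorithm at each node
• Beam Search: Prune paths whose scores fall more than a threshold below the best path score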
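The trellis recursion and beam pruning can be sketched as below. This is a simplified illustration over a single HMM state network in the log domain, not the decoder used by the authors. `log_trans[j][i]` holds the log probability of moving from state j to state i, and the beam is a log-score margin below the best path.

```python
import math

NEG_INF = float("-inf")


def viterbi(log_init, log_trans, log_obs, beam=math.inf):
    """Best state sequence through a trellis, in the log domain.

    log_init[i]      log probability of starting in state i
    log_trans[j][i]  log probability of moving from state j to state i
    log_obs[t][i]    log observation score of state i at time t
    beam             paths scoring more than `beam` below the best are pruned
    """
    num_states = len(log_init)
    score = [log_init[i] + log_obs[0][i] for i in range(num_states)]
    backptr = []
    for t in range(1, len(log_obs)):
        best = max(score)
        # Beam pruning: only states close enough to the best path stay active.
        active = [j for j in range(num_states) if score[j] >= best - beam]
        new_score, pointers = [], []
        for i in range(num_states):
            cand = max(active, key=lambda j: score[j] + log_trans[j][i])
            pointers.append(cand)
            new_score.append(score[cand] + log_trans[cand][i] + log_obs[t][i])
        score = new_score
        backptr.append(pointers)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda i: score[i])
    best_score = score[state]
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score
```

A narrower beam reduces the number of predecessor states examined per frame, at the risk of pruning the path that would have turned out best.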
Part 2 – Algorithms for Robust, Efficient, and Flexible ASR Applications
• Fundamentals of ASR
• Robust Modeling for Mobile Domains
  – Sources of Variability in ASR
  – Acquiring Models of Acoustic Variability
  – Configuration Server
  – Issues for Robust Modeling in Mobile Domains
• Robust Acoustic Modeling Algorithms
• Efficient ASR Implementations
• Exploiting Task Constraints
• Survey of ASR Services for Mobile Devices
Sources of Variability in ASR
• ASR systems must cope with many sources of variability:
  – Inter-Speaker
    • Physiological differences among speakers introduce variability in vocal tract characteristics
  – Acoustic Environment
    • Street noise, car noise, background speech and music
  – Communication Channel
    • Transducers, speech coders, linear channel effects, and fading
  – Intra-Speaker
    • Prosody not well modeled in current ASR systems
  – Coarticulation
    • Pronunciation variability caused by influence of surrounding articulatory events
• Intra-speaker and coarticulation based variability are not addressed here
  – Intra-Speaker - Smoothed cepstrum representations in front-end
  – Coarticulation - Context dependent phonetic units used in recognizer
Coping With Variability in Mobile Domains
• Mobility: Implies wider variety of acoustic environments than wire-line telephone or desk-top application
• Personalized Device: Representations of speaker, environment, and transducer variability can be acquired through normal use of the device
  – Enables continual, incremental update of transformation and normalization parameters
• Robust Modeling for Mobile Devices
  – Present aspects of model adaptation, environment compensation, and feature space normalization that are specific to mobile domains
Example Implementation – Configuration Server
[Figure: block diagram with Audio Client, Feature Extractor, Feature Transform, ASR Decoder, Result Processor, Application Manager, Acoustic Model Processor, Language Model Processor, Model Transform, and a Configuration Server performing Parameter Estimation; the server holds Transform Parameters, Interim Statistics, and Task Independent HMM Models, and receives Supervision (Word String / Lattice, Dialog State)]
• Configuration Server [Rose et al, 2001]:
  – Continuous unsupervised estimation of model/feature space transformations
• Computational requirements:
  – Estimation of parametric transformations
  – Application of transformation during recognition
• User specific storage requirements:
  – Parametric model/feature space transformations and interim statistics for estimating transformation parameters
• Dialog state and task completion status obtained from application provide supervision for parameter estimation algorithms
Issues for Robust Modeling in Mobile Domains
• Incremental Implementation of Robust Algorithms
  – Memory requirements for partial statistics
  – Computation for incremental update to transformation
  – Computation for applying transformation during recognition
• Side Information: Prior Information …
  – Obtained from the network, the mobile device, or from user profiles
  – Concerning speaker, channel, transducer characteristics
• Measuring Performance of Robust ASR Techniques
  – Progress has been limited by lack of speech corpora and difficulty in simulating noisy environments
  – Most robust ASR techniques have been evaluated on artificially corrupted speech data
    • Differences in vocal effort not represented
    • Noise corruption models overly simplified
    • ASR test results from simulated noisy conditions do not generally predict performance in real environments [Gajic and Rose, 2000]
  – It is difficult to compare techniques evaluated in different domains
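The three costs listed under incremental implementation (storage for partial statistics, computation for the update, computation for applying the transformation) can be seen in a very small example. The sketch below uses per-dimension mean and variance normalization as an assumed stand-in for a feature space normalization; the class name and the variance floor are hypothetical, and the tutorial does not prescribe this estimator.

```python
import math


class RunningNormalizer:
    """Keeps per-dimension count, sum, and sum of squares as interim statistics."""

    def __init__(self, dim):
        self.count = 0
        self.total = [0.0] * dim
        self.total_sq = [0.0] * dim

    def update(self, frame):
        """Fold one feature vector into the interim statistics."""
        self.count += 1
        for k, value in enumerate(frame):
            self.total[k] += value
            self.total_sq[k] += value * value

    def normalize(self, frame, floor=1e-8):
        """Apply the current mean/variance normalization to one feature vector."""
        out = []
        for k, value in enumerate(frame):
            mean = self.total[k] / self.count
            var = max(self.total_sq[k] / self.count - mean * mean, floor)
            out.append((value - mean) / math.sqrt(var))
        return out


norm = RunningNormalizer(dim=1)
for frame in ([1.0], [3.0]):
    norm.update(frame)
normalized = norm.normalize([3.0])
```

Storage here is a fixed 2 × dim + 1 numbers per user regardless of how much speech has been seen, which is the kind of property that makes an incremental scheme practical on a personalized device.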
Part 2 – Algorithms for Robust, Efficient, and Flexible ASR Applications
• Fundamentals of ASR
• Robust Modeling for Mobile Domains
• Robust Acoustic Modeling Algorithms
  – Acoustic Adaptation
  – Acoustic Environment Compensation
  – Acoustic Feature Space Normalization
• Efficient ASR Implementations
• Exploiting Task Constraints
• Survey of ASR Services for Mobile Devices
Acoustic Adaptation
• HMM training paradigms try to achieve speaker/environment independence through sampling large speaker/environment populations
• Mismatch between HMM model, λ, and new utterances, X, can be minimized by adapting the model to better explain the new utterances
• Adaptation scenarios
– Supervised: Content of utterance is known (implies enrollment scenario)
– Unsupervised: Content not known (implies continuous adaptation transparent to user)
• Optimization Criteria
– Bayesian: Prior distributions P(λ) available for model parameters. Objective function: log[P(λ)P(X|λ)]
– Maximum Likelihood: Parametric model transformation estimated to maximize adaptation data likelihood log P(X|λ)
• Data Limitation:
– Rate of Bayesian adaptation and the parameterization used for model transformation limited by the size of adaptation data, X
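The dependence of Bayesian adaptation on the amount of data can be illustrated with a minimal sketch of a MAP update for a single Gaussian mean. The form used here (prior mean weighted by a constant `tau`, data weighted by occupation counts) is a commonly used conjugate-prior update and is an assumption of this example, not something stated on the slide; `map_adapt_mean` and `tau` are hypothetical names.

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP update of one Gaussian mean (illustrative sketch).

    prior_mean: (d,) speaker-independent mean, used as the prior mode.
    frames:     (T, d) adaptation vectors X.
    posteriors: (T,) occupation probabilities of this Gaussian.
    tau:        assumed prior weight; larger values adapt more slowly.
    """
    occupancy = posteriors.sum()
    weighted_sum = posteriors @ frames
    return (tau * prior_mean + weighted_sum) / (tau + occupancy)

prior = np.zeros(2)
data = np.ones((5, 2))      # five frames, all equal to [1, 1]
gamma = np.ones(5)          # every frame assigned to this Gaussian

few = map_adapt_mean(prior, data[:1], gamma[:1])   # one frame of data
many = map_adapt_mean(prior, data, gamma)          # five frames of data
```

With one frame the mean barely moves from the prior; with five frames it moves further toward the data, which is the data limitation noted above.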
Maximum Likelihood Based Model Transformations
• Model Transformation for HMM model, λ, with Gaussians $N(\mu_i, \Sigma_i)$, $i = 1, \dots, N$, separated into classes $r = 1, \dots, R$
• Parameterize Transformations
– Mean only (MLLR) [Leggetter et al, 1994]: $\mu_i' = A_r \mu_i + b_r$
– Constrained mean and covariance (CMA) [Gales, 1998]: $\mu_i' = A_r \mu_i + b_r$, $\Sigma_i' = A_r \Sigma_i A_r^T$
– Unconstrained mean and covariance [Gales, 1998]: $\mu_i' = A_r \mu_i + b_r$, $\Sigma_i' = H_r \Sigma_i H_r^T$
• Transformations estimated to optimize objective criterion: $P(X \mid \lambda, A, b, H)$
• Incremental vs. static estimation of transformation parameters
– Must consider storage for interim statistics, computation for parameter estimation, and computation for applying transformation in recognition
• Transformation of Model vs. Transformation of Features
– Updating models can be impractical in many ASR scenarios
– Feature transformations can be easier to implement
– CMA can be applied directly in the feature domain: $X_t' = A X_t + b$
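The affine transformations above can be sketched in a few lines of NumPy. This is only an illustration of applying already-estimated parameters for one regression class; estimating `A`, `b`, and `H` from adaptation data is not shown, and all function names are hypothetical.

```python
import numpy as np

def transform_means(means, A, b):
    """Model space: mu_i' = A mu_i + b for all Gaussians in one class.
    means has shape (N, d)."""
    return means @ A.T + b

def transform_covariances(covs, H):
    """Model space: Sigma_i' = H Sigma_i H^T. covs has shape (N, d, d).
    Passing H = A gives the constrained (CMA) case."""
    return H @ covs @ H.T

def transform_features(frames, A, b):
    """Feature space: X_t' = A X_t + b, applied to every frame (T, d)."""
    return frames @ A.T + b

A = np.diag([2.0, 1.0, 0.5])
b = np.array([1.0, 0.0, -1.0])
means = np.array([[1.0, 1.0, 1.0], [0.0, 2.0, 4.0]])
covs = np.stack([np.eye(3), np.eye(3)])

new_means = transform_means(means, A, b)
new_covs = transform_covariances(covs, A)
new_frames = transform_features(means, A, b)
```

The model-space functions touch every Gaussian once per update, while the feature-space function costs one matrix multiply per frame and leaves the stored models untouched.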
Acoustic Environment Compensation
• Simple model for corruption of speech, X(f), by background, N(f), and channel, H(f), in linear spectrum domain: $Z(f) = X(f)\,|H(f)|^2 + N(f)$
• Model based acoustic environment compensation performed by …
1. Combining Speech and Noise Models: the speech model λs and the background model λb (obtained from background estimation) are combined through f(λs, λb) to give the combined model $\hat{\lambda}_Z$
2. Compensating Observations: the noisy cepstrum z is mapped through f(Z, λb) to the compensated cepstrum $\hat{x}$
• Problems for Environment Compensation Implementation
– Corruption is non-linear in cepstrum domain: $z = x + h + \log(1 + e^{\,n - x - h})$
– Requires accurate estimate of background model λb
– Requires speech / background detection mechanism
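Taking logarithms of the linear-spectrum model Z = X|H|² + N gives z = x + h + log(1 + exp(n − x − h)), where lower-case symbols are log-domain quantities. The sketch below checks this identity numerically on per-channel log energies; it deliberately ignores the cosine transform that maps log filter-bank energies to cepstra, and the function name is hypothetical.

```python
import numpy as np

def corrupt_log_spectrum(x, h, n):
    """z = x + h + log(1 + exp(n - x - h)), elementwise in the log domain."""
    return x + h + np.log1p(np.exp(n - x - h))

# Linear-domain values chosen only for illustration.
X = np.array([4.0, 1.0, 0.25])      # clean speech energies
H2 = np.array([0.5, 0.5, 0.5])      # |H(f)|^2
N = np.array([0.1, 0.1, 0.1])       # background energies

z = corrupt_log_spectrum(np.log(X), np.log(H2), np.log(N))
direct = np.log(X * H2 + N)         # same quantity computed in the linear domain
```

The correction term depends on the local signal-to-noise ratio n − x − h, which is why the distortion cannot be removed by a fixed offset.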
Acoustic Environment Compensation
• Problem: Non-linear interaction of speech and background in cepstrum domain
– 1. Perform Combination / Compensation in linear spectrum domain
  • Model Space: Parallel Model Combination [Gales and Young, 1996] [Rose et al, 1994]
  • Feature Space: Spectral Subtraction [Van Compernolle, 1989]
– 2. Approximate non-linearity using Jacobian
  • Model Space: Jacobian Adaptation [Sagayama et al, 1997]
  • Feature Space: Vector Taylor Series [Moreno et al, 1995]
– 3. Estimate compensation parameters from stereo data
  • Feature Space: Assumes availability of both clean and noise corrupted versions of utterances [Deng et al, 2000]
• Problems: Background estimation and speech detection
– Use of minimum mean squared error log spectral amplitude (MMSE-LSA) based spectrum estimation [H.K. Kim et al, 2002]
– Sequential estimation of background parameters using EM algorithm [N.S. Kim et al, 1998] [Afify et al, 2001]
Parallel Model Combination
• Parallel Model Combination: Speech model means and covariance for state i, $\mu_i$, $\Sigma_i$, combined with background model [Gales et al, 1996]
• Assumptions:
– Background does not affect alignment of speech frames with HMM states
– Speech and background are additive in the linear spectrum domain
– The sum of log normal densities is also log normal
• Combine Models in the linear spectrum domain:
– Speech model λs ($\mu_i$, $\Sigma_i$) and background model λb ($\mu_b$, $\Sigma_b$) are mapped from the cepstrum domain to the log spectrum domain ($\mu_i^{log}$, $\Sigma_i^{log}$; $\mu_b^{log}$, $\Sigma_b^{log}$) and then to the linear spectrum domain
– Combination: $\hat{\mu}_i^{lin} = g\,\mu_i^{lin} + \mu_b^{lin}$, $\hat{\Sigma}_i^{lin} = g^2\,\Sigma_i^{lin} + \Sigma_b^{lin}$
– Combined parameters are mapped back to the log spectrum domain ($\hat{\mu}_i^{log}$, $\hat{\Sigma}_i^{log}$) and to the cepstrum domain ($\hat{\mu}_i$, $\hat{\Sigma}_i$) to give the combined model $\hat{\lambda}$
• Issues for implementation in mobile domains:
– Computational Complexity: Conversion to linear spectrum domain
– Gain Estimation: Continuous tracking of SNR
– Background model estimation: Continuous on-line estimate of λb
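A minimal sketch of the combination step, assuming the log-normal approximation named in the assumptions. It starts from parameters that are already in the log spectrum domain, so the cosine transform between cepstrum and log spectrum is omitted; the moment-matching formulas are the standard ones for a log-normal variable, and all function names are hypothetical.

```python
import numpy as np

def log_to_linear(mu_log, cov_log):
    """Moments of exp(v) when v ~ N(mu_log, cov_log)."""
    mu_lin = np.exp(mu_log + 0.5 * np.diag(cov_log))
    cov_lin = np.outer(mu_lin, mu_lin) * (np.exp(cov_log) - 1.0)
    return mu_lin, cov_lin

def linear_to_log(mu_lin, cov_lin):
    """Inverse mapping under the log-normal assumption."""
    cov_log = np.log(cov_lin / np.outer(mu_lin, mu_lin) + 1.0)
    mu_log = np.log(mu_lin) - 0.5 * np.diag(cov_log)
    return mu_log, cov_log

def pmc_combine(mu_s, cov_s, mu_b, cov_b, g=1.0):
    """Combine speech and background Gaussians in the linear domain."""
    mu_s_lin, cov_s_lin = log_to_linear(mu_s, cov_s)
    mu_b_lin, cov_b_lin = log_to_linear(mu_b, cov_b)
    mu_hat = g * mu_s_lin + mu_b_lin
    cov_hat = g ** 2 * cov_s_lin + cov_b_lin
    return linear_to_log(mu_hat, cov_hat)

# Illustrative two-channel parameters (assumed values).
mu_s = np.array([2.0, 1.0])
cov_s = np.array([[0.2, 0.05], [0.05, 0.1]])
mu_b = np.array([0.0, 0.5])
cov_b = np.diag([0.1, 0.1])

mu_hat_log, cov_hat_log = pmc_combine(mu_s, cov_s, mu_b, cov_b, g=1.0)
```

Every Gaussian in the model needs the exponentials and logarithms above each time the background estimate changes, which is the computational cost flagged for mobile domains.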
Incremental Mode Environment Compensation
• Use minimum mean squared error log spectral amplitude (MMSE-LSA) based speech enhancement algorithm to estimate background [H. K. Kim et al, 2002]
– Frame synchronous update of background model
– Speech and background models additive in the cepstrum domain
• MMSE-LSA Algorithm:
– Given noisy speech spectrum $Z_t[k] = X_t[k] + N_t[k]$, with $X_t[k] = A_t[k]\,e^{j\phi(k)}$ and $Z_t[k] = R_t[k]\,e^{j\theta(k)}$
– Obtain MMSE estimate of clean speech spectral magnitude, $\hat{A}_t[k]$, to minimize $E\{(\log A_t[k] - \log \hat{A}_t[k])^2\}$
– Results in gain function, $G_t[\cdot]$, that is applied to noisy speech spectrum: $\hat{A}_t[k] = G_t[q_k, \eta_k, \gamma_k]\,R_t[k]$
– Where $q_k$: speech activity probability, $\eta_k$: a priori SNR estimate, $\gamma_k$: a posteriori SNR estimate
Incremental Mode Environment Compensation
• Model Combination using frame synchronous MMSE-LSA gain function:
– Obtain background model parameters directly from $G_t[k] = G_t(q_k, \eta_k, \gamma_k)$:
  • Transform $G_t$ to cepstrum vectors $g_t$
  • Compute background means and variances $\mu_g(t)$, $\Sigma_g(t)$ from sequence $g_1, \dots, g_t$
• Parallel Model Combination (PMC) in Cepstrum Domain:
  [Figure: speech model λs ($\mu_i$, $\Sigma_i$) in the cepstrum domain; gain $G_t[k]$ converted to cepstrum $g_t$ and averaged (expectations) to give $\mu_g(t)$, $\Sigma_g(t)$; the combination yields $\hat{\lambda}(t)$]
  $\hat{\mu}_i(t) = \mu_i - \mu_g(t)$, $\hat{\Sigma}_i(t) = \Sigma_i + \Sigma_g(t)$
• Compare Complexity to PMC in linear spectrum domain:
  [Figure: cepstrum → log spectrum → linear spectrum → combination → log spectrum → cepstrum]
  $\hat{\mu}_i^{lin} = g\,\mu_i^{lin} + \mu_b^{lin}$, $\hat{\Sigma}_i^{lin} = g^2\,\Sigma_i^{lin} + \Sigma_b^{lin}$
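The frame-synchronous cepstrum-domain combination can be sketched as a running estimate of the gain statistics followed by a shift of the speech model. The sketch assumes diagonal covariances stored as variance vectors and uses a standard running-mean/variance recursion; the class and function names are hypothetical and the gain cepstra are made-up values.

```python
import numpy as np

class RunningGainStats:
    """Frame-synchronous mean and variance of the gain cepstra g_1..g_t."""

    def __init__(self, dim):
        self.count = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)     # running sum of squared deviations

    def update(self, g_t):
        self.count += 1
        delta = g_t - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (g_t - self.mean)

    @property
    def var(self):
        return self.m2 / self.count if self.count else np.zeros_like(self.m2)

def combine(mu_i, var_i, stats):
    """mu_hat_i(t) = mu_i - mu_g(t); var_hat_i(t) = var_i + var_g(t)."""
    return mu_i - stats.mean, var_i + stats.var

gains = np.array([[-1.0, 0.5], [-2.0, 0.0], [-3.0, 1.0]])  # assumed g_t values
stats = RunningGainStats(dim=2)
for g_t in gains:
    stats.update(g_t)

mu_hat, var_hat = combine(np.array([10.0, 5.0]), np.array([1.0, 1.0]), stats)
```

Only a count, a mean vector, and a variance accumulator are stored, and the model update is a vector add per Gaussian, in contrast to the domain conversions needed for combination in the linear spectrum domain.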
Acoustic Feature Space Normalization
• Normalize feature space for both training and testing utterances to account for statistical mismatch between training and testing conditions
• Normalization procedures include
– Cepstrum mean normalization: mismatched linear channel effects
– Variance normalization: mismatched global variance
– Cumulative density normalization: mismatched global statistics
– Vocal tract length normalization: mismatched speaker characteristics
• Implementation Issues
– Delay Considerations: All procedures compute normalization parameters using the recognition utterance
– Insufficient Data: All procedures rely on statistical averages and require reasonably long utterances to perform well
– Speech Detection: The normalization parameters should be estimated only from speech observations
Acoustic Feature Space Normalization
• Cepstrum Mean Normalization (CMN)
– Remove effects of fixed linear distortions
– Compute average of cepstrum vectors for both training and testing utterances
– Subtract average from input vectors to produce zero mean cepstrum for utterance
• Variance Normalization (VN)
– Normalize training and test data to common global variance
– Compute variance of cepstrum for training and testing utterances
– Scale input vectors by standard deviation to produce unit variance cepstrum for utterance
• Cumulative Density Function Normalization (CDFN)
– Compensate for environmental mismatch [Dharanipragda et al, 2000]
– Normalize the quantiles computed from training and testing CDFs in filter bank domain [Hilger et al, 2001]
– Estimate parameters, Λ, of non-linear function to normalize filter-bank energies: $Y_t'[k] = T_k(Y_t[k], \Lambda)$, where Λ is chosen to minimize $\sum_i \left(T_k(Q_{k,i}, \Lambda) - Q_i^{train}\right)^2$
[Figure: training CDF and test CDF (0.0 to 1.0) vs. filter bank k energy, with quantiles $Q_{k,1}, \dots, Q_{k,4}$]
• Frequency warping based speaker normalization
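Cepstrum mean and variance normalization as described above can be sketched per utterance in a few lines. This is a batch version that assumes the whole utterance is available and that all frames are speech; the function name and the small `eps` guard are assumptions of the example.

```python
import numpy as np

def cmvn(cepstra, eps=1e-8):
    """Per-utterance mean and variance normalization of a (T, d) array."""
    mean = cepstra.mean(axis=0)
    std = cepstra.std(axis=0)
    return (cepstra - mean) / (std + eps)

rng = np.random.default_rng(0)
clean = rng.normal(loc=3.0, scale=2.0, size=(200, 13))   # synthetic cepstra
channel = np.linspace(-1.0, 1.0, 13)                     # fixed cepstral offset

norm_clean = cmvn(clean)
norm_channel = cmvn(clean + channel)
```

A fixed linear channel adds a constant vector in the cepstrum domain, so the two normalized utterances coincide. Because the statistics come from the utterance itself, the result is only available after the utterance ends, which is the delay consideration raised earlier.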
Frequency Warping Based Speaker Normalization
• Normalize for speaker specific variability by linearly warping frequency axis, $f' = \alpha f$
• Warping is performed by warping the mel-scale filter-bank [Lee and Rose, 1998]
[Figure: signal $x_i(t)$ → FFT → $|\cdot|^2$ → warped filter-banks for α = 0.9, 1.0, 1.1 (frequency axis F (Hz), 100 to 4000) giving $Y_i^{\alpha}$ → cosine transform → observations $O_i^{\alpha=0.9}$, $O_i^{\alpha=1.0}$, $O_i^{\alpha=1.1}$]
• Optimum warping factor found by performing ensemble search to maximize $P(O^{\alpha} \mid \lambda)$: $\hat{\alpha} = \arg\max_{\alpha} P(O^{\alpha} \mid \lambda)$
[Figure: un-warped utterance X is warped with $f' = \alpha_1 f, \dots, f' = \alpha_N f$ to give $O^{\alpha_1}, \dots, O^{\alpha_N}$; the likelihoods $P(O^{\alpha_1} \mid \lambda), \dots, P(O^{\alpha_N} \mid \lambda)$ are estimated and the warping is selected, giving the warped utterance $O^{\hat{\alpha}}$]
• HMM model is trained from warped utterances to obtain a more “compact” model
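The ensemble search over warping factors can be sketched as a grid search. To keep the example self-contained, the spectrum itself is warped by interpolation rather than the filter-bank, and the HMM likelihood is replaced by a unit-variance Gaussian score around a template; both substitutions, and all names below, are assumptions of this illustration.

```python
import numpy as np

def warp_spectrum(spectrum, freqs, alpha):
    """Resample the spectrum at alpha * f (linear warping f' = alpha f)."""
    return np.interp(alpha * freqs, freqs, spectrum)

def select_warp(spectrum, freqs, template, alphas):
    """Return the alpha whose warped spectrum scores best against the template.
    The score is a unit-variance Gaussian log-likelihood up to a constant."""
    scores = [-np.sum((warp_spectrum(spectrum, freqs, a) - template) ** 2)
              for a in alphas]
    return alphas[int(np.argmax(scores))]

def shape(f):
    """Synthetic two-peak spectral envelope used as the 'model'."""
    return (np.exp(-((f - 1000.0) / 200.0) ** 2)
            + 0.5 * np.exp(-((f - 2500.0) / 300.0) ** 2))

freqs = np.linspace(0.0, 4000.0, 257)
template = shape(freqs)
observed = shape(freqs / 1.10)          # a speaker whose peaks sit 10% higher
alphas = np.linspace(0.88, 1.12, 13)    # grid of candidate warping factors

best = select_warp(observed, freqs, template, alphas)
```

The cost grows linearly with the number of candidate warping factors, since every candidate requires its own pass over the utterance.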
Noise Robustness Using Non-Acoustic Sensors
• Glottal Electro-Magnetic Sensors (GEMS) for ASR
– Determine speech onset and voicing
– Reconstructing Vocal Tract Transfer Function for Noisy Speech:
  • Noisy Speech Model: glottal excitation G(f) passes through the vocal tract T(f), and noise N(f) is added to give the noisy signal Y(f)
  • Reconstructing Glottal Excitation: [Figure: GEMS sensor, position signal, sub-glottal vibration, sub-glottal pressure model, reconstructed glottal signal $\hat{G}(f)$]
  • Reconstructing Vocal Tract Transfer Function: $\hat{T}(f) = \dfrac{P_{GY}(f) - P_{GN}(f)}{P_{GG}(f)}$
Robust Modeling Implementation Summary
• Three cost components to continuous mode incremental implementation:
1. Storage for statistics and transformation parameters 2. Computation for transform estimation 3. Computation for applying transform during recognition
• Different procedures have different requirements:
– Adaptation
  • CMA, applied in cepstrum domain: 1. Probabilistic alignment with data, 2. Iterative matrix inversion, 3. Transform cepstrum
– Incremental Mode Environment Compensation
  • Sequential EM algorithm for background estimation: 1. Update background estimate at each frame, 2. Update noise corruption model parameters, 3. Apply noise corruption model to cepstrum
– Feature Space Normalization
  • CMN, VN, CDFN: 1. Accumulate statistics from speech frames, 2. Normalize cepstrum
  • Frequency Warping: 1. Probabilistic alignment with data for each warping function, 2. Update estimate of warping function, 3. Select warped filter-bank in recognition
Robust Modeling Implementation Summary
Technique | Per-Speaker Memory Requirements | Computational Complexity: Transform Estimation | Computational Complexity: Recognition
CMA Adaptation | O(n^3) ~6 x 10^4 | O(I n^4) ~7 x 10^6 | O(T n^2) ~3 x 10^5
Environment Compensation | O(M n^2) ~2 x 10^5 | O(n^3) ~6 x 10^4 | O(T n^2) ~3 x 10^5
Feature Normalization | n ~39 | O(T n^2) ~3 x 10^5 | O(T n) ~8 x 10^3
Frequency Warping | W ~20 | O(T W n) ~1.5 x 10^5 | 0

n: feature vector dimension (39)
T: utterance length (200)
I: number of iterations for matrix inversion (3)
W: number of possible warping functions (20)
M: number of mixtures for speech model (128)
• Memory Requirements: Less than 0.2 Mbytes for all techniques
• Transform Estimation Computation: Not real-time critical
• Recognition Computation: At most one matrix multiply per frame
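The order-of-magnitude figures quoted above follow directly from the stated parameter values, as the short calculation below shows. Only the constants n, T, I, W, and M come from the slide; the dictionary keys are labels chosen for this example.

```python
# Parameter values as given on the slide.
n, T, I, W, M = 39, 200, 3, 20, 128

costs = {
    "n^3": n ** 3,          # CMA memory, environment compensation estimation
    "I n^4": I * n ** 4,    # CMA transform estimation
    "M n^2": M * n ** 2,    # environment compensation memory
    "T n^2": T * n ** 2,    # one matrix multiply per frame over an utterance
    "T W n": T * W * n,     # frequency warping estimation
    "T n": T * n,           # feature normalization in recognition
}

for name, value in costs.items():
    print(f"{name:6s} = {value:>9,d}  (~{value:.1e})")
```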
Part 2 – Algorithms for Robust, Efficient, and Flexible ASR Applications
• Fundamentals of ASR
• Robust Modeling for Mobile Domains
• Robust Acoustic Modeling Algorithms
• Efficient ASR Implementations
– Efficiency Considerations
– Algorithms
• Exploiting Task Constraints
• Survey of ASR Services for Mobile Devices
Efficient Implementations: Outline
• Efficiency considerations
– Computing, memory, power
– Cost considerations are out of scope of this presentation
• Algorithms
– Endpointing and Feature Extraction: frame-rate reduction, fixed-point issues
– Acoustic modeling: model size reduction, fast likelihood computation
– Search: network optimization, implementation issues
Caveat: The effectiveness of the various algorithms is a function of:
– Processor: fixed or floating point, general purpose processor (GPP) or digital signal processor (DSP), memory/cache availability and speed
– Task: vocabulary size, grammar perplexity, model size
– Acoustic condition: signal-to-noise ratio, mismatch between training and testing, telephone or desktop
Efficiency – Computing and Memory
[Figure: Computing panel: word accuracy vs. real-time factor (0 to 2) for digits, command & control, LVR-isolated, and LVR-continuous tasks. Memory panel: static memory vs. dynamic memory. LVR: Large Vocabulary Recognition]
• Accuracy–speed tradeoff
– Acoustic model size, match to the data; language models with low perplexity
– Speaker-independent vs. speaker-dependent
– Training method: discriminative or maximum-likelihood
– Pruning during search
• Memory
– Static: size of acoustic and language models, vocabulary size
– Dynamic: isolated vs. continuous speech input, pruning, lattice generation, preserving segmentation and other side information, model mismatch due to noisy speech
Efficiency – Power Consumption
[Figure: peak MMACs and power (mW/MMAC) vs. year, 1980 to 2000, for the M68000, 80286, 80386, Pentium, Pentium MMX, DSP1, DSP16, DSP16A, DSP32C, DSP1600, and DSP16210. Data from [Bickerstaff et al, 1999]]
Power efficiency is improved by
• Migrating computation-intensive algorithms to the digital signal processors (DSP)
• Reducing the complexity of algorithms
– Intel XScale processor consumes 450 mW at a clock speed of 600 MHz and 40 mW at 150 MHz. A Pentium class processor consumes about 10 W.
• Using fixed-point algorithms
• Dual core processors: general purpose processors (GPP) and DSPs
Computing and Memory Requirements
• What is covered
– Issues related to efficient implementation of ASR algorithms
– Basic principles of techniques that are useful over a range of computing and memory constraints
• What is not covered
– Processor specific optimizations
– Techniques specifically tailored for use under very limited resources
ASR algorithms
• Speech signal to features
– Speech detection
– Issues in feature extraction
– Segment-based features
– Quantization of features
– Other: speech enhancement, beamforming using multi-microphone input, can be computationally demanding
• Acoustic Modeling
– Model size: methods for effective use of parameters, reducing storage requirements
– Computing requirements: local distance computation
– Discriminative training
• Search
– Network optimization: reduced storage and computing
– Efficient search techniques
Speech Signal to Features
• Speech detection improves efficiency
– Usually energy based; computing resources required are minimal (< 5%)
– Memory requirements for caching features could be significant
– Significant overall efficiency can be realized by reducing the number of feature vectors sent to the decoder
– Look-ahead can be used to improve detection performance at the cost of delay
• Issues in feature extraction
– Most common are based on the short-time Fourier transform [Davis et al, 1980]
– Devices often contain fixed-point processors to save power and cost
– Dynamic range of speech signal is large; scaling necessary [Gong et al, 2000]
  • IEEE single-precision floating point: sign 1 bit [31], exponent 8 bits [30-23], mantissa 23 bits [22-00], bias 127
  • Ranges: IEEE floating-point (2^127), 32-bit integer (2^32), 16-bit integer (2^16)
  • Multiply-add operations are fast; divisions and logs require many cycles
– Many chip vendors provide efficient FFT implementations [TIdsp, 2002]
– Filter-bank energies (in the log domain) and cepstrum have reduced dynamic range and can be represented in 16-bit format
– Computing resources: depends on task; usually less than 10% of total time for ASR
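The single-precision layout above (1 sign bit, 8 exponent bits with bias 127, 23 mantissa bits) can be inspected directly with Python's `struct` module. The second helper shows one common way to store a bounded value in 16 bits; the Q15 format used there is an assumption of this example and is not named on the slide, and both function names are hypothetical.

```python
import struct

def float32_fields(value):
    """Split an IEEE single-precision number into (sign, exponent, mantissa)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # bit 31
    exponent = (bits >> 23) & 0xFF    # bits 30-23, stored with bias 127
    mantissa = bits & 0x7FFFFF        # bits 22-0
    return sign, exponent, mantissa

def to_q15(x):
    """Quantize a value in [-1, 1) to a signed 16-bit integer (assumed Q15)."""
    scaled = int(round(x * 32768))
    return max(-32768, min(32767, scaled))

one = float32_fields(1.0)
neg = float32_fields(-2.5)
half_q15 = to_q15(0.5)
clipped = to_q15(1.5)
```

A floating-point value carries its own scale in the exponent field, whereas a 16-bit integer has a fixed range, so values must be scaled (and may be clipped) before they fit.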
Efficient Feature Representation
• Segment-based features [Glass et al, 1996] [Ning et al, 2002]
– Successive frames of speech are highly correlated
– Redundancy is reduced by clustering sequence of feature vectors into acoustically homogeneous segments
– Fixed-dimensional feature vector per segment; typical ratio of segment rate to original feature rate is 3 to 5
– Computing efficiency is realized: reduced segment rate results in savings in likelihood computation; unit transitions constrained to occur at segment boundaries
• Quantization of feature vectors
– Feature frames may need to be buffered for look-ahead or if the computation falls behind real-time
– On-chip memory is relatively small, especially on DSPs
– Cepstral coefficients can be quantized to 4-8 bits with minimal impact on performance [Pearce, 2000]
Computing and Memory Requirements
• Speech signal to features
• Acoustic Modeling
– Model size reduction
  • Decision tree based context models
  • Clustering parameter subspaces
– Local likelihood computation: $P_i(O_t) = \sum_{j=1}^{M} c_{ij}\, f_{ij}(O_t)$, where $f_{ij} = N(\mu_{ij}, \sigma_{ij})$
  • k-d trees
  • VQ-aided Gaussian selection
  • Batching feature vectors
  • Other optimizations
– Discriminative training
• Search
Model Size: Decision Tree Context Models
• Tying [Young et al, 1994]
– Decision tree based state/distribution clustering
– Much less than $N^C$ context models for N phones and a context length of C
– Significant memory savings
– Using context-dependent models often results in a faster recognizer than using context-independent models because of better acoustic match
– These techniques are useful even in limited resource environments [Kao et al, 2000]

Task | # HMM States | # Mixture Distributions
Names | 23631 | 3398
WSJ | 88282 | 15848

[Figure: decision tree for state S with yes/no questions on the left phone context ($p_L \in V_F$, $p_L \in V_B$) and leaf states S10, S11, S20, S21]
Parameter Reduction Methods
• Clustering parameter subspaces [Bocchieri et al, 2001]
– Continuous density HMM: $P_s(O) = \sum_{m=1}^{M} c_{sm}\, N(O; \mu_{sm}, \sigma_{sm}^2)$
– Subspace clustering (K streams, $O = (O_1, O_2, \dots, O_K)$): $P_s(O) = \sum_{m=1}^{M} c_{sm} \prod_{k=1}^{K} N^{tied}(O_k; \mu_{smk}, \sigma_{smk}^2)$
Results on ATIS task: Baseline model has 76154 Gaussians and a WER of 5.2%

K | 4 | 13 | 20 | 39
VQ-size per stream | 256 | 128 | 64 | 32
Word Error Rate (%) | 5.8 | 5.2 | 5.0 | 5.0
Recognition time relative to baseline | 0.42 | 0.44 | 0.50 | 0.67
Parameter reduction (%) | 63 | 70 | 74 | 77
Memory savings (%) | 35 | 18 | 13 | 7.3
Local Distance Computation
• Basic idea: evaluate only Gaussians that are relevant [Ortmanns et al, 1997]
– k-d trees
  • Organize the densities of an HMM state or a collection of states into a binary tree
  • Levels of the tree are split along successive dimensions
  • Densities located on one side of the hyperplane passing through the node are assigned to the left node and the rest to the right node
  • Median of the model mean or prototype vectors is used for the split to keep the trees balanced
  [Figure: 2-D example of a k-d tree with splits X < X3, Y < Y7, Y < Y9, X < X2 partitioning nine densities]
  • A factor of 3 reduction in the number of Gaussians evaluated has been observed on a 20000 word NAB task.
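The tree construction described above (median split along successive dimensions) can be sketched as follows. This is a simplified illustration that indexes Gaussians by their mean vectors only and returns just the leaf reached by an observation; the function names, the dictionary-based tree, and the `leaf_size` parameter are assumptions of the example.

```python
import numpy as np

def build_kdtree(means, indices=None, depth=0, leaf_size=2):
    """Recursively split Gaussian means at the median of successive dimensions."""
    if indices is None:
        indices = np.arange(len(means))
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    dim = depth % means.shape[1]
    values = means[indices, dim]
    threshold = np.median(values)
    left = indices[values < threshold]
    right = indices[values >= threshold]
    if len(left) == 0 or len(right) == 0:   # all values equal: stop splitting
        return {"leaf": indices}
    return {"dim": dim, "threshold": threshold,
            "left": build_kdtree(means, left, depth + 1, leaf_size),
            "right": build_kdtree(means, right, depth + 1, leaf_size)}

def candidates(tree, x):
    """Descend to the leaf containing x and return the Gaussian indices there."""
    while "leaf" not in tree:
        branch = "left" if x[tree["dim"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

rng = np.random.default_rng(1)
means = rng.normal(size=(9, 2))        # nine 2-D densities, as in the figure
tree = build_kdtree(means)
shortlist = candidates(tree, means[4])
```

Only the Gaussians in `shortlist` would be evaluated exactly for this observation; the tree itself is the extra storage and book-keeping that such a selection scheme requires.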
Local Distance Computation
• VQ-aided Gaussian selection [Bocchieri, 1993]
– Acoustic space is divided into a set of vector quantized regions. Model means are clustered using: $d_{ij} = \frac{1}{D} \sum_{k=1}^{D} w(k)\,(\mu_i(k) - \mu_j(k))^2$
– This results in K codewords $\chi_i$. A neighborhood $\nu_i$ of the codeword $\chi_i$ consists of all Gaussians such that: $\frac{1}{D} \sum_{k=1}^{D} \frac{(\chi_i(k) - \mu(k))^2}{\sigma_{avg}^2(k)} \le \Theta$
– Recognition: Find the codeword $\chi_i$ closest to the observation. Compute distances for the components in its neighborhood $\nu_i$
– Reductions in local likelihood calculations:
  • a factor of 3 for WSJ, context-dependent models [Knill et al, 1996]
  • a factor of 9 for ATIS, context-independent models [Bocchieri, 1993]
  • dependent on the number of components per mixture, pruning threshold, and the nature of the task
– Considerable memory overhead for saving the codewords and neighborhoods
– Computing overhead for codeword selection
Local Distance Computation
• Batching frames [Goffin, 2002]
– Buffer a sequence of N feature vectors and compute likelihoods in batch
– Better cache performance and pipelining improve the recognition speed, even if some of the computed likelihoods are not used in the search
– Example
  • Name recognition, vocabulary of 2000 names
  • Experiment run on a Pentium class desktop
  • Accuracy remains unchanged at 98%
  • At batch size of 6, the recognition time reduces by 38% over baseline

Batch size | x real-time | % speedup
1 | 0.112 | 0
2 | 0.086 | 23
3 | 0.078 | 30
4 | 0.073 | 34
5 | 0.071 | 36
6 | 0.069 | 38

• Miscellaneous optimizations
– Assembly coding of the dot product calculation and other processor specific implementations
– Template-based recognizers using dynamic time warping (DTW) for speaker dependent recognition
– Vector quantization of continuous density HMM parameters [Vasilache, 2000]
– Vector quantized HMM: distance computation is replaced by table lookup
Local Distance Computation
• Issues
– Speed up techniques most effective for Gaussian mixtures with large number of components
– Every method for selecting Gaussians to be evaluated has overhead due to
  • Book-keeping: side-information that is kept to implement the selection scheme and extra computation
  • Cache misses: the selected components may not be in the cache
  • Memory usage: codebooks or k-d trees require storage that could be a significant fraction of the size of the acoustic model
– In practice, the effectiveness of the techniques is task dependent
– Batching of frames appears to be effective in many tasks and rarely degrades performance
Discriminative Training
• Maximum Mutual Information (MMI) training [Bahl et al, 1986]
– Maximize the mutual information between the training word sequences and the observation sequences
– Objective function: $F_{MMIE}(\lambda) = \sum_{t=1}^{T} \log \dfrac{p_\lambda(O_t \mid M_{w_t})\, P(w_t)}{\sum_{w} p_\lambda(O_t \mid M_w)\, P(w)}$

Alpha-digits task:
x real time (Pentium III) | Word Accuracy (%) MLE | Word Accuracy (%) MMIE
0.04 | 90 | 93
0.06 | 92 | 95
0.10 | 93 | 96
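The MMI objective sums, over training utterances, the log posterior of the correct word sequence given the acoustics. A minimal sketch of evaluating it from precomputed log-likelihoods is shown below; the denominator is summed over an explicit list of competing hypotheses, which is an assumption suitable only for small tasks, and the function name and the toy numbers are hypothetical.

```python
import numpy as np

def mmi_objective(log_likes, log_priors, correct):
    """F_MMIE = sum_t log [ p(O_t|M_wt) P(w_t) / sum_w p(O_t|M_w) P(w) ].

    log_likes:  (T, V) array of log p(O_t | M_w) for each utterance and hypothesis.
    log_priors: (V,) array of log P(w).
    correct:    (T,) indices of the reference hypothesis w_t.
    """
    joint = log_likes + log_priors
    numer = joint[np.arange(len(correct)), correct]
    denom = np.logaddexp.reduce(joint, axis=1)
    return float(np.sum(numer - denom))

# Two utterances, two competing hypotheses, equal priors (toy values).
log_likes = np.log(np.array([[0.8, 0.2], [0.3, 0.7]]))
log_priors = np.log(np.array([0.5, 0.5]))
correct = np.array([0, 1])

value = mmi_objective(log_likes, log_priors, correct)
```

Unlike the maximum likelihood criterion, the value only improves when the correct hypothesis gains probability relative to its competitors.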
Computing and Memory Requirements
• Signal Processing and feature extraction
• Acoustic Modeling – computing local distances
• Search: Typically about 50% of total time for ASR; more for LVCSR; less for small vocabulary tasks
– Network optimization
– Efficient search techniques
  • Pruning methods
    – Look-ahead based strategy
    – Pruning threshold dependent on the grammar state
  • Multi-pass methods
    – A fast first pass to produce a short list of candidates or a lattice, followed by second pass rescoring with larger acoustic and language models
  • Implementation details
Network Optimization
• Equivalence: Both acceptors associate the same weight to each input string
• Determinization: Each state has at most one transition with any given input label
• Minimization: Equivalent automaton with a smaller number of states/transitions
• Reweighting: Impacts the efficiency of pruning in beam search
Networks for Speech Recognition
• Composition T = R o S: R maps a to A, S maps A to 1, T maps a to 1
• Speech recognition [Riley et al, 1997] [Mohri et al, 1998]
– Context-dependency C pairs context-dependent phones and context-independent phones
– Pronunciation lexicon L pairs sequences of words from a vocabulary with the corresponding pronunciations
– Grammar G restricts word sequences
– Recognition network N = C o L o G is a transducer that maps from context-dependent phones to word sequences restricted to the grammar G
Network Optimization
• Network size reduction
Names: vocabulary size | # states before optimization | # states after optimization (% reduction) | # transitions before optimization | # transitions after optimization (% reduction)
100 | 1,115 | 764 (31) | 1,634 | 1,049 (35)
1K | 10,560 | 6,258 (40) | 15,764 | 8,623 (45)
10K | 102,982 | 50,740 (50) | 155,080 | 72,227 (53)
100K | 1,025,441 | 406,248 (60) | 1,546,542 | 615,713 (60)
1M | 9,617,065 | 2,530,395 (73) | 14,622,594 | 4,529,960 (69)

• Recognition speed

Task | x real-time, unoptimized | x real-time, optimized (% reduction)
1K names | 0.33 | 0.25 (24%)
10K names | 2.21 | 0.84 (62%)
NAB 40K continuous * | 12.5 | 0.7 (94%)
* [Mohri et al, 1998]

• Recognition speed is a function of a number of factors in addition to network size
• Generally, the smaller the network, the faster the recognizer. The effect is greater for continuous speech recognition and large vocabularies.
Efficient Implementations - Summary
• Computing and memory efficiency are relevant even for server-based implementations; power is a primary concern only for embedded implementations
• Cell-phone and chip-set manufacturers such as Nokia, Motorola, Qualcomm, and TI often have proprietary algorithms or modifications to published algorithms that take advantage of intimate knowledge of the processor architecture to enhance performance
• The use of computing cycles can be divided between front end, local distance calculation, and search
• Static memory: acoustic model and network
• Dynamic memory: feature vectors, search
• Signal processing and robustness algorithms can demand significant computing and memory resources
Part 2 – Algorithms for Robust, Efficient, and Flexible ASR Applications
• Fundamentals of ASR
• Robust Modeling for Mobile Domains
• Robust Acoustic Modeling Algorithms
• Efficient ASR Implementations
• Exploiting Task Constraints
• Survey of ASR Services for Mobile Devices
Exploiting Task Constraints
• Form-filling
– Rescoring algorithms for applying inter-field constraints – Experimental study of a name recognition application
• Free-format input
– Issues with the complexity of language models and natural language understanding – User interface issues
• Multimodal input
– Speech and gesture input – Integration of the multiple input modes
Exploiting Task Constraints
Form-filling example [Rose et al, 2001]
1. Recognize first and last names independently
  • Static grammars; simple to use
  • Poor accuracy because constraints that are applicable across fields are not used
2. Switch between precompiled grammars
  • Useful when one of the fields has a small number of options: access to names of employees by location
3. Generate dynamic grammars
  • Recognize last name, generate all first names for the given last name
  • Real-time generation becomes difficult for large grammars
  • Still limited by the accuracy with which the first field is recognized
4. Rescore using lattice outputs
  [Figure: last name utterance UL and first name utterance UF (example: Mike Armstrong) are recognized into lattices, which are concatenated, intersected with the first/last constraint grammar, and rescored]
  • This could be generalized to more than two fields as long as a constraint grammar that applies across all those fields is available
Exploiting Task Constraints
Rescoring
• Lowest cost hypothesis for last name is Ramakrishnan
• Lowest cost hypothesis for first name is Prabakar
• Rescoring
  • Concatenate first name lattice and last name lattice
  • Compose concatenated lattice with the constraint grammar
  • Find best path through the result; yields the correct result Prabakar Balakrishnan
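The effect of rescoring with an inter-field constraint can be illustrated with a small sketch in which each lattice is reduced to a list of hypotheses with costs and the constraint grammar is a set of valid (first, last) pairs. Enumerating pairs stands in for lattice concatenation and composition; the costs, the extra first name "Bhaskar", and the two-entry directory are invented for illustration.

```python
from itertools import product

def rescore(first_hyps, last_hyps, allowed):
    """Return the lowest-cost (first, last) pair permitted by the constraint.

    first_hyps, last_hyps: dict mapping name -> recognition cost.
    allowed: set of valid (first, last) directory entries.
    """
    best = None
    for (f, f_cost), (l, l_cost) in product(first_hyps.items(), last_hyps.items()):
        total = f_cost + l_cost
        if (f, l) in allowed and (best is None or total < best[1]):
            best = ((f, l), total)
    return best

first = {"Prabakar": 1.0, "Bhaskar": 1.5}              # hypothetical costs
last = {"Ramakrishnan": 1.0, "Balakrishnan": 1.2}      # hypothetical costs
directory = {("Prabakar", "Balakrishnan"), ("Bhaskar", "Ramakrishnan")}

independent = (min(first, key=first.get), min(last, key=last.get))
result = rescore(first, last, directory)
```

Taking the first choice of each field independently gives Prabakar Ramakrishnan, which is not a directory entry, while the constrained search recovers Prabakar Balakrishnan.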
Exploiting Task Constraints
• Experimental results
– Vocabulary consists of 2985 last names, 1861 first names, and 3700 unique name entries (first, last)
– Rescoring is as effective as concatenating the utterances and performing recognition using the full name grammar. Prompting for full name may be reasonable here but is not generally applicable to other fields
– Recognizing the first name and last name independently and concatenating the recognition hypotheses to obtain the full name provides poor results

Grammar | Accuracy (%)
Last Name | 74
First Name | 72
Full Name (rescoring) | 90
Full Name Concatenated | 90
Full Name Concatenated First Choices | 49
Free-Format Input
A Calendar Application [Huang et al, 2000]
• Click-to-talk: Meeting at 11 AM with Peter
• Fields: Activity – meeting; Time – 11 AM; Name – Peter
• User interface: Easier to use. User can present information about all the fields of interest in one utterance
• Language modeling: Complex. Class-based models are often used
• Understanding: Robust parsing to handle ill-formed utterances. Large amount of real usage data is necessary for the semantic grammar to achieve good coverage.
[Wang, 2001]
• Dialog: Application logic, confirmation, and error recovery.
• User Interface: Free-format input could be used to fill in the fields. Errors can be corrected by the user by click-to-talk on a specific field.
Multimodal Interfaces
• Combine spoken and graphical interaction [Johnston et al, 2000] [Johnston et al, 2002]
  • Input can be speech, gesture, or dynamic combinations of the two
  • System responses combine speech with dynamic graphics
• Queries: Simultaneous speech and pen
  • Speech: Email this person and that organization
  • Pen: Click on the icon for this person and that organization
• Technology: Multimodal finite state transducers
  • Speech and gesture are parsed and integrated by a single weighted finite-state device
  • Mutual compensation among input modes: gestural input can dynamically alter the language model for speech recognition
  • Example alignment:
    Speech: email this person and that organization
    Gesture: G(person) G(org)
    Meaning: email([person, org])
Exploiting Task Constraints - Summary
• Form-filling
– Rescoring algorithms for applying inter-field constraints improve recognition accuracy. – Higher recognition accuracy can be translated to more effective applications through good user interface design.
• Free-format input
– Such input is more difficult to handle because of the wide variety of ways in which users can present even simple information. Interesting research is underway to handle this complexity.
• Multimodal input
– Speech and gesture inputs can be combined effectively to mutually compensate for errors in either channel.
• Visual output
– A display, however impoverished, is assumed in all these applications. Visual output is parallel and persistent. This simplifies application design and also avoids problems related to speech playback.
Part 2 – Algorithms for Robust, Efficient, and Flexible ASR Applications
• Fundamentals of ASR
• Robust Modeling for Mobile Domains
• Robust Acoustic Modeling Algorithms
• Efficient ASR Implementations
• Exploiting Task Constraints
• Survey of ASR Services for Mobile Devices
Survey of ASR Services for Mobile Devices
• AT&T Labs Directory Retrieval Application [Rose et al., 2001]
Mapping menu-based desktop functionality to mobile devices:
– The display sets the context for the dialog
– The user selects a voice field, speaks the value of the field, and sees the result on the display
– Network-based ASR implementation
Survey of ASR Services for Mobile Devices
• SpeechWorks walking directions application [Pieraccini et al., 2002]
– Scenario involves users walking in a city and trying to find their way to a given destination
– Users can select a city, select a start or destination address, and control the map display
– Input is provided via both speech and stylus
– Network-based ASR implementation
Survey of ASR Services for Mobile Devices
• Microsoft Multimodal interactive pad (MiPad) [Wang, 2001]
– Example of the “Appointment Card” from MiPad
– Accepts both free-format command utterances containing cross-field information and constrained input for individual fields
– Network-based ASR implementation
Survey of ASR Services for Mobile Devices
• AT&T Labs MATCH Multimodal dialog system [Johnston et al., 2002]
• Finding restaurants
– Speech: “show inexpensive italian places in chelsea”
– Multimodal: “Are there any cheap Italian places in this neighborhood?” (circle)
• Tablet PC based ASR implementation
References
[Afify & Siohan, 1995] Afify, M., & Siohan, O. 1995. Sequential noise estimation with optimal forgetting for robust speech recognition. Proc. Int. Conf. on Acoust., Speech, and Sig. Processing, April, 229-232.
[Bahl et al., 1986] Bahl, L. R., Brown, P. F., de Souza, P. V., & Mercer, R. L. 1986. Maximum Mutual Information Estimation of hidden Markov model parameters for speech recognition. Proc. ICASSP.
[Besacier et al., 2001] Besacier, L., Bergamini, C., Vaufreydaz, D., & Castelli, E. 2001. The effect of speech and audio compression on speech recognition performance. IEEE Multimedia Signal Processing Workshop.
[Bickerstaff et al., 1999] Bickerstaff, M., Nicol, C., & Ackland, B. 1999. Cool Chips Tutorial: Low Power DSPs for Wireless Infrastructure. 32nd Annual International Symposium on Microarchitecture.
[Bocchieri, 1993] Bocchieri, E. 1993. Vector quantization for efficient computation of continuous density likelihoods. Proc. Eurospeech.
[Bocchieri & Mak, 2001] Bocchieri, E., & Mak, B. 2001. Subspace Distribution Clustering Hidden Markov Model. IEEE Transactions on Speech and Audio Processing, March, 264-275.
[Bocchieri et al., 1995] Bocchieri, E., Riccardi, G., & Anantharaman, J. 1995. The 1994 AT&T ATIS Chronus Recognizer. Proc. Spoken Language Syst. Tech. Workshop, January, 265-268.
[Comerford et al., 2001] Comerford, L., Frank, D., Gopalakrishnan, P., Gopinath, R., & Sedivy, J. 2001. The IBM personal speech assistant. Proc. Int. Conf. on Acoust., Speech, and Sig. Processing, May.
[Compernolle, 1989] Compernolle, D. Van. 1989. Noise adaptation in a hidden Markov model speech recognition system. Computer Speech and Language, 3, 151-167.
[Davis & Mermelstein, 1980] Davis, S. B., & Mermelstein, P. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. on Acoust. Speech and Signal Proc., ASSP-28(4), 357-366.
[Dharanipragda et al., 2000] Dharanipragda, S., & Padmanadhan, M. 2000. A nonlinear unsupervised adaptation technique for speech recognition. Proc. Int. Conf. on Spoken Language Processing, October.
[Huang, 2000] Huang, X., et al. 2000. MiPad: A next generation PDA prototype. Proc. ICSLP.
[ETSI, 2002] STQ Aurora Working Group. http://portal.etsi.org/.
[Fisher, 2002] Fisher, P. 2002. Alternative speech sensors for military applications. DARPA Multi-modal Speech Recognition Workshop, June.
[Flanagan, 1972] Flanagan, J. L. 1972. Speech Analysis, Synthesis, and Perception. 2nd edn. New York: Springer-Verlag.
[Gajic & Rose, 2000] Gajic, B., & Rose, R. C. 2000. Hidden Markov Model Environmental Compensation for Automatic Speech Recognition on Hand-Held Mobile Devices. Proc. Int. Conf. on Spoken Language Processing, October.
[Gales, 1998] Gales, M. J. F. 1998. Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech and Language, 12, 75-98.
[Gales & Young, 1996] Gales, M. J. F., & Young, S. J. 1996. Robust continuous speech recognition using parallel model combination. IEEE Trans. on Speech and Audio Proc., 4(5), 352-359.
[Glass et al., 1996] Glass, J., Chang, J., & McCandless, M. 1996. A probabilistic framework for feature-based speech recognition. Proc. ICSLP.
[Goffin, 2002] Goffin, V. Personal communication.
[Gong & Kao, 2000] Gong, Y., & Kao, Y. H. 2000. Implementing a high accuracy speaker-independent continuous speech recognizer on a fixed-point DSP. Proc. ICASSP.
[Hanson & Wakita, 1987] Hanson, B. A., & Wakita, H. 1987. Spectral slope distance measures with linear prediction analysis for word recognition in noise. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35(7), 968-973.
[Hilger & Ney, 2001] Hilger, F., & Ney, H. 2001. Quantile based histogram equalization for noise robust speech recognition. Proc. European Conf. on Speech Communications, September, 1135-1138.
[Jelinek, 1999] Jelinek, F. 1999. Statistical methods for speech recognition. Cambridge, MA: MIT Press.
[Johnston & Srinivas, 2000] Johnston, M., & Srinivas, B. 2000. Finite state multimodal parsing and understanding. Proc. COLING.
[Johnston et al., 2002] Johnston, M., Bangalore, S., Vasireddy, G., Stent, A., Ehlen, P., Walker, M., Whittaker, S., & Maloor, P. 2002. MATCH: An architecture for multimodal dialog systems. Proceedings of 40th Anniv. Mtg. of Assoc. for Computational Linguistics, June.
[Junqua & Haton, 1996] Junqua, J. C., & Haton, J. P. 1996. Robustness in Automatic Speech Recognition - Fundamentals and Applications. Boston: Kluwer.
[Kang, 2001] Kang, H. G. 2001. Activities on Speech Coding Standards. Proc. of Northeastern Regional Conference, March.
[Kao & Rajasekaran, 2000] Kao, Y. H., & Rajasekaran, P. K. 2000. A low cost dynamic vocabulary speech recognition on a GPP-DSP system. Proc. ICASSP.
[Kim & Cox, 2000] Kim, H. K., & Cox, R. V. 2000. Bit-stream based feature extraction for wireless speech recognition. Proc. Int. Conf. on Acoustics, Speech, and Sig. Processing.
[Kim & Rose, 2002] Kim, H. K., & Rose, R. C. 2002. Cepstrum-domain model combination based on decomposition of speech and noise for noisy speech recognition. Proc. Int. Conf. on Acoust., Speech, and Sig. Processing, April, 209-212.
[Kim, 1998a] Kim, N. S. 1998a. Nonstationary environment compensation based on sequential estimation. IEEE Signal Processing Letters, 5(3), 57-59.
[Kim, 1998b] Kim, N. S. 1998b. Statistical linear approximation for environment compensation. IEEE Signal Processing Letters, 5(1), 57-59.
[Knill et al., 1996] Knill, K. M., Gales, M. J. F., & Young, S. 1996. Use of gaussian selection in large vocabulary continuous speech recognition using HMMs. Proc. ICSLP.
[Lee, 1989] Lee, K. F. 1989. Automatic Speech Recognition. Norwell, Mass.: Kluwer.
[Lee & Rose, 1998] Lee, L., & Rose, R. C. 1998. A frequency warping approach to speaker normalization. IEEE Trans. on Speech and Audio Processing, 6(January).
[Leggetter & Woodland, 1994] Leggetter, C. J., & Woodland, P. C. 1994 (March). Speaker adaptation of HMMs using linear regression. Tech. rept. CUED/F-INFENG/TR181. Cambridge University Engineering Department.
[Ljolje et al., 1995] Ljolje, A., Riley, M. D., Hindle, D. M., & Pereira, F. 1995. The AT&T 60,000 Word Speech-To-Text System. Proc. ARPA Spoken Language Systems Technology Workshop, January.
[Miller, 1947] Miller, G. A. 1947. The masking of speech. Psychological Bulletin, 44(March), 105-129.
[Mohri & Riley, 1998] Mohri, M., & Riley, M. 1998. Network optimization for large vocabulary speech recognition. Speech Communication, 25(3).
[Moreno et al., 1995] Moreno, P. J., Raj, B., & Stern, R. M. 1995. A vector Taylor series approach for environment-independent speech recognition. Proc. Int. Conf. on Acoust., Speech, and Sig. Processing, April, 733-736.
[Ng et al., 2000] Ng, L. C., Burnett, G. C., Holzrichter, J. F., & Gable, T. J. 2000. Denoising of human speech using combined acoustic and EM sensor signal processing. Proc. Int. Conf. on Acoust., Speech, and Sig. Processing, April, 229-232.
[Ning et al., 2002] Ning, B., Garudadri, H., Chang, C., DeJaco, A., Qi, Y., Malayath, N., & Huang, W. 2002. A robust speech recognition system embedded in CDMA cellular phone chipsets. Proc. ICASSP.
[Ortmanns et al., 1997] Ortmanns, S., Ney, H., & Firzlaff, T. 1997. Fast likelihood computation methods for continuous mixture densities in large vocabulary speech recognition. Proc. Eurospeech.
[Pearce, 2000a] Pearce, D. 2000a. Enabling New Speech Driven Services for Mobile Devices: An Overview of the ETSI Standards Activities for Distributed Speech Recognition Front-ends. AVIOS 2000: The Speech Applications Conference.
[Pearce, 2000b] Pearce, D. 2000b. An overview of the ETSI standards activities for distributed speech recognition front-ends. Applied Voice Input/Output Society Conference AVIOS2000, May.
[Pieraccini et al., 2002] Pieraccini, R., Carpenter, B., Woudenberg, E., Caskey, S., Springer, S., Bloom, J., & Phillips, M. 2002. Multi-modal spoken dialog with wireless devices. ISCA Tutorial and Research Workshop.
[Rabiner & Juang, 1993] Rabiner, L. R., & Juang, B. H. 1993. Fundamentals of speech recognition. Englewood Cliffs, N. J.: Prentice Hall.
[Raj et al., 2001] Raj, B., Migdal, J., & Singh, R. 2001. Distributed speech recognition with codec parameters. Proc. Automatic Speech Recognition and Understanding.
[Riley et al., 1997] Riley, M., Pereira, F. C. N., & Mohri, M. 1997. Transducer composition for context-dependent network expansion. Proc. Eurospeech.
[Rose et al., 1994] Rose, R. C., Hofstetter, E. M., & Reynolds, D. A. 1994. Integrated models of signal and background with application to speaker identification in noise. IEEE Trans. on Speech and Audio Proc., 2(2).
[Rose et al., 2001] Rose, R. C., Parthasarathy, S., Gajic, B., Rosenberg, A. E., & Narayanan, S. 2001. On the Implementation of ASR Algorithms for Hand-Held Wireless Mobile Devices. Proc. Int. Conf. on Acoust., Speech, and Sig. Processing, May.
[Sagayama et al., 1997] Sagayama, S., Yamaguchi, Y., Takahashi, S., & Takahashi, J. 1997. Jacobian approach to fast acoustic model adaptation. Proc. Int. Conf. on Acoust., Speech, and Sig. Processing, April, 396-403.
[Sukkar et al., 2002] Sukkar, R. A., Chengalvarayan, R., & Jacob, J. J. 2002. Unified speech recognition for landline and wireless environments. Proc. Int. Conf. on Acoust., Speech, and Sig. Processing, May, 293-296.
[TIdsp, 2002] Signal Processing Libraries. http://dspvillage.ti.com.
[Vasilache, 2000] Vasilache, M. 2000. Speech recognition using HMMs with quantized parameters. Proc. ICSLP.
[W3CMMI, 2002] Multimodal Interaction Activity. http://www.w3c.org/2002/mmi/.
[Viikki, 2001] Viikki, O. 2001. ASR in Portable Wireless Devices. Proc. IEEE ASRU Workshop, December.
[Wang, 2001] Wang, Y.-Y. 2001. Robust language understanding in MiPad. Proc. Eurospeech.
[Yadid & Livnat, 2002] Yadid, T., & Livnat, D. Private communication.
[Young et al., 1994] Young, S., Odell, J., & Woodland, P. 1994. Tree-based state tying for high accuracy acoustic modelling. Proc. ARPA Workshop on Human Language Technology.
[Zahorian, 1979] Zahorian, S. 1979 (August). Principle Components Decomposition of Speech. Ph.D. thesis, University of Michigan.