Using SONIC to build a
speech recognizer
Pellom & Hacioglu, "Sonic: The University of
Colorado Continuous Speech Recognizer," Center for
Spoken Language Research Technical Report
TR-CSLR-2001-01, U. Colorado, 2003
Presented by Yang Shao, CIS788K04 Wi04
Performance on standard tasks
On a 1.7GHz Pentium 4
Procedures
Preparation
– identify the goal;
– decide the recognition unit: phoneme, syllable,
word, etc.;
– prepare the corpus: training, development, and
test sets;
– label part of the training data (optional);
– etc.
Procedures cont.
\[ \hat{W} = \arg\max_{W} \; p(O \mid W)\, P(W) \]
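The decision rule above picks the word sequence W that maximizes p(O|W) P(W). As a minimal sketch, the Python below applies that rule in the log domain to a short list of candidate hypotheses. The candidates and all scores are made-up illustrative numbers, and `decode` is a hypothetical helper, not part of Sonic; a real decoder searches a very large hypothesis space rather than enumerating it.

```python
import math


def decode(candidates):
    """Return the word sequence W maximizing log p(O|W) + log P(W).

    `candidates` maps each hypothesis (a tuple of words) to a pair
    (acoustic log-likelihood, language-model log-probability).
    """
    return max(candidates, key=lambda w: sum(candidates[w]))


# Made-up scores, for illustration only.
candidates = {
    ("recognize", "speech"): (-120.0, math.log(1e-3)),
    ("wreck", "a", "nice", "beach"): (-118.5, math.log(1e-6)),
}

best = decode(candidates)
print(" ".join(best))
```

In this toy case the second hypothesis has the better acoustic score, but the language model prior outweighs it.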
Training
– Acoustic model training;
– Language model training;
Adaptation
– Speaker adaptation (VTLN, MLLR, MAP);
– Environment adaptation (mismatch between training
and testing conditions);
Testing
Acoustic model training
Feature extraction, followed by iterative steps of
Viterbi state-based alignment and model estimation;
Outputs a set of decision-tree state-clustered HMMs;
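The alignment step can be illustrated with a small sketch of Viterbi forced alignment: given per-frame log-likelihoods for a left-to-right chain of states, find the best assignment of frames to states. This is a simplified illustration, not Sonic's trainer: `viterbi_align` is a hypothetical name, transition probabilities are ignored (each frame may only stay in its state or advance by one), and the scores are made up.

```python
def viterbi_align(log_lik):
    """Align frames to a left-to-right state chain.

    log_lik[t][s] is the log-likelihood of frame t under state s.
    The path must start in state 0 and end in the last state, and at
    each frame it either stays in the same state or moves to the next.
    Returns a list holding one state index per frame.
    """
    n_frames, n_states = len(log_lik), len(log_lik[0])
    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_lik[0][0]

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            score[t][s] = best + log_lik[t][s]

    # Trace back from the final state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Made-up scores: 5 frames, 3 states.
log_lik = [
    [-1.0, -5.0, -5.0],
    [-1.0, -4.0, -5.0],
    [-5.0, -1.0, -5.0],
    [-5.0, -1.0, -4.0],
    [-5.0, -5.0, -1.0],
]
alignment = viterbi_align(log_lik)
print(alignment)
```

In training, the frames assigned to each state would then be used to re-estimate that state's parameters, and the two steps would repeat.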
Feature extraction (PMVDR)
Perceptual Minimum Variance Distortionless
Response cepstral coefficients;
– fea [options] speechfile.raw featurefile.fea
Dynamic features;
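Dynamic features describe how the static coefficients change over time. The sketch below uses the common regression formula for first-order (delta) features; whether Sonic uses exactly this formula or window size is an assumption, and `delta` is a hypothetical helper name.

```python
def delta(frames, window=2):
    """First-order dynamic (delta) features by linear regression.

    d_t = sum_{k=1..K} k * (c_{t+k} - c_{t-k}) / (2 * sum_{k=1..K} k^2)

    `frames` is a list of feature vectors (lists of floats). Frames
    beyond either edge are replaced by the first or last frame.
    """
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            num = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


# A one-dimensional ramp: interior frames have slope 1.
static = [[0.0], [1.0], [2.0], [3.0], [4.0]]
deltas = delta(static)
print(deltas)
```

Applying the same function to the delta features would give second-order (acceleration) features.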
Language Model I
Finite-state grammar expressed as a regular
expression;
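To illustrate the idea of a grammar written as a regular expression, the sketch below defines a toy command grammar with Python's `re` module and checks which word strings it accepts. The grammar itself is hypothetical, and Python regex syntax is used only for illustration; Sonic's own grammar notation is not shown in these slides.

```python
import re

# Hypothetical toy grammar: (call | dial) <digit>+ [please]
DIGIT = r"(?:zero|one|two|three|four|five|six|seven|eight|nine)"
GRAMMAR = re.compile(rf"(?:call|dial)(?: {DIGIT})+(?: please)?")


def accepts(sentence):
    """Return True if the whole word string is generated by the grammar."""
    return GRAMMAR.fullmatch(sentence) is not None


for s in ["call one two three", "dial nine please", "call", "please call one"]:
    print(s, "->", accepts(s))
```

A grammar like this restricts the recognizer to a small set of legal word sequences, which suits command-style tasks.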
Language model II
Language model:
– P(W) = P(w1, w2, …, wm) gives the probability of a
given word sequence;
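– expanded by the chain rule as
\[ P(W) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \]
– N-gram approximation: each word depends only on the
previous N-1 words,
\[ P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \ldots, w_{i-1}) \]
– Calculated (maximum-likelihood estimate) from training
counts C(·) as
\[ P(w_i \mid w_{i-N+1}, \ldots, w_{i-1}) = \frac{C(w_{i-N+1}, \ldots, w_i)}{C(w_{i-N+1}, \ldots, w_{i-1})} \]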
Bigram example: P(Mary loves that person) =
P(Mary|<s>) P(loves|Mary) P(that|loves) P(person|that),
where <s> marks the start of the sentence
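The bigram example can be worked through in code. The sketch below estimates bigram probabilities from counts on a three-sentence toy corpus and scores the example sentence. The corpus is made up, `<s>` is an assumed notation for the sentence-start marker, and no smoothing is applied, so unseen bigrams get probability zero; a practical language model would smooth or back off.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their histories, with <s> as sentence start."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        history_counts.update(words[:-1])
        bigram_counts.update(zip(words, words[1:]))
    return history_counts, bigram_counts


def sentence_prob(sentence, history_counts, bigram_counts):
    """P(W) as a product of maximum-likelihood bigram probabilities."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        if history_counts[prev] == 0:
            return 0.0
        prob *= bigram_counts[(prev, cur)] / history_counts[prev]
    return prob


# Made-up toy corpus.
corpus = [
    "Mary loves that person",
    "Mary loves that dog",
    "John loves Mary",
]
history_counts, bigram_counts = train_bigram(corpus)
p = sentence_prob("Mary loves that person", history_counts, bigram_counts)
print(p)
```

On this corpus the four factors are 2/3, 2/2, 2/3 and 1/2, so the product is 2/9.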
Recognition overview
Speech-enabled applications can be built by
calling functions within the Sonic API.
– Sonic_batch -c config.txt [-l]
Configuration file
A text file listing parameters, each followed by its
arguments, that establish the basic settings of the
recognizer:
– location of the acoustic model files;
– location of the language model file;
– location of the pronunciation lexicon;
– recognizer settings such as search beams, pruning
settings, etc.;
– (optional) a pointer to a control file containing a list of
audio files to process.
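As a rough sketch of what "parameters followed by arguments" might look like in practice, the Python below parses such a text file into a dictionary. The line layout (one parameter per line, whitespace-separated arguments, `#` comments), the parameter names and the paths are all hypothetical placeholders and are NOT Sonic's real keywords; consult the Sonic manual for the actual syntax.

```python
def parse_config(text):
    """Parse 'parameter arg1 arg2 ...' lines into a dict of argument lists."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # '#' comments are an assumption
        if not line:
            continue
        name, *args = line.split()
        settings[name] = args
    return settings


# Hypothetical parameter names and paths, not Sonic's real keywords.
EXAMPLE = """
# where the models live
acoustic_model_dir  models/am
language_model      models/lm.txt
lexicon             models/lexicon.txt
search_beam         200
"""

config = parse_config(EXAMPLE)
print(config)
```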
Components
Audio file format:
– 16-bit linear PCM format (raw);
– sampling rate is configurable (default 8 kHz);
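Because raw files carry no header, the reader has to know the sample format in advance. The sketch below loads headerless 16-bit linear PCM with Python's standard `array` module and reports the duration. It assumes mono audio and defaults to little-endian byte order, neither of which is stated in the slides; `read_raw_pcm` is a hypothetical helper, not a Sonic function.

```python
import array
import sys


def read_raw_pcm(path, sample_rate=8000, big_endian=False):
    """Read headerless 16-bit linear PCM (mono assumed).

    Returns (samples, duration_in_seconds), where samples is an
    array of signed 16-bit integers.
    """
    samples = array.array("h")  # signed 16-bit integers
    with open(path, "rb") as f:
        samples.frombytes(f.read())
    # Swap bytes if the file's byte order differs from the machine's.
    if big_endian != (sys.byteorder == "big"):
        samples.byteswap()
    return samples, len(samples) / sample_rate
```

For example, `read_raw_pcm("speechfile.raw")` on an 8 kHz file of 16,000 bytes would return 8,000 samples and a duration of 1.0 second.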
Phoneme configuration file format
– supports the 55-phoneme symbol set adopted by the
CMU Sphinx-II speech recognizer.
Components cont.
LM format
– supports up to 4-gram language models
Pronunciation lexicon format
Acoustic model format
– uses binary files produced by the trainer;
– .-, ex. AA.1-l;
Discussion
Unlike HTK, the trainer code estimates
models for one base phone at a time.
Potential problem?