Comparison of the SPHINX and HTK Frameworks Processing the AN4 Corpus
Two major frameworks exist that are widely accepted and used in the speech processing domain: CMU Sphinx,
developed by Carnegie Mellon University, and HTK (the Hidden Markov Model Toolkit), developed by Cambridge
University. Both frameworks can be used to develop, train, and test a speech model from existing corpus speech
utterance data using hidden Markov modeling techniques. This project will provide a detailed comparison of both
frameworks in two major phases.
The first phase will be a comparison of the functional and performance characteristics of Sphinx and HTK. The
AN4 Corpus training data will be used to train a recognition model. The recognizer for each framework will be
run against the test data. This procedure was already completed using Sphinx for the final homework
assignment. The process will be adapted as closely as possible for the HTK toolset, and the detailed steps
developed for each phase will be presented. The goal will be to generate a model that most closely resembles
the one generated by Sphinx. The following performance metrics and characteristics will then be defined and
measured from the results:
- Decoder time to completion
- Decoder accuracy at the sentence level
- Decoder accuracy at the word level
- Types and quantities of decoding errors encountered
- Notable error trends
- Framework code footprint size
- Memory requirements of the recognizer at runtime
The second major phase of the study will compare other features of Sphinx and HTK. The following features will
be compared:
- Coded data feature format support
- Acoustic modeling algorithm support
- Language modeling algorithm support
- Overall ease of training and decoding corpora
- Notable features of the software baseline of each toolkit
- Operating system support
- Available documentation and community support
- Licensing and usage rights
- Future development plans
The conclusion will also provide a reference to a feature comparison matrix, outlining the superior toolkit for
each focus area.
Framework Overviews and History
The Sphinx system consists of a training component (SphinxTrain) as well as a set of evolving speech decoders
(Sphinx 1-4, plus PocketSphinx, a decoder designed specifically for embedded environments). The Sphinx
project has been supported by programs from DARPA, IBM, and Sun Microsystems. Notable applications that
use Sphinx include RoomLine, a conference room reservation system at CMU, and Let’s Go, a spoken dialog
system in use at Pittsburgh’s transit system.
The HTK framework was originally developed in 1989 by the Speech Vision and Robotics Group at Cambridge
University. While dubbed a general-purpose HMM toolkit, its main application area has been speech
recognition. HTK was purchased by Entropic Laboratories in 1993 and then acquired by Microsoft when it
bought Entropic in 1999. The HTK source code was subsequently licensed back to Cambridge University for
further development.
Acoustic Model Training/Testing Procedure for HTK
The AN4 corpus is a training/testing database collected from census activities conducted by CMU in 1991. It
contains alphanumeric utterances as well as a limited set of command words, for a total of 948 training
utterances and 130 test utterances. All recordings are 16-bit linear PCM sampled at 16 kHz.
In order to make a valid comparison of the performance of the Sphinx and HTK recognizers, the same type of
model that was created for Sphinx will be trained and tested with HTK. The acoustic model has the following
characteristics (the per-state mixture scoring that the first item implies is sketched after the list):
1. 8 Gaussians per HMM state
2. Context-dependent tri-phone state models
3. Tied states
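To make the first characteristic concrete, the score a decoder assigns to a feature frame in one HMM state is a weighted sum over that state's Gaussians. The following is a minimal sketch, assuming diagonal covariances and the 39-element feature vectors used in this study; the function name and array shapes are illustrative, not part of either toolkit.

```python
import numpy as np

def state_log_likelihood(frame, weights, means, variances):
    """Log emission likelihood of one 39-element feature frame under an
    8-component diagonal-covariance Gaussian mixture (one HMM state).
    frame: (39,); weights: (8,); means, variances: (8, 39)."""
    diff = frame - means
    # log N(frame | mean_k, diag(var_k)) for each of the 8 components
    log_gauss = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances).sum(axis=1)
    # mixture: log sum_k w_k * N_k, computed stably in the log domain
    return np.logaddexp.reduce(np.log(weights) + log_gauss)
```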
HTK provides a living document describing its programs, called the “HTKBook” (see the references below).
Section II of this reference is a step-by-step guide to creating a toy training database using a very simple word
grammar and manually recorded utterances. This procedure will be modified to process the pre-recorded and
transcribed data that AN4 provides. The following flowchart outlines the major steps in this process (it can be
compared to the flowchart provided in the Sphinx tutorial).
The procedure that can be used as a tutorial for training and testing the AN4 database with HTK 3.0 is provided
here:
Tutorial – HTKTrainingDecoding_tutorial.doc
The following link refers to the environment directory for the entire tutorial. It contains the original NIST sphere
and processed MFCC data, the HMMs at each stage of the model-training phase, Perl scripts authored to support
translating various AN4 nuances into the HTK-preferred formats (i.e., translating transcriptions, phone lists, etc.),
and all HTK configuration files used.
HTK Tutorial Directory – htktut
Training/Decoding Result Comparison
The following results were achieved decoding the test data from the AN4 corpus. They were obtained on a PC
running Windows XP and Cygwin, with a 2.6 GHz Pentium 4 processor and 2 GB of system RAM.
Metric                        Sphinx3    HTK
Peak Memory Usage (MB)        8.2        5.9
Time to Completion (sec)      63         93
Sentence Error Rate (%)       59.2       69.0
Word Error Rate (%)           21.3       9.0
Word Substitution Errors      92         92
Word Insertion Errors         71         154
Word Deletion Errors          2          0
It is very interesting to note that each framework made the same number of word substitution errors. It also
seems that the main source of errors for HTK was the insertion of erroneous words. This may have been due to
the more exploratory process by which the training model was developed, as opposed to the predefined and
tested tutorial provided with Sphinx. However, the HTK decoder did not make any deletions, which gave it an
advantage in the overall word error rate. Also, while HVite (the HTK decoder program) did use less memory
during decoding, the time difference in running the test set is significant at 30 seconds.
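For reference, the standard formulas relating the raw error counts above to the rates reported in the table are sketched below. The function names are illustrative, and the reference word and sentence totals of the AN4 test set are not restated here.

```python
def word_error_rate(subs, ins, dels, n_ref_words):
    """WER (%): all word-level errors divided by the number of reference words."""
    return 100.0 * (subs + ins + dels) / n_ref_words

def sentence_error_rate(n_sentences_with_errors, n_sentences):
    """SER (%): fraction of test sentences containing at least one error."""
    return 100.0 * n_sentences_with_errors / n_sentences
```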
Front-End Coded Data Feature Format Support
The Sphinx toolkit provides the tool wave2feat to convert Microsoft Wave, NIST Sphere, or raw wave data files
to MFCC with a limited set of configuration parameters. However, the Sphinx trainer and decoder are
compatible with many other data formats, which must be generated outside of the toolkit. Sphinx-4 now contains
front-end capabilities to process both MFCC and PLP cepstral encoded data.
The HCopy tool provides a wealth of input/output combinations of model data and front-end features. HTK can
natively handle waveform files with many popular header formats (NIST, TIMIT, headerless, Microsoft, etc.).
Any type of conversion is supported: wave-to-feature, feature-to-feature, and feature-to-wave. The main feature
data types included are Linear Prediction Coefficients (LPC), Mel-Frequency Cepstral Coefficients (MFCC), and
Perceptual Linear Prediction (PLP) coefficients. First, second, and third differential coefficients can also be
included in the output feature vectors (the tutorial used 39-element vectors: 13 static coefficients plus 13 first-
and 13 second-order differentials). HTK utilizes its own feature-vector file format to optimize feature data for
native processing, allowing any of the feature types to be stored in one format. HTK also offers the option of
saving compressed feature-vector data, utilizing a vector quantization lookup table. Tables in HTKBook
illustrate the compatible conversions as well as the supported parameters. All data feature types are then
directly compatible with the other applications in the toolkit.
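As a concrete illustration, the conversion used in the tutorial (NIST sphere audio to 39-element MFCC vectors) can be driven by writing an HCopy configuration and a script file. The sketch below follows common HTKBook-style settings; the file names are examples from this project's layout, not fixed HTK conventions.

```python
import subprocess, textwrap

# Typical HTKBook-style coding parameters (values are illustrative)
config = textwrap.dedent("""\
    SOURCEFORMAT = NIST
    TARGETKIND   = MFCC_0_D_A  # 12 cepstra + c0, plus deltas and accelerations = 39
    TARGETRATE   = 100000.0    # 10 ms frame shift (HTK time units are 100 ns)
    WINDOWSIZE   = 250000.0    # 25 ms analysis window
    USEHAMMING   = T
    PREEMCOEF    = 0.97
    NUMCHANS     = 26
    NUMCEPS      = 12
""")
with open("hcopy.conf", "w") as f:
    f.write(config)

# codetrain.scp lists one "source.sph target.mfc" pair per line
subprocess.run(["HCopy", "-C", "hcopy.conf", "-S", "codetrain.scp"], check=True)
```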
Acoustic Modeling Support
Both toolkits use the same types of procedures for acoustic HMM training. HTKBook gives more detail on the
inner workings of some of these algorithms. First, the mean and variance of every Gaussian component in the
HMMs are initialized to the global mean and variance of the training data (the "flat-start" scheme). The model
parameters are then refined using Baum-Welch re-estimation. A modified form called embedded training trains
all models in parallel and proceeds as follows (a toy sketch of one such pass follows the list):
1. Allocate and zero accumulators for all parameters of all HMMs.
2. Get the next training utterance.
3. Construct a composite HMM by joining in sequence the HMMs corresponding to the symbol transcription of
the training utterance.
4. Calculate the forward and backward probabilities for the composite HMM. The inclusion of intermediate
non-emitting states in the composite model requires some minor changes to the computation of the forward and
backward probabilities; the details are given in HTKBook chapter 8.
5. Use the forward and backward probabilities to compute the probabilities of state occupation at each time
frame and update the accumulators in the usual way.
6. Repeat from step 2 until all training utterances have been processed.
7. Use the accumulators to calculate new parameter estimates for all of the HMMs.
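The accumulate-then-re-estimate structure of this loop can be illustrated with a toy discrete-observation HMM. Real HTK models use Gaussian mixtures and scaled or log-domain arithmetic to avoid underflow, so the sketch below is a simplification with hypothetical names, not HTK code.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Per-frame state occupation probabilities (gamma) and summed
    transition expectations (xi) for one utterance (steps 4-5)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((N, N))
    for t in range(T - 1):
        x = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi += x / x.sum()
    return gamma, xi

def baum_welch_pass(A, B, pi, utterances):
    """One embedded-training pass: zero the accumulators (step 1), scan
    every utterance (steps 2-6), then re-estimate all parameters (step 7)."""
    N, M = B.shape
    acc_xi = np.zeros((N, N)); acc_gamma = np.zeros(N)
    acc_obs = np.zeros((N, M)); acc_pi = np.zeros(N)
    for obs in utterances:
        gamma, xi = forward_backward(A, B, pi, obs)
        acc_pi += gamma[0]
        acc_xi += xi
        acc_gamma += gamma[:-1].sum(axis=0)
        for t, o in enumerate(obs):
            acc_obs[:, o] += gamma[t]
    A_new = acc_xi / acc_gamma[:, None]
    B_new = acc_obs / acc_obs.sum(axis=1, keepdims=True)
    return A_new, B_new, acc_pi / len(utterances)
```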
For decoding, HTK uses a formulation of the Viterbi algorithm called the Token Passing Model. At each time
frame, every state passes a token carrying its accumulated log probability to all connected next states; each
receiving state adds the corresponding transition and emission log probabilities, examines all incoming tokens,
and retains only the best. This formulation is used because it extends very easily across word boundaries: as a
token crosses a word boundary it records a pointer called a Word Link Record, and the chain of records can be
traversed at the end of the utterance to extract word boundaries. It also allows more information than the single
best path to be saved, so multiple hypotheses can be recorded.
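A minimal single-model token-passing sketch is given below; the names are hypothetical, and the Word Link Records used for word-boundary recovery are omitted for brevity.

```python
def token_passing(log_pi, log_A, log_B, obs):
    """Viterbi decoding by token passing. log_A[i][j]: transition log prob;
    log_B[j][o]: emission log prob. Each token carries (log prob, state path)."""
    N = len(log_pi)
    tokens = [(log_pi[i] + log_B[i][obs[0]], [i]) for i in range(N)]
    for o in obs[1:]:
        new_tokens = []
        for j in range(N):
            # every state passes its token to state j; only the best survives
            lp, path = max((lp + log_A[i][j], path)
                           for i, (lp, path) in enumerate(tokens))
            new_tokens.append((lp + log_B[j][o], path + [j]))
        tokens = new_tokens
    return max(tokens)  # best final token: (total log prob, state sequence)
```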
Language Modeling Support
The main language model (LM) used by the Sphinx decoder is a conventional bi-gram or tri-gram back-off
language model. More generally, Sphinx 2-4 support N-gram statistical grammars as well as finite-state
grammars. For generation of language models, however, Sphinx relies on separate software (the CMU
Statistical Language Modeling toolkit) for training and testing.
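To illustrate what "back-off" means here: when a bigram was seen in training, its discounted probability is used directly; otherwise the model falls back to a scaled unigram probability. A minimal log-domain sketch with hypothetical table names (an LM toolkit would precompute these tables):

```python
def bigram_log_prob(w1, w2, bigram_lp, unigram_lp, backoff_wt):
    """bigram_lp[(w1, w2)] and unigram_lp[w2] hold discounted log probs;
    backoff_wt[w1] is the back-off weight that keeps probabilities summing to 1."""
    if (w1, w2) in bigram_lp:
        return bigram_lp[(w1, w2)]                   # seen bigram: use it directly
    return backoff_wt.get(w1, 0.0) + unigram_lp[w2]  # back off to the unigram
```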
HTK provides a separate set of tools (in the HLMTools directory) for training and testing language models.
While not performed as part of this study, an entire HLM tutorial with training/test data is provided that uses
most of these functions and would be a good exercise (Section 15 of HTKBook). The HLMTools provide n-gram
model generation as well as class-based n-gram models. Tools are also available to easily measure LM
perplexity, using LPlex. Tools also exist to generate count-based models, which can grow dynamically as the
task vocabulary changes in content and size. Finally, the LMerge tool is useful for combining multiple existing
LMs into one.
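Perplexity, as reported by tools like LPlex, is the exponentiated average negative log probability that the LM assigns to the test words; a one-line sketch (names illustrative):

```python
import math

def perplexity(log_probs):
    """log_probs: natural-log LM probability of each word in the test text."""
    return math.exp(-sum(log_probs) / len(log_probs))
```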
Both Sphinx and HTK also support simple finite-state grammars specified using a BNF-style syntax. These are
useful for tasks with a relatively small rule set or command-driven sentences (e.g., “Call <phone-number>”).
Such a grammar was created manually in the first step of the HTK tutorial.
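In spirit, such a grammar defines a small acceptor over word sequences. The toy Python acceptor below mirrors the "Call <phone-number>" example; the real HTK grammar is written in HParse's BNF-style syntax, not Python.

```python
DIGITS = {"ZERO", "ONE", "TWO", "THREE", "FOUR",
          "FIVE", "SIX", "SEVEN", "EIGHT", "NINE"}

def accepts(words):
    """True if the word sequence is CALL followed by one or more digit words."""
    return (len(words) >= 2 and words[0] == "CALL"
            and all(w in DIGITS for w in words[1:]))

# accepts(["CALL", "FIVE", "FIVE", "FIVE"]) -> True
```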
Operating System Support and Installation Procedure
Both systems are developed using popular and well-supported GNU utilities such as Autoconf, GNU Make, and
the GCC compiler, which allows them to support most Unix variants. Both toolkits also include Windows-specific
build and installation instructions and defines that allow them to be imported as a Visual Studio project. The
README and INSTALL files in the source distributions are clear and easy to follow.
Framework Image Size Footprint
Total: ~26 MB
Total: ~49 MB
Software Baseline Comparison
Sphinx is organized across three components, which leads to a large amount of code, especially if all of the
possible decoder products are considered.
Sphinx uses a general Unix-style organization of files (header files in /include, source files in /src, library files
output to /lib); however, this leads to dependency nesting three or four levels deep in some cases, which adds to
the overall complexity. On average, most source files are 1,200 LOC or less, although some C files were found
to exceed 13,000 LOC!
Sphinx does have built-in unit/regression test modules (invoked with make test). This is an excellent resource
when modifying the original Sphinx code, as it verifies that pre-existing functionality is still intact.
HTK’s baseline is much simpler than Sphinx’s. All provided code is ANSI C, and the recognizer and training
components are shipped in the same package. The base code is all found in the HTKLib directory; this single
folder contains all the source and header files needed to compile the standalone HTKLib library. The HTKTools
directory contains one source file per executable target. Most of these files rely on functionality provided by
HTKLib, keeping the components well decoupled. The language modeling portion of the toolkit uses the same
style, with two directories: HLMLib and HLMTools. Generally, the source and header files are very well
maintained, with consistent formatting conventions and comments. All prototypes and structure definitions are
located in appropriate header files. Most of HTK’s source files are 1,400 lines or less, with the exception of a
few of the more complex tools such as HHEd (which is over 6,000 lines).
Ease of Use, Available Documentation, and Community Support
Sphinx’s main reference materials are available from the main web page. These include:
1. A fully implemented tutorial processing the AN4 speech corpus. This is an excellent resource that provides
a quick way to verify that all major components of Sphinx are running properly.
2. Many different corpora that can be used to build systems, along with pre-built models that can be used
directly with the recognizer.
3. The Sphinx “Manual”, a loose collection of background theory, frequently asked questions, and decoding
topics. Unfortunately, this section does not appear to be actively maintained and is generally not as
extensive as HTK’s documentation.
The greatest asset of Sphinx is its easy-to-set-up-and-run tutorial. Once it is working, developers can take a
white-box approach to the individual steps to see what commands are needed.
Sphinx also has a Wiki-style portal interface and Doxygen/Javadoc documentation for developers.
HTK’s reference, the “HTKBook”, is an extremely thorough guide for a speech recognition framework. The book
is split into the following major parts:
1. The first part gives enough background theory to equip relatively unversed readers with enough
knowledge to understand the mechanics of the toolkit. The toolkit is then briefly explained in terms of its
application content as well as the command-line semantics for getting started with HTK applications. Most
of these are then directly applied in an end-to-end tutorial that provides enough information to create and
test a speech model from scratch.
2. The second part provides extensive details about the core architecture of HTK through the major phases
of model training and testing.
3. The third part provides an in-depth look at the language modeling features that HTK provides as part of
its framework.
4. The fourth part provides a detailed reference for each application provided with the framework.
Overall, despite not providing a well-scripted tutorial out of the box (HTK does have a separately developed
tutorial, but it was found to be unhelpful for this particular task), HTK provides an excellent starting resource in
the “HTKBook”, provided it continues to be maintained alongside version releases. This, combined with loosely
coupled applications that share a uniform command syntax, makes HTK very easy to use from any Unix-like
command shell.
Community support is offered through three different mailing lists. This seems like an outdated method of
providing feedback and is not as useful as a “Wiki”-type interface.
Licensing and Usage Rights
All Sphinx products are fully owned and licensed by Carnegie Mellon University. All software is provided with an
“as is” clause and may be distributed in any form, source or binary, as long as the copyright information is
retained.
As described in the history of HTK, this framework has changed developers and owners several times during its
lifetime, and with these changes have come changing license requirements. In 2000, Microsoft licensed the
HTK software back to Cambridge University so that it could continue to develop and distribute the software. The
license is more restrictive than Sphinx’s in that the code may not be redistributed to third parties in any form or
fashion.
Future Development Plans
Of the Sphinx decoder products, only Sphinx 3 and Sphinx 4 have plans for continued development. Sphinx 4 is
the latest, Java-based version of the recognizer; however, it is still in the beta stage of development. Sphinx 3
sees the most active development of the “classic” architecture, with improvements continuously made and
released; its latest version was released in August 2007. The embedded version, PocketSphinx, will also be an
area of focus.
The plans found on HTK’s website are somewhat out of date. The last version of HTK released was version 3.4,
in December 2006. Cambridge plans to incorporate features used in research systems from various areas,
including more complex matrix modeling and Cluster Adaptive Training. There is no publicly announced plan for
another major release of HTK at this time.
Conclusion
HTK and Sphinx both offer well-defined and time-honored methods and algorithms for training and decoding
speech data. The unique features of each system must be studied and carefully weighed against the objectives
of the particular task at hand. A summary feature matrix of the discussed topics, giving a personal rating in each
category along with important notes, can be found at the link below.
Summary Matrix -- comparison_matrix.xls
References
Main HTK Website -- http://htk.eng.cam.ac.uk/
Sourceforge Sphinx -- http://cmusphinx.sourceforge.net/html/cmusphinx.php
Brief Sphinx/HTK Comparison -- http://lima.lti.cs.cmu.edu/moinmoin/SphinxHTK
HTKBook -- http://htk.eng.cam.ac.uk/prot-docs/htk_book.shtml
ASR System Review -- http://www.cis.hut.fi/Opinnot/T-61.6040/pellom-2004/lecture-09.pdf
Arthur Chan Sphinx Presentation -- http://www.cs.cmu.edu/~archan/sphinxPresentation.html
Sphinx-3 Decoder Wiki -- http://cmusphinx.sourceforge.net/sphinx3/doc/s3_description.html#lm_dumpfile