Testing and Improvement of the Triple Scoring Method for Applications of Wake-up Word Technology

Andrew Stiles, Brandon Schmitt, Frederick Gertz, Tudor Klein & Veton Kepuska

Manuscript received July 13, 2007. This work was supported in part by the National Science Foundation, IIS-REU-0647018 and IIS-REU-0647120.
Andrew Stiles is with Virginia Polytechnic Institute and State University, Blacksburg, VA 24060 (e-mail: astiles5@vt.edu).
Brandon Schmitt is with the Florida Institute of Technology (FIT), Melbourne, FL 32792 (e-mail: bschmitt@fit.edu).
Frederick Gertz is with Alfred University, Alfred, NY 14802 (e-mail: ftg5@alfred.edu).
Tudor Klein is with the Florida Institute of Technology, Melbourne, FL 32901 (e-mail: tklein@fit.edu).
Veton Kepuska is with the Florida Institute of Technology, Melbourne, FL 32901 (e-mail: vkepuska@fit.edu).

Abstract— Constant monitoring of an individual's voice and near-perfect recognition of a specific word, while maintaining consistent rejection of all other words, can be realized by implementation of Wake-Up Word (WUW) Speech Recognition (SR) technology. The algorithm shown here has the potential to add robustness even in a speaker-independent environment, and provides much better results for the application of single word recognition when compared to current industry and academic standards such as Microsoft SAPI and HTK, respectively. By implementing a Triple Scoring Method (TSM) with Hidden Markov Models (HMM) in the feature domain, the WUW modeling results are found to be far superior in single word recognition, providing a 15166.15% increase in correct recognition on the Callhome corpus over HTK and a 1303.78% increase over the Microsoft SDK.

Index Terms— Voice Recognition, Wake-Up Word, voice command, HTK

I. INTRODUCTION

Currently, most speech recognition software has the primary focus of converting human speech into text or providing a limited ability to give commands to an automated system. Many of the current systems in production lack robustness and are strongly speaker dependent, which means that they require significant training or adaptation. Single word recognition is usually only considered in theoretical applications (reference: HTK guide book). Since most recognition systems are required to model large vocabulary data sets, they are unable to achieve the single word precision that might be required for specific tasks.

Applications of accurate single word recognition are abundant, ranging from push-to-talk replacement to smart-room technologies. However, due to the inadequacies of current recognition methods, many possible applications of WUW technology have not gained wide adoption. With our new TSM these applications can be adopted, since the accuracy improvement makes them not only usable, but useful and even indispensable.

In order to obtain such an improvement, many different smaller tasks had to be completed to find how to achieve higher accuracies. The first step was to find a way to automate running different tests, so that more tests could be run and run more consistently. Another task was a parameter search for the HMM in order to find the optimal number of states, mixtures, and silent states to obtain the highest accuracy. The results then needed to be compared to existing technologies, so tests had to be run with those technologies, such as the leading open-source academic speech recognition system known as the Hidden Markov Model Toolkit (HTK) [Ref]. In order to run those tests, many different scripts and tools were necessary. Another topic that needed work was improving the voice activity detector, and investigating other possible improvements such as a Neural Network based Voice Activity Detector (VAD) or a Support Vector Machine (SVM) VAD. A grid search was run to find better parameters to use with the SVM. Many of the corpora contained files that had been mislabeled or were poor examples of the utterance, so outlier detection was an important aspect of improving the models. In order to make the WUW SR more effective and robust, pitch detection was used to help identify whether the user intended to wake up the application or not.

All the different pieces of the project had their part in improving the accuracy of the recognizer. Some of the pieces had more direct applications while others confirmed design decisions. Taken together, the improvements made a big difference in closing the gap between our recognizer's accuracy and what is necessary for a commercial application.
II. INTRODUCTION TO HTK TESTING, SVM GRID SEARCH, AND ASSORTED TOOLS

There were many different steps involved in working towards improving the performance of the e-WUW Speech Recognition System. The first step was working on automating the model generation and recognition process. The next step was to implement statistical testing to determine whether the differences in results were statistically significant. One step in improving the Support Vector Machine of the system was to perform a grid search to find the optimal gamma and c parameters for the model. Another part of the process was parsing the Phonebook corpus into the format needed for model generation and other tasks. The next large step in the process was to compare the results of the e-WUW Speech Recognition System to the results of the Hidden Markov Model Toolkit. The comparison with the HTK results confirmed that the eWUW-SRS is a great improvement over freely available systems. These varying tasks all played a part in improving the accuracy of the current eWUW-SRS as well as laying the groundwork for continued work in the future.

III. TESTING OF SCORING METHOD

Run_ewuw is a MATLAB script that, using a configuration file passed as a command-line parameter, performs many different functions. It can optionally generate features, generate a model named by the date and time it was generated, and train the model using the in-vocabulary scores from the CSV files set in the configuration file [1]. Once the model is trained it can score the in-vocabulary and out-of-vocabulary data passed in, and generate a plot which it saves as a figure file in the output directory. The script also saves a diary/log file that includes a copy of the configuration file it was run with, along with the support vector machine (SVM) model generated after training, using the same naming convention as the model, diary file, and figure file.

Example Usage:
run_ewuw('F:\amalthea\run-e_wuw\myconfig.txt')

Help Information: help run_ewuw
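The configuration file format itself is not reproduced in this paper. Purely as an illustration of the configuration-driven approach, a simple key=value file could be read into a MATLAB struct as sketched below; the helper name and file format are hypothetical and are not those of the actual run_ewuw parser.

function cfg = read_config_sketch(path)
% Illustrative configuration reader: one "key = value" pair per line,
% lines starting with '#' treated as comments.
% (Hypothetical helper; the real run_ewuw configuration format may differ.)
fid = fopen(path, 'r');
cfg = struct();
line = fgetl(fid);
while ischar(line)
    line = strtrim(line);
    if ~isempty(line) && line(1) ~= '#'
        eq  = strfind(line, '=');
        key = strtrim(line(1:eq(1)-1));
        val = strtrim(line(eq(1)+1:end));
        cfg.(key) = val;             % store every option as a string field
    end
    line = fgetl(fid);
end
fclose(fid);
end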

IV. IMPLEMENTATION OF MCNEMAR'S TEST

The first McNemar MATLAB function, sig_score, takes two vectors from the table of correct and incorrect acceptance [2]. The first row is the first vector, and the second row the second vector.

                         A0
                  Correct     Incorrect
A1   Correct        N00          N01
     Incorrect      N10          N11

Table 1. Table of Correct and Incorrect Acceptance for McNemar's Test [2]

The second function, mcnemar_test, derives the correct acceptance and rejection counts from fractional accuracies such as those generated by the SVM classifier. The SVM classifier would generate an accuracy figure such as (3890/4184) for one test and, say, (3900/4184) for a second test. The second function takes those two fractions, calculates the correct acceptance and rejection numbers from them, passes those numbers into the original McNemar function, and then returns the probability from the test, which indicates whether the difference is statistically significant or not. The methodology used by the second function is the following:

P = 2 \sum_{m=n_{10}}^{k} \binom{k}{m} \left(\frac{1}{2}\right)^{k}, \quad \text{when } n_{10} > \frac{k}{2}

P = 2 \sum_{m=0}^{n_{10}} \binom{k}{m} \left(\frac{1}{2}\right)^{k}, \quad \text{when } n_{10} < \frac{k}{2}

where n10 and n01 are the off-diagonal counts of Table 1 and k = n01 + n10 is the number of trials on which the two systems disagree.

The parameters of the mcnemar_test function are the numerator of the first fraction, the denominator of the first fraction, the numerator of the second fraction, and the denominator of the second fraction.

Example usage:
p = mcnemar_test(Num_of_f1, Denom_of_f1, Num_of_f2, Denom_of_f2)
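For reference, the exact binomial form of the test given above can be written directly in MATLAB, as in the following sketch, where n01 and n10 are the discordant counts of Table 1. This is a minimal re-implementation for illustration, not the project's sig_score or mcnemar_test code.

function p = mcnemar_exact_sketch(n01, n10)
% Exact (binomial) McNemar test on the discordant counts of Table 1.
% n01: trials system A got right and system B got wrong
% n10: trials system A got wrong and system B got right
k = n01 + n10;                       % trials the two systems disagree on
if n10 > k/2
    m = n10:k;                       % upper tail
else
    m = 0:n10;                       % lower tail
end
p = 2 * sum(arrayfun(@(mm) nchoosek(k, mm), m)) * 0.5^k;
p = min(p, 1);                       % two-sided p-value is capped at 1
end

A p-value below the chosen significance level (for example 0.05) indicates that the difference between the two result sets is statistically significant.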



V. SUPPORT VECTOR MACHINE GRID SEARCH

There are two configurable parameters for the SVM model, gamma and 'c' [3]. The gamma value controls how closely the model is trained to the data, so with a higher gamma the model is fitted to the training data more tightly. The 'c' parameter controls how intolerant the model is of error, so with a higher 'c' the SVM generates more intricate models that are more tightly fitted to the data. The 'c' and gamma parameters are slightly inversely related, so data fitted by a model with a lower gamma and higher 'c' could be fitted similarly by a model with a higher gamma and lower 'c' value. To generate the grid to search, a linearly spaced grid was used, with a top and bottom value for the gamma and 'c' ranges and a specification for the number of points in each range, say, twenty gammas and twenty 'c's. The in-vocabulary and out-of-vocabulary scores were randomly split into two equal halves; one half of the in-vocabulary and one half of the out-of-vocabulary were used to train the SVM, and the other halves were recognized. Then the grid search was run, iterating through all four hundred combinations and comparing the accuracy percentage for each model to the best percentage saved so far. If the grid search found a higher percentage it would save it and output the associated gamma and c parameters.

Example usage:
grid_test(oov_CSV_file, inv_CSV_file, result_file)

In order to compare the results of the grid search to results using old gamma and c combinations, another script was needed, based on the first grid search script, which would take the in-vocabulary and out-of-vocabulary scores, split them in the same manner, and then run the SVM tests to generate an accuracy using the gamma and c specified in the file.

Example usage:
svmtest()

From the grid search the best parameters found for the cleaned CCW17 list were a gamma of 0.02 and a c of 3.6. These parameters yielded a significant improvement over the previous values of 0.08 and 15, respectively, for the three-score model.
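The loop structure of such a grid search is illustrated by the MATLAB sketch below. It assumes the LIBSVM MATLAB interface (svmtrain/svmpredict), illustrative variable names (inv_scores and oov_scores holding the in-vocabulary and out-of-vocabulary score vectors), and arbitrary search ranges; it is not the project's grid_test script.

% Candidate parameter grids (ranges here are placeholders only).
gammas = linspace(0.01, 0.20, 20);
cs     = linspace(0.5, 20.0, 20);

% Random half/half split of each class into training and test sets.
idx_inv = randperm(size(inv_scores, 1));
idx_oov = randperm(size(oov_scores, 1));
n_inv = floor(numel(idx_inv)/2);
n_oov = floor(numel(idx_oov)/2);
train_x = [inv_scores(idx_inv(1:n_inv), :); oov_scores(idx_oov(1:n_oov), :)];
train_y = [ones(n_inv, 1); -ones(n_oov, 1)];
test_x  = [inv_scores(idx_inv(n_inv+1:end), :); oov_scores(idx_oov(n_oov+1:end), :)];
test_y  = [ones(numel(idx_inv)-n_inv, 1); -ones(numel(idx_oov)-n_oov, 1)];

best_acc = -inf;
for g = gammas
    for c = cs
        model = svmtrain(train_y, train_x, sprintf('-t 2 -g %g -c %g', g, c));
        [pred, acc, dec] = svmpredict(test_y, test_x, model);
        if acc(1) > best_acc                 % keep the best accuracy seen so far
            best_acc = acc(1);  best_g = g;  best_c = c;
        end
    end
end
fprintf('best gamma = %g, best c = %g, accuracy = %.2f%%\n', best_g, best_c, best_acc);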
                                                                   formatting for the header in each file. The following is an
VI. HIDDEN MARKOV MODEL TOOLKIT TUTORIAL

In order to work through the HTK tutorial it was necessary to read the HTK documentation provided on the HTK website and follow the tutorial to learn how to use the HTK software [4]. With a basic understanding of the software it was necessary to download the source code, compile a library from the common source, and then include that library in the twenty or so other projects for each of the separate HTK tools, such as HCopy, HRest, HInit, and HVite.

With all the different HTK tools compiled, the next step was to set up the standard HTK files which were used for all of the future HTK testing. These files were the dictionary file, a grammar file, and a word net file. The dictionary file contained the word list, which for a wake-up-word application was just "operator", as well as a phone, which in a wake-up-word application was just the word operator, since no phoneme-based model was used. The grammar file told HTK what to use as the wake-up-word, which was operator, as well as how to expect it, which was with an optional sentence start, the wake-up-word, and then an optional sentence ending. The word net file described to the toolkit how to set up the model, which had to match how the grammar file was constructed. The model therefore dictated that there was a null state, which would be the entry state, then the wake-up-word, operator, and then another null state that was the exit state, and that it could travel from the entry state, to the wake-up-word, to the exit state, or in the reverse order.

Once the standard files were set up, the next step was to figure out how to use the HTK tools in the proper order based on the needs of a wake-up-word application, which is different from the tutorial since the tutorial is aimed at people doing research with phoneme-based models. The proper order was determined to be: first generate a blank hidden Markov model based on the number of states, mixtures, dimensions, and coefficients being used, which in this case was twenty-five states, two mixtures, thirty-nine dimensions, and Mel Frequency Cepstral Coefficients (MFCCs) with energy, derivative, and acceleration features, or MFCC_E_D_A as the configuration parameters are specified in HTK.

A. HTK Testing Using Feature Segments and HTK Models with Various Scoring

The process followed with HTK to test HTK's scoring methods, using features generated by our VAD, is as follows. First the feature data generated by our VAD was used with a function that converts the feature files into the format needed for the HTK tools, since they expect a certain formatting for the header in each file. The following is an example of how to use that function:

Example Usage:
make_htk('E:\amalthea\htk-3.3\tutorial\list_file.list');

After the feature files had been converted into HTK files, the next step was to use the HTK tool HInit along with a blank model generated using a script that initializes a hidden Markov model. Next, that model was used along with HRest to update the model so that it was more accurate. Finally, HVite was used to generate scores for the in-vocabulary and out-of-vocabulary feature files using the CCW17 corpus. From the scores the following plot was generated using MATLAB. This plot depicts the score distributions of the out-of-vocabulary and in-vocabulary scores along with curves that show the percentage distribution. The graph shows a fairly high level of overlap between the in-vocabulary and out-of-vocabulary scores, which is not desirable. The Equal Error Rate (EER) of 22.9% is much worse than that obtained using the e-WUW Speech Recognition System.


Fig. 1. CCW17 Using VAD Features and HTK Model Generation/Scoring. (Plot "CCW17 Recognized Using HTK": out-of-vocabulary and in-vocabulary score distributions with OOV and INV percentage curves, Percent vs. Score; EER: 22.9%.)

After using a proprietary method, HVite was used to generate INV and OOV scores. A scatter plot of both scores was generated, yielding the following plot. As can be seen from the plot, without using HRest the distributions of out-of-vocabulary and in-vocabulary utterances show excessive overlap.
Fig. 2. CCW17 Using VAD Features and Two HTK Scores. (Plot "CCW17 Using VAD Features and HTK Score Generation": scatter of out-of-vocabulary and in-vocabulary utterances, Score vs. Score.)
After the model was plotted, the process was repeated, except this time HRest was used after HInit in order to refine the model; the same procedures were then followed as previously to generate the following graph. As can be seen from the graph, using HRest greatly improved the distributions of the out-of-vocabulary as well as the in-vocabulary scores. However, there is still a significant number of in-vocabulary scores within the out-of-vocabulary distribution.


Fig. 3. CCW17 Using VAD Features and Improved Two Score HTK. (Scatter plot of out-of-vocabulary and in-vocabulary scores, Score vs. Score.)

The next experiment uses yet another (proprietary) method of obtaining a second score. Once the method was applied, HVite was run again to generate INV and OOV scores, and they were graphed, yielding the following plot. This shows a fairly high correlation with the in-vocabulary scores, which is good, but there is also a fair number of in-vocabulary utterances that were given scores placing them inside the out-of-vocabulary distribution, which is not desirable.
Fig. 4. CCW17 Using VAD Features and Two HTK Scores. (Plot "CCW17 Using VAD Features and Two Score HTK": scatter of out-of-vocabulary and in-vocabulary scores, Score vs. Score.)





Once the three sets of scores were generated, the support vector machine (SVM) was used to train and then plot the combinations of the three different models.

B. Automation of HTK Testing

In order to automate the HTK testing it was necessary to write a number of different MATLAB scripts. First a script was needed that would generate a blank hidden Markov model in the form that HTK uses. The script was written so that it would generate a model with an adjustable number of states, mixtures, and dimensions. The script would first output the standard header information based on the inputs, and then iterate through the states from 2 to n-1, where n is the number of states. For each state it would iterate through creating mixtures based on the value that was passed in. Once it finished with the states it would output a tag indicating that the transition probability matrix was starting, and then generate and print a forward-model transition matrix, where the probability for the entry state is 1, the probabilities for each state to stay in its state or go to the next state are both 0.5, and the exit state probability is 0. With this script all that was necessary was to supply the place to save the model and the number of states, mixtures, and dimensions to use, and it would output a valid blank HTK HMM model.

Example Usage:
gen_blank_hmm('E:\amalthea\htk-3.3\tutorial\test_model',25,2,39)
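The sketch below illustrates this kind of prototype generator, using the HMM definition keywords described in the HTK Book (zero means, unit variances, equal mixture weights, forward-only transitions). It is an approximation of the idea rather than the project's actual gen_blank_hmm script, and details such as the model name are assumptions.

function gen_blank_hmm_sketch(out_file, n_states, n_mix, n_dim)
% Write a blank HTK HMM prototype: n_states total states (including the
% non-emitting entry and exit states), n_mix Gaussian mixtures per state,
% n_dim feature dimensions, MFCC_E_D_A parameter kind.
fid = fopen(out_file, 'w');
fprintf(fid, '~o <VecSize> %d <MFCC_E_D_A>\n', n_dim);
fprintf(fid, '~h "operator"\n<BeginHMM>\n<NumStates> %d\n', n_states);
for s = 2:n_states-1                                   % emitting states only
    fprintf(fid, '<State> %d\n<NumMixes> %d\n', s, n_mix);
    for m = 1:n_mix
        fprintf(fid, '<Mixture> %d %g\n', m, 1/n_mix); % equal mixture weights
        fprintf(fid, '<Mean> %d\n', n_dim);
        fprintf(fid, ' %g', zeros(1, n_dim));  fprintf(fid, '\n');
        fprintf(fid, '<Variance> %d\n', n_dim);
        fprintf(fid, ' %g', ones(1, n_dim));   fprintf(fid, '\n');
    end
end
% Forward-model transition matrix: the entry state moves to state 2 with
% probability 1, each emitting state stays or advances with probability 0.5,
% and the exit state has no outgoing transitions.
T = zeros(n_states);
T(1, 2) = 1;
for s = 2:n_states-1
    T(s, s) = 0.5;  T(s, s+1) = 0.5;
end
fprintf(fid, '<TransP> %d\n', n_states);
for s = 1:n_states
    fprintf(fid, ' %g', T(s, :));  fprintf(fid, '\n');
end
fprintf(fid, '<EndHMM>\n');
fclose(fid);
end

Called as gen_blank_hmm_sketch('test_model', 25, 2, 39), this would produce a prototype comparable in shape to the one used above.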
The next step in automation was to write a function that would generate the script file to use with the HCopy utility, which generates HTK feature files. This function would take in a list of the source files to generate features for and add on the place to store the generated MFC file, so that each line of the script file was in a source-destination format.

Example Usage:
make_mfc('E:\amalthea\htk-3.3\callhome.list','E:\amalthea\htk-3.3\callhome\seg_files')
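A minimal sketch of such a script-file generator is given below. The function name, signature, and the .mfc naming convention are illustrative; the actual make_mfc implementation may differ.

function make_mfc_sketch(list_file, out_dir, script_file)
% Build an HCopy script file: each line pairs a source audio/feature file
% with the destination .mfc file that HCopy should produce.
src = textread(list_file, '%s', 'delimiter', '\n');   % one source path per line
fid = fopen(script_file, 'w');
for i = 1:numel(src)
    if isempty(src{i}), continue; end
    [~, name] = fileparts(src{i});
    fprintf(fid, '%s %s\n', src{i}, fullfile(out_dir, [name '.mfc']));
end
fclose(fid);
end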
After MFC files were generated, a function was needed that could take those files, along with the associated VAD segmentation files that the VAD (of the e-WUW Speech Recognition System) had generated, and produce a list of the new files to make along with the segments of the old files that they would come from. This function would get a list of the files in the directory where the VAD segments were, then take the VAD segments whose filenames matched a certain expression, such as 007, for "operator" in the CCW17 corpus. The function would then parse the segments from each VAD segment file and write each segment on a line along with the new filename associated with it. In the case where the VAD did not have an ending location, the function would just write the starting location, and later scripts would assume to read from that starting location until the end of the file. Later, when dealing with larger corpora, it became necessary to make the function more efficient. To do this it would pre-allocate a thousand-element array to store the lines that it would write to the new file, and every thousand lines it would write to the file, reset to the beginning of the array, and continue. This simple change improved the efficiency of the function immensely.

Example Usage: segment_htk(007)
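The buffered-write pattern described above can be sketched as follows; the variable and file names are illustrative and this is not the original segment_htk source.

BLOCK = 1000;
buffer = cell(BLOCK, 1);                  % pre-allocated line buffer
n = 0;
fid = fopen('segment_list.txt', 'w');     % hypothetical output file
for i = 1:numel(lines_to_write)           % lines_to_write: cell array of strings
    n = n + 1;
    buffer{n} = lines_to_write{i};
    if n == BLOCK                         % flush a full buffer and reuse it
        fprintf(fid, '%s\n', buffer{1:n});
        n = 0;
    end
end
if n > 0                                  % flush whatever is left over
    fprintf(fid, '%s\n', buffer{1:n});
end
fclose(fid);

Writing in blocks of a thousand lines avoids the repeated small writes that made the first version of the function slow.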
Once a list of the new files to write, along with their associated VAD segmentation, was generated, a function was needed that could take that information and write the actual segmented files. The function written would read in the list file that the previous function wrote, then read in the associated segments of the whole files and write those segments to their own separate files. A similar method was used to optimize this function, where the array was preallocated and the function would periodically write the array to disk and then use the same array again.

Example Usage:
make_segfiles('E:\amalthea\htk-3.3\tutorial\ccw_inv.list')

After the segmented files were written, the model initialization and model update procedures were performed. Once those procedures were complete, recognition was performed on the segmented test files, and then another function was used. This function would generate new models based on proprietary technology to calculate the second and third score.

Example Usage:
flip_trans('E:\htk-3.3\operator','E:\htk-3.3\second_operator')
all_trans('E:\htk-3.3\operator','E:\htk-3.3\third_operator')

Once the three different models were finished, the HTK scoring tool was used to score the segmented features. Once they were scored, a function was used that would take the scoring files output by the recognition tool and parse the scores from the text in each file. Occasionally the recognition tool would report that a node had died, and in that case a zero score was used for that node. Once the function had parsed out all the scores from the file it would output them to a CSV file named according to the parameter passed to the function.

Example Usage:
parse_htk_scores('E:\htk-3.3\ccw_phonebook.txt','E:\htk-3.3\phone_ccw.CSV')

C. Parsing and Generation of Phonebook Corpus


In order to parse the Phonebook corpus, the first step was to read the documentation and find out the format in which the transcription files were stored. In order to parse these transcription files it was necessary to write a function, because it would have been infeasible to transcribe the approximately 95,000 separate utterances by hand. The basic operation of the function is that it creates the directory structure with a separate directory for each speaker, then reads in the transcription files and opens a file handle to write the new transcription file. It writes a standard header to the file, and then iterates through the wordlist and constructs an array containing all the separate words in the corpus. Once it constructs a wordlist it then iterates through the transcription file of all the different speakers, matches up the words each speaker spoke with the words from the wordlist, and then copies the old file to the correct new directory based on the speaker. Once the file has been copied the function outputs a line to the new transcription file that includes the filename, the utterance number, and the number of the speaker who spoke it.

Example Usage:
parse_phonebook()

VII. HTK TESTING USING HTK FEATURE GENERATION AND HTK MODELS

For the formal HTK testing, where original HTK was used for the whole process from feature generation to scoring, the scripts that were written previously were used, as well as a few new scripts that dealt with segmented HTK feature files, which were described earlier in the paper. In order to use the HTK feature generation, a few modifications to the HTK source were necessary to get it to handle the files it had to generate features for. One modification was to add support for NIST files so that HTK could read the Phonebook files, which were stored as NIST files with a 1024-byte NIST header followed by 8-bit non-interleaved ULAW data. HTK had support for NIST files but would not correctly recognize the Phonebook files. From debugging the source code it was evident that HTK did support NIST files, but only when the actual audio data was an interleaved ULAW file; it did not support non-interleaved ULAW files. In order to fix this, some code was added to the existing NIST function so that, if the NIST file contained non-interleaved ULAW data, it would call the wave-format ULAW handler that the HTK source code already had, which could handle the non-interleaved ULAW correctly.

Another modification to the HTK source code was needed since there was no support for plain ULAW files that were not contained in other files such as a wave file. In order to add support for ULAW files, a few more modifications were necessary than were needed for the non-interleaved NIST ULAW support. The first thing done to add ULAW support was to add a format specifier to an HTK enumeration, as well as another structure, so that the tool would recognize an "INULAW", or plain interleaved ULAW, file. Once the format was added, a fairly simple function was needed that followed the same convention as the rest of the file handlers. All the function had to do was call the existing handler for interleaved ULAWs and return the result. Once a function handler existed, some code was added to one of the existing if-else statements so that, if the "INULAW" format specifier was passed in, it would call the new function and handle the result properly.

Once the modifications necessary to get HTK to generate the proper feature files were done, the clean standardized list that was used for testing our implementation was used to generate a file containing a list of the audio data files for HTK to use. Once the plain list was generated, the next step was to use the script which would transform the list into a script with a series of source and destination files. Essentially the same configuration file was used as in the earlier testing, with a few changes such as a different SOURCEFORMAT specification, as well as an option to force HTK to add some noise to files that were completely silent so that it would generate accurate features. Then HCopy was used to generate features using a command similar to the following:

..\HTK\release\HCopy.exe -T 1 -C config_hcopy -S wuwII-operator.scp

Once the feature files had been generated, a blank model was constructed using the script with 25 states, 2 mixtures, and 39 dimensions. Although the comparison testing using our implementation was done with six mixtures, HTK could not handle six mixtures, so it was necessary to use two mixtures instead; attempts to use six mixtures with HTK ended with too many clusters dying. The blank model was set to be an MFCC_E_D_A model, or MFCC model with energy, first derivative, and second derivative features.

After the model was constructed it was trained using the wuwII + CCW17 clean list, with a command similar to the following:

..\HTK\release\HInit.exe -T 7 -C config_hmm -S cc_wuw2.list operator

Once the model was initialized it was trained using the same list with a command similar to the following:

..\HTK\release\HRest.exe -T 7 -C config_hmm -S cc_wuw2.list operator

After the model had been trained it was used to recognize the same list, to get in-vocabulary scores, and then the Phonebook and Callhome lists, to get out-of-vocabulary scores. To recognize the same list the following command was used:

..\HTK\release\HVite.exe -T 1 -w wdnet -C config_hmm -S cc_wuw2.list dict hmm_list > ccw17wuw2.txt

To recognize the Phonebook list the following command was used:

..\HTK\release\HVite.exe -T 1 -w wdnet -C config_hmm -S phonebook.list dict hmm_list > phone_ccw17wuw2.txt


To recognize the Callhome list the following command was used:

..\HTK\release\HVite.exe -T 1 -w wdnet -C config_hmm -S callhome.list dict hmm_list > callhome_ccw17wuw2.txt

Once the scores were generated by HVite, parse_htk_scores was used to parse the scores from the text files and save them into a CSV file. After the CSV files were constructed, they were read into MATLAB using csvread, and the following plots were generated.


Fig. 5. Callhome corpus recognized using a CCW17+WUWII trained model. (Plot "Callhome OOV Using CCW17+WUWII Model": out-of-vocabulary and in-vocabulary score distributions with OOV and INV percentage curves, Percent vs. Score; EER: 7.2%.)



Fig. 6. Phonebook corpus recognized using a CCW17+WUWII trained model. (Plot "Phonebook OOV Using CCW17+WUWII Model": out-of-vocabulary and in-vocabulary score distributions with OOV and INV percentage curves, Percent vs. Score; EER: 18.3%.)

The plot of the Callhome corpus shows an equal error rate (EER) of 7.2%, while the Phonebook corpus shows an EER of 18.3%. The EERs for both tests are significantly worse than those achieved by the triple scoring method. This shows how the triple scoring method is a significant improvement over the existing market technology.
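For completeness, the equal error rates quoted for these plots can be estimated from the saved INV and OOV score vectors with a simple threshold sweep, as in the MATLAB sketch below; the CSV file names are hypothetical and this is not part of the original tool chain.

inv = csvread('inv_scores.csv');            % in-vocabulary scores (hypothetical file)
oov = csvread('oov_scores.csv');            % out-of-vocabulary scores (hypothetical file)
thr = sort([inv(:); oov(:)]);               % candidate decision thresholds
far = arrayfun(@(t) mean(oov >= t), thr);   % false acceptance rate at each threshold
frr = arrayfun(@(t) mean(inv <  t), thr);   % false rejection rate at each threshold
[~, i] = min(abs(far - frr));               % point where the two error rates cross
eer = (far(i) + frr(i)) / 2;
fprintf('EER = %.1f%%\n', 100 * eer);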



VIII. WRAPPER CLASS FOR EXISTING E-WUW DEMO AND GRAPHICAL INTERFACE




                                                Fig. 7. Wake Up Word main dialog


In order to write a graphical demo, the first step was to write a wrapper for the existing demo configuration file. The wrapper was a class which contained an object of the configuration type and which had member functions so that the graphical interface classes could set the different configuration options, such as the paths to the HMM and SVM models. Also, a function pointer was set up so that the client could call a set function, pass in the name of its print function, and then, when the wrapper needed to print something, it would call the client's print function. Once all the configuration options were set, the client would call the run function, which would interface with the existing configuration type, finish setting its variables, and then run the recognition process.

In order to make the graphical interface for the demo, an MFC project was used as a starting point. Once the base project was constructed, the GUI editor was used to construct the dialog and the preferences option, and to generate code outlines for the command handlers. The preference dialog was implemented as an extension of the existing CDialog class so that it could define command handlers for the 'ok' and 'cancel' buttons as well as initialize itself properly. It was also necessary to add support so that the load model options would bring up a dialog box allowing the user to select different models to load using the standard Windows file dialog box.








Fig. 8. Wake-Up-Word Demo Preferences Dialog

Once the dialog class was written, handlers were needed for the different dialog functions, such as handlers to disable the other configuration options when live input was selected. More handlers were needed that would interface with the wrapper class and get the current values when the preference dialog was opened, so that it could display the current settings. The print function written would take the strings passed to it and print them to the main edit box control on the dialog. The dialog would set the function pointer in the wrapper class during its initialization, and then the wrapper would call the print function to produce output.

The final step in writing the demo was to implement threading, so that the main dialog box would make a new thread and use it to call a wrapper function that in turn calls the run function of the wrapper class. The additional indirection of having a separate function that calls the wrapper's function was necessary because C++ does not allow an ordinary function pointer to point to a non-static class member function. The threading allows the demo to run the recognizer without blocking I/O to the graphical interface.

IX. INTRODUCTION TO MICROSOFT SPEECH RECOGNITION ENGINE TESTS AND VOICE ACTIVITY DETECTION

In order to quantify the overall performance of eWUW, a comparison to currently available recognition systems was necessary. Microsoft's Recognition Engine was chosen as the commercially available recognizer to compare against because it is widely available and was easily implemented. eWUW was found to improve over Microsoft's engine by 1303.78% for Correct Rejection and False Acceptance in tests run on the CallHome corpus. Correct Acceptance was improved by 1760.00% over Microsoft's results on the same test data.

When examining the false rejection errors manually, it was observed that the current voice activity detector often incorrectly detected the start and end of the utterance, which may have influenced the invalid scoring. In an attempt to improve the accuracy of the recognition system, two methods for classifying whether or not the current section of the input signal was voice data were investigated: neural networks and support vector machines. The research into these methods has led to a new implementation of the voice activity detector using a linear support vector machine model.

X. MICROSOFT SPEECH RECOGNITION ENGINE

The Microsoft Speech Recognition Engine (MSR) is a commercially available speech recognition engine developed and distributed by Microsoft Corporation. Interfacing with the recognition engine is provided through a number of development interfaces packaged as Microsoft's Speech API (SAPI). Through SAPI a developer may create programs built upon any SAPI-compliant recognition engine. Microsoft's engine is distributed freely with the Microsoft Windows XP and Vista operating systems, and is also freely downloadable from the speech team's website, http://www.microsoft.com/speech.

Microsoft's Speech API Software Development Kit is currently in version 5.1.



For this experiment the current version of the Speech API was used to interface with Microsoft's English (U.S.) version 6.1 Speech Recognition Engine.

To examine Microsoft's Speech Recognition Engine's capabilities in a Wake-Up-Word environment, a wrapper program was needed to prepare the wake-up-word grammar, load the test corpora files, and store the results. The control application was written in C# and used Microsoft's Speech API to interface to the recognition engine.

Fig. 9. Microsoft Speech Controller Application

The test files were Microsoft Windows .wav formatted files containing a small period of silence, followed by the utterances to be examined, followed by another small period of silence. Each test corpus contains a unique collection of speakers with recorded segments. The CCW17 and Phonebook corpora contain individual utterances per file, while the WUWII and Call Home corpora contain periods of continuously recorded spoken data (conversational spontaneous speech) about 30 minutes long. A transcription file for each corpus provided a written record of the utterances in each file.

Inputs to the program consisted of either a list of files plus the wake-up word to be examined, or a combined space-delimited list file containing both the wake-up word to be examined and the path to the associated test file. The list files for each test corpus were generated from the transcription files and directory structure. The files were separated into one of the following three categories: files which contained only the Wake-Up-Word, files which contained the Wake-Up-Word in a sentence, and files which did not contain the Wake-Up-Word.

Each file in a given list was loaded in sequence and passed to the recognition engine. A callback function was registered to the Correct Recognition event from the recognition engine. Each time the event fired, a new line was added to the output comma-separated value file, containing the full file name, the utterance's offset from the start of the file at its beginning, the utterance's offset from the start of the file at its end, and the recognizer's confidence score. A second callback function was registered to the StreamEnd event from the recognizer. The event triggered once the file was fully processed by the recognizer. When the event triggered, a new file was selected from the list and loaded into the recognition engine. The process is repeated until each file in the given list is processed.

A. Engine Parameter Selection

Initial tests were performed to determine the optimal recognition parameter settings for Microsoft's Speech Recognition (SR) engine. Using the CCW17 corpus list, the correct acceptance and correct rejection values were observed and an optimal set chosen.

Sensitivity   Speed   INV Reco   Total INV   OOV Reco   Total OOV   Correct Acc.   Correct Rej.   Overall Acc.   Expected Acc.
    1          1/2        0         401          0         3833        0.00%        100.00%        90.53%         90.53%
   1/2         1/2      212         401          1         3833       52.87%         99.97%        95.51%         90.53%
   1/4         1/2      307         401         19         3833       76.56%         99.50%        97.33%         90.53%
   1/8         1/2      326         401         48         3833       81.30%         98.75%        97.09%         90.53%
    0          1/2      366         401        386         3833       91.27%         89.93%        90.06%         90.53%



 Table 2. Microsoft Speech Recognition Engine Parameter
 Tests using CCW17 Corpora with “Operator” as the Wake-          Full tests were performed on the CCW17, WUWII,
                        Up-Word.                                 Phonebook and Call Home corpora with the selected engine
                                                                 parameters. Matlab scripts were written to process the output
The selected parameters were 1/8 the total for sensitivity and   csv files and develop the following summaries for the test
1/2 the total for speed. Microsoft’s speech control panel was    corpora.
used to adjust the parameters of the recognition engine.

                                                                 B. CCW17 Corpus Tests

The CCW17 test corpus consists of up to 10 words for 607 telephone recordings with a single utterance per file. In this experiment lists were generated for each of the ten words as the Wake-Up-Word. For each potential Wake-Up-Word an in-vocabulary list was created of those files which contained the Wake-Up-Word. A second list of all other files, which did not contain the Wake-Up-Word, became the out-of-vocabulary list. One of the ten potential Wake-Up-Words was excluded from this experiment due to grammar constraints: the word IOBI (pronounced eye-o-bee) was not included as a WUW test, as the word was likely not contained in Microsoft's dictionary; however, its spoken utterances were included in the out-of-vocabulary lists for the remainder of the potential Wake-Up-Words.
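As a rough illustration of this list generation step, the Matlab sketch below assumes a cell array of file names and a parallel cell array of transcript strings; both variable names are hypothetical, and the actual corpus bookkeeping scripts are not reproduced in this report.

   % Illustrative sketch: split corpus files into in-vocabulary and
   % out-of-vocabulary lists for one candidate Wake-Up-Word.
   wuw = 'operator';
   isInv = cellfun(@(t) any(strcmpi(strsplit(t), wuw)), transcripts);
   fid = fopen('operator_inv.lst', 'w');
   fprintf(fid, '%s\n', files{isInv});      % files containing the WUW
   fclose(fid);
   fid = fopen('operator_oov.lst', 'w');
   fprintf(fid, '%s\n', files{~isInv});     % all remaining files
   fclose(fid);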

Each list was passed through the Microsoft engine using the custom SAPI interface program. In the following table the results for each set of test runs have been compiled. For each of the potential Wake-Up-Words the number of recognition responses from the engine has been recorded. Using the recognition count, the total files in the list, and the in- or out-of-vocabulary context, the correct acceptance, correct rejection, overall accuracy, and expected accuracy could be calculated using the following formulas.

   Correct Acceptance  =  INV Recognitions / Total INV Files
   Correct Rejection   =  1 − (OOV Recognitions / Total OOV Files)
   Overall Accuracy    =  (INV Recognitions + Total OOV Files − OOV Recognitions) / (Total INV Files + Total OOV Files)
   Expected Accuracy   =  Total OOV Files / (Total INV Files + Total OOV Files)   (the accuracy obtained if every file were rejected)

   Table 3. Microsoft Speech Recognition Engine Test Results Calculations
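As a worked example of these formulas, the Matlab fragment below reproduces the sensitivity 0 row of Table 2 from its raw counts; it is a sketch only and not the project's actual post-processing script.

   % Counts taken from the sensitivity 0 row of Table 2.
   invReco = 366;   totalInv = 401;    % in-vocabulary recognitions / files
   oovReco = 386;   totalOov = 3833;   % out-of-vocabulary recognitions / files
   correctAcceptance = invReco / totalInv;                                     % 91.27%
   correctRejection  = 1 - oovReco / totalOov;                                 % 89.93%
   overallAccuracy   = (invReco + totalOov - oovReco) / (totalInv + totalOov); % 90.06%
   expectedAccuracy  = totalOov / (totalInv + totalOov);                       % 90.53%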






Test Run: 6/5/2007.  Using: Microsoft Recognition Engine.  Sensitivity: 1/8 and Speed: 1/2 for all words.

   WUW                INV Reco   Total INV   OOV Reco   Total OOV   Correct Acc.   Correct Rej.   Overall Acc.   Expected Acc.
   000 – Call            515        607         226        3627       84.84%         93.77%         92.49%         85.66%
   001 – Contacts        212        250         363        3984       84.80%         90.89%         90.53%         94.10%
   002 – Dialer          244        267         905        3967       91.39%         77.19%         78.08%         93.69%
   003 – Disconnect      257        559          11        3675       45.97%         99.70%         92.61%         86.80%
   004 – Drop            482        591         211        3643       81.56%         94.21%         92.44%         86.04%
   006 – Messages        137        243           1        3991       56.38%         99.97%         97.47%         94.26%
   007 – Operator*       326        401          48        3833       81.30%         98.75%         97.09%         90.53%
   008 – Return          489        568         372        3666       86.09%         89.85%         89.35%         86.58%
   009 – Stop            235        396         389        3838       59.34%         89.86%         87.01%         90.65%
   * Operator results are from the test run on 5/1/2007.

   Table 4. Microsoft Speech Recognition Engine Test Results – CCW17 Corpus

C. WUWII Corpus Tests
The WUWII corpus consists of 317 telephone recordings, each with 11 files containing single utterances or recorded sentences. Unlike the CCW17 corpus, the WUWII corpus contains a mixture of single utterances per file and continuous recorded sentences. In this corpus, however, there are five words which are used in a Wake-Up-Word command context: Onword, Operator, ThinkEngine, Voyager and Wildfire.
For this experiment there was an added level of complexity over the single utterance CCW17 corpora. For each of the potential Wake-Up-Words, three list files were generated. One list contained all files which had the Wake-Up-Word spoken as a single utterance with a small period of silence before and after the utterance.



A second list was generated which contained all files which had the Wake-Up-Word spoken and also contained other text. A third list was generated containing all files which did not have the Wake-Up-Word spoken.
The table below contains the compiled results for each of the potential Wake-Up-Words. The first three lines of each block contain the raw numbers of recognitions output by the engine for each of the generated lists; the remaining four lines contain summarized and calculated results based on the raw output data.
                                          WUW        WUW in
             Onword                       only       sentence     Not WUW
             Number of Detections:           191           446        1293
             Total Words:                    208         2629        12625
             Total WUW Words:                208           328           0
             Correct Recognition:            191           311           0     Correct Recognition:       93.66%
             Correct Rejection:                  0       2166        11332     Correct Rejection:         90.43%
             False Acceptance:                   0         135        1293     False Acceptance:           9.57%
             False Rejection:                 17            17           0     False Rejection:            6.34%

                                          WUW        WUW in
             Operator                     only       sentence     Not WUW
             Number of Detections:           186           403         696
             Total Words:                    213         2578        12670
             Total WUW Words:                213           369           0
             Correct Recognition:            186           343           0     Correct Recognition:       90.89%
             Correct Rejection:                  0       2149        11974     Correct Rejection:         94.92%
             False Acceptance:                   0          60         696     False Acceptance:           5.08%
             False Rejection:                 27            26           0     False Rejection:            9.11%

                                          WUW        WUW in
             ThinkEngine                  only       sentence     Not WUW
             Number of Detections:           184           365         184
             Total Words:                    202         2770        12489
             Total WUW Words:                202           322           0
             Correct Recognition:            184           302           0     Correct Recognition:       92.75%
             Correct Rejection:                  0       2385        12305     Correct Rejection:         98.35%
             False Acceptance:                   0          63         184     False Acceptance:           1.65%
             False Rejection:                 18            20           0     False Rejection:            7.25%



                                          WUW        WUW in
             Voyager                      only       sentence     Not WUW
             Number of Detections:           173           332         976
             Total Words:                    208         2763        12514
             Total WUW Words:                207           300           0
             Correct Recognition:            173           276           0     Correct Recognition:       88.56%
             Correct Rejection:                  1       2407        11538     Correct Rejection:         93.11%
             False Acceptance:                   0          56         976     False Acceptance:           6.89%
             False Rejection:                 35            23           0     False Rejection:           11.44%

                                          WUW        WUW in
             Wildfire                     only       sentence     Not WUW
             Number of Detections:           191           291         351
             Total Words:                    216         2805        12440
             Total WUW Words:                216           290           0
             Correct Recognition:            190           259           0     Correct Recognition:       88.74%
             Correct Rejection:               -1         2483        12089     Correct Rejection:         97.43%
             False Acceptance:                   1          32         351     False Acceptance:           2.57%
             False Rejection:                 26            31           0     False Rejection:           11.26%
                            Table 5. Microsoft Speech Recognition Engine Test Results – WUWII Corpus
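For reference, the Matlab sketch below shows how the summary percentages in Table 5 follow from the raw counts, using the "Onword" block as an example; the variable names are illustrative only.

   % Raw counts for Onword, per list: WUW only, WUW in sentence, not WUW.
   correctReco   = [191  311     0];
   falseReject   = [ 17   17     0];
   correctReject = [  0 2166 11332];
   falseAccept   = [  0  135  1293];
   totalWuw    = sum(correctReco) + sum(falseReject);      % 536 WUW tokens
   totalNonWuw = sum(correctReject) + sum(falseAccept);    % 14926 non-WUW tokens
   correctRecognition = 100 * sum(correctReco)   / totalWuw;     % 93.66%
   correctRejection   = 100 * sum(correctReject) / totalNonWuw;  % 90.43%
   falseAcceptance    = 100 * sum(falseAccept)   / totalNonWuw;  %  9.57%
   falseRejection     = 100 * sum(falseReject)   / totalWuw;     %  6.34%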






D. Phonebook Corpus Tests
The Phonebook corpus consists of 93,267 files, each containing a single utterance surrounded by two small periods of silence. The files are utterances recorded by 1,358 unique individuals.
For this experiment the word "Operator" was chosen as the Wake-Up-Word. The word Operator did not appear in the data set, so each of the 93,267 utterances was labeled out-of-vocabulary. The same engine parameters were used as in the CCW17 and WUWII tests.

   Test Run: 7/12/2007    Sensitivity: 1/8    Speed: 1/2
   Total Out-of-Vocabulary Words:   93267
   Total False Acceptance:          11603
   Total Correct Rejection:         81664
   Correct Acceptance:    0.00%
   Correct Rejection:    87.56%
   False Acceptance:     12.44%
   False Rejection:       0.00%

   Table 6. Microsoft Speech Recognition Engine Test Results – Phonebook Corpus

E. CallHome Corpus Tests
The CallHome corpus consists of two data sets, DevTest and Train. Each data set contains files of recorded continuous speech up to one-half hour long. The conversations were recorded by volunteers over international phone lines.
The DevTest data set contains 42089 utterances, each belonging to one of 3685 unique words. The Train data set is larger, containing 165546 utterances, each belonging to one of 8577 unique words.
For this experiment the word "Operator" was again chosen as the Wake-Up-Word. The word Operator did not appear in the data set, so each of the utterances was labeled out-of-vocabulary. The same engine parameters were used as in the CCW17, WUWII, and Phonebook tests.

   Test Run: 7/12/2007    Sensitivity: 1/8    Speed: 1/2

   Set: DevTest
   Total Out-of-Vocabulary Words:   42089
   Total False Acceptance:            333
   Total Correct Rejection:         41756
   Correct Acceptance:    0.00%
   Correct Rejection:    99.21%
   False Acceptance:      0.79%
   False Rejection:       0.00%

   Set: Train
   Total Out-of-Vocabulary Words:  165546
   Total False Acceptance:            934
   Total Correct Rejection:        164612
   Correct Acceptance:    0.00%
   Correct Rejection:    99.44%
   False Acceptance:      0.56%
   False Rejection:       0.00%

   Table 7. Microsoft Speech Recognition Engine Test Results – CallHome Corpus

F. Voice Activity Detector Tests (VAD)
The voice activity detector (VAD) is responsible for separating the voiced portions of the speech signal in the time domain from the non-voiced components. The separated spoken data is then passed on to the scoring algorithms for processing. While recognition algorithms exist which do not require preprocessing to remove non-voice data, the voice activity detector reduces the computational load and may increase accuracy by scoring only the data which contains human speech.
The following figure is a representation of a speech signal manually annotated to show the ideal points where the automatic voice activity detector should activate.






Fig. 2. Manual VAD Segmentation Marking − Four "Operator" Utterances (waveform with manually marked voice activity labels; amplitude vs. time in milliseconds).
The first revision of the voice activity detector, used during the start of the program, was based on differences of frame Mel-Filtered Cepstral Coefficients (MFCC), log energy, and the spectral difference from its longer-term average computed from non-speech sections of the utterance. A series of threshold values was selected based on an analysis of the performance of the VAD on actual data. The threshold values were used to classify a given feature frame as either VAD_ON, indicating that the frame contained voice/speech data, or VAD_OFF, indicating that the frame did not contain human voice/speech data. Post-processing aimed at delineating individual words in an utterance determined the final VAD classification result, deciding whether the current speech/non-speech section was part of a word or part of an inter-word silence. This algorithm required a buffering window of 20 frames (100 msec of speech).
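A rough sketch of this kind of frame-level threshold decision is shown below; the feature combination and threshold values are placeholders chosen for illustration and are not the tuned values or the full MFCC-difference features used by the first-revision detector.

   % E: per-frame log energy, D: per-frame spectral difference from the
   % long-term non-speech average (both 1-by-N vectors, names assumed).
   thEnergy   = -50;                 % placeholder thresholds
   thSpectral = 0.2;
   vadOn = (E > thEnergy) & (D > thSpectral);        % frame-level VAD_ON / VAD_OFF
   % crude post-processing over the 20-frame (100 ms) buffer: keep a frame
   % ON only if more than half of the surrounding 20 frames are ON
   vadOn = conv(double(vadOn), ones(1, 20) / 20, 'same') > 0.5;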
Upon examining the results from the initial round one tests on both the CCW17 and WUWII corpora, a few issues were found with the VAD triggering positions. The voice activity detector was proficient at picking out voiced data but was not accurate enough in determining whether the leading and trailing edges of select utterances were voiced or not. As a result some of the words in the test corpora were truncated when passed to the recognition engine, resulting in an inaccurate score.
In order to improve the performance of the VAD, two new approaches were selected to investigate potential alternative solutions: Artificial Neural Networks and Support Vector Machines.
Both potential new voice activity detection methods utilized the same means of inputting data, as an input vector. The data chosen were generated by choosing a subset of the Mel-Filtered Cepstral Coefficients: the first five of the twelve coefficients and the energy feature were selected and arranged into a 6x1 input vector for each feature vector. Depending upon the experiment, a group of input vectors was combined to form an nx1 input vector for the classifier.
To develop the training data, a series of single utterance files was combined to form a continuous file containing the individual utterances separated by the non-voiced silence data padding the utterances in their individual files.
For each nx1 input vector a label of "1" for containing voice/speech data or "-1" for containing non-voice/non-speech data was assigned to the window. The label of voice/speech was selected if more than one-half of the input vectors constituting the classifier input were contained in a region specified as voice/speech by the VAD training labels.
For the data sets used in training and testing of the neural network and support vector machine models, the VAD training labels were generated using Microsoft's speech recognition engine. Microsoft's engine's voice activity engage and disengage points are generated by forced alignment, which utilizes the recognizer and a grammar containing the word actually contained in the file. In return the recognizer, in addition to the score, returns the start and end points of the recognized utterance. By using Microsoft's engine to generate the VAD timing for the input data, the training labels for large sets of data can be created quickly and accurately.
1) Test Data Sets
Initial test and training datasets contained input vectors consisting of 20 six-element feature vectors combined into a 120x1 input vector. The 20 feature vectors were selected as the current vector being examined plus the previous nineteen vectors. A label of positive 1, indicating voice/speech data, was assigned if ten or more of the feature vectors contained in the window of 20 overlapped a voiced region of the signal. The resulting datasets contained two matrices for both training and testing data. The first matrix was of size 120 by m, containing all m input vectors, where m is the number of feature vectors in the source files minus the window size of 20 vectors. The second matrix was of size 1 by m and contained the voiced or non-voiced label for the associated column in the data matrix.
Additional test and training data sets were also created containing input vectors with window sizes of thirty and five feature vectors, respectively. The setup and labeling of these two additional sets were consistent with the twenty vector window set described above.
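The construction of the twenty-frame training matrices can be sketched in Matlab as follows; F is assumed to be a 6-by-N matrix of per-frame features (MFCC 1–5 plus energy) and vadLabel a 1-by-N vector of +1/−1 frame labels, with both names illustrative rather than taken from the project's scripts.

   W = 20;                              % frames per classifier window
   m = size(F, 2) - W;                  % number of usable windows
   X = zeros(6 * W, m);                 % 120-by-m input matrix
   y = -ones(1, m);                     % 1-by-m labels, default non-voiced
   for k = 1:m
       win = k : k + W - 1;             % current frame plus the previous nineteen
       X(:, k) = reshape(F(:, win), [], 1);
       if sum(vadLabel(win) == 1) >= W / 2    % ten or more voiced frames
           y(k) = 1;
       end
   end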


2) Artificial Neural Networks
Artificial neural network theory was developed as an attempt to mathematically describe the functioning of biological neural systems and, specifically, their learning capabilities. One such neural network is the feed-forward network with a back-propagation learning algorithm [Ref]. It is a network comprised of a series of layers containing n nodes combined via weighted connections to each node in the previous and following layers. The number of nodes in each layer and the values selected for the connection weights determine the behavior of the network when presented with an input.
To train an artificial neural network, input vectors from the dataset are presented as a group along with their desired labels. The training algorithm then adjusts the connection weights to reduce the error between the output of the network when presented with an input vector and the desired output as given by the training voiced or non-voiced label.
For this experiment the Resilient Backpropagation algorithm was ultimately used to train the feed-forward network. Resilient Backpropagation was selected because of its efficient use of computational memory compared to Matlab's implementation of the Levenberg-Marquardt training algorithm.
The network topology chosen included an input layer, which was sized according to the input vector, a single hidden layer containing a number of nodes dependent on the model being generated, and an output layer containing a single node. The network models were trained on the input data generated above, using the input vectors as the training data and the voice activity labels as the training labels.
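A minimal sketch of this training setup, using the Neural Network Toolbox functions available at the time (exact function names vary between Matlab releases), is given below. X and y are the input matrix and label vector built in the previous section, the 40-node hidden layer is one of the configurations examined, and the output threshold is a placeholder.

   net = newff(minmax(X), [40 1], {'tansig', 'tansig'}, 'trainrp');  % resilient back-propagation
   net.trainParam.epochs = 300;
   net = train(net, X, y);    % a separate validation set was used for early stopping (not shown)
   out = sim(net, X);         % raw network response
   vadOn = out > 0;           % threshold the response to obtain VAD labels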
Due to memory and training time limitations, initial networks were trained on small data sets of one to five utterances containing between eight and fifty thousand input vectors. These initial tests proved compelling, as the results of tests on single utterance files were fairly accurate. Unfortunately, when examining larger test files which contained a spoken sentence, the network response degraded; for example see Fig. 14.

Fig. 3. Dual Utterance Trained Neural Network Test on a Single Utterance CCW17 Corpus Test File (waveform, neural network output, and Microsoft VAD labels; amplitude vs. milliseconds).

In an attempt to increase network accuracy a number of test models were generated. To analyze the performance of the network, a number of model and training data parameters were adjusted: increasing the size of the training data, adjusting the number of nodes in the hidden layer, and adjusting the number of feature vectors per window in the training data set.
For each of the later models the training and testing data sets were increased to three hundred single utterance files pulled from the CCW17, WUWII and Phonebook corpora. Each new training and testing set contained approximately one hundred and eighty-six thousand input vectors; the exact number of vectors depends upon the size of the specific utterance files selected. An additional and separate validation set was generated to be used in training. For each pass of the data, training was halted if further adjustment of the connection weights would increase the error on the validation set.
The models generated using the large three-hundred word, twenty feature window data sets performed poorly on the continuous sentence test file. In the following two figures the output of the neural network with forty neurons in the hidden layer has been overlaid on top of the full sentence file from the WUWII corpus and then the single utterance file from the CCW17 corpus.

Fig. 4. 300 Utterance Trained 40 Neuron Neural Network Test on a WUWII Corpus Sentence Test File (waveform with neural network output).

Fig. 5. 300 Utterance Trained 40 Neuron Neural Network Test on a Single Utterance CCW17 Corpus Test File (waveform with neural network output).

When the neural network topology was adjusted to include twice the number of input vector values (240 nodes) in the hidden layer, the network response accuracy remained poor. In the following two figures the 240 neuron hidden layer model trained on the same data set was applied to the same two files as the network above. The output of the network has been plotted on top of the input waveforms.

Fig. 6. 300 Utterance Trained 240 Neuron Neural Network Test on a WUWII Corpus Sentence Test File (waveform with neural network output).





Fig. 7. 300 Utterance Trained 240 Neuron Neural Network Test on a Single Utterance CCW17 Corpus Test File (waveform with neural network output).

As with the 40 neuron hidden layer model, the 240 neuron hidden layer model's response was unacceptable when compared to the current voice activity detector implementation.
To investigate potentially better models, a number of experiments were run, changing the number of neurons in the hidden layer and adjusting the number of feature vectors per input window. The optimal models were created with a hidden layer containing ten to fifteen nodes, trained on the set containing one feature vector per window. The following network response shows the result of the single sentence WUWII test file presented to the 10 neuron hidden layer network.

Fig. 8. 300 Utterance Trained, 1 Feature Window 10 Neuron Neural Network Test on a WUWII Corpus Sentence Test File (waveform with neural network output).

Fig. 9. 300 Utterance Trained, 1 Feature Window 10 Neuron Neural Network Test on a CCW17 Corpus Single Utterance Test File (waveform with neural network output).

By adjusting the selected threshold value, the accuracy of the VAD on and off labeling can be varied. Additional tools may be developed to choose optimal threshold values given a set data set.
Research into the neural network voice activity detector was suspended in order to dedicate resources to examining a support vector machine implementation. Initial research into support vector machines indicated that greater accuracy might be achieved by classifying the same input data vectors.

G. Support Vector Machines Classifier
Support Vector Machines (SVM) is a classification algorithm used to determine a group label given a particular input. In this work, freely available tools were utilized to implement the actual training and prediction components, while custom interfaces were developed to package the data and respond to the output.
In a similar manner as with the neural network training above, initial models were generated using the twenty feature window datasets. Further tests were performed on larger datasets and progressed to fewer feature vectors per window, comparable to the method used to investigate the neural network implementation.
1) Training Time
Initial tests showed that training time for the SVM classifier models was significantly reduced compared to the neural network models. Training a neural network model on approximately fifty-thousand 20-feature-window input vectors would take hours, whereas an SVM model would train on the same data set in less than thirty minutes. Training an SVM model on the large three-hundred utterance data set took approximately three hours to complete with the computational resources available.
2) Model Training
Practical implementation of SVM model creation, training and testing was performed using libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/).
Three types of SVM kernel were examined for training. The majority of the models were trained using a kernel based upon the radial basis function, while additional models were trained using kernels based upon polynomial and linear functions.

   Radial Basis Function:   exp(−gamma * ||u − v||^2)
   Polynomial:              (gamma * u' * v + coef0)^degree
   Linear:                  u' * v

   Table 8. Kernel Types with Functions

The polynomial kernel did not converge when training on the data during the three-day period it was allowed to run. The linear model took longer to converge than the radial basis function and was used only at the end of the experiment, after direct observation of the feature data. For the majority of the models the radial basis function provided the best balance between training time and accuracy.
3) SVM Challenges
The initial models generated from the twenty feature vector windows on the large data sets were fairly accurate.



The challenge with these models was the creation of a large number of support vectors. Early models contained approximately sixty-thousand support vectors. Each support vector represents a vector calculation which must be performed for each input feature vector, and the larger the input vectors, the more computationally intensive each calculation becomes. For a live voice activity detection implementation a feature vector is processed every five milliseconds; with roughly sixty-thousand support vectors this amounts to on the order of twelve million kernel evaluations per second. The resulting early models proved too computationally intensive to run in real time.
Two methods were examined to reduce the number of support vectors while retaining a reasonable level of accuracy. The first was to adjust the model training parameters to increase the generality of the model. The second was to reduce the number of feature vectors per window in the training data, much in the same way as when examining the neural network models.
Two support vector machine model training parameters were available for adjustment: gamma and cost. Adjusting the value selected for gamma changed the shape of the classification surface. Adjusting the value selected for cost changed the degree of error associated with points which were outliers in the region labeled as the opposing classification. In principle, by increasing gamma and decreasing cost, a more general model could be generated.
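A sketch of this training step with the libsvm Matlab interface is shown below; the gamma and cost values are illustrative rather than the finally selected parameters, and the test matrices Xtest and yTest are assumed to be built the same way as the training data. libsvm expects one training instance per row, so the matrices built earlier are transposed.

   opts  = '-s 0 -t 2 -g 0.5 -c 10';        % C-SVC, radial basis kernel, gamma, cost
   model = svmtrain(double(y'), double(X'), opts);
   fprintf('support vectors in model: %d\n', model.totalSV);
   [pred, acc, dec] = svmpredict(double(yTest'), double(Xtest'), model);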

4) Implementation Experimentation
As with the artificial neural network implementation, initial tests using the small set of data containing one to five utterances produced promising results. In the following figure the SVM classifier output is plotted in red along with the voice/speech or non-voice/non-speech labels in green, and the Microsoft engine based training labels in blue. The Microsoft labels can only be clearly observed in areas where there is no overlap with the labels chosen by the SVM.


[Figure: raw SVM voice activity detector output plotted together with the Microsoft VAD labels and the thresholded SVM VAD labels; amplitude vs. feature vector index (1 feature vector per 5 milliseconds).]
Fig. 10. Voice Activity Labels for Microsoft and Support Vector Machines

Like the neural networks, the optimal balance between accuracy and the number of support vectors was achieved on the single feature vector window training and testing sets. A model was generated with a gamma value of 8*10^-6 and a cost value of 35. This model contained 30,677 support vectors. The total number of support vectors was reduced from the close to sixty thousand of the initial models, while the size of the feature window was reduced from 120 values per input vector to 6 values per input vector. The reduction in model and data size resulted in a corresponding reduction in computational complexity.

[Figure: SVM voice activity labels for a single CCW17 utterance; amplitude vs. sample index.]
Fig. 11. Voice Activity Labels for Microsoft and Support Vector Machines on CCW17 Single Utterance

The following figure is obtained from a single sentence file from the WUWII corpus.

[Figure: waveform and SVM VAD output for the 300 utterance, 1 feature window trained support vector machine on a WUWII sentence; amplitude vs. milliseconds.]
Fig. 12. Voice Activity Labels from Support Vector Machines VAD on WUWII Long Sentence

5) Neural Network VAD / SVM VAD Comparison

The following figures represent the voice activity detection labels output from both the Neural Network VAD and SVM VAD implementations. In both cases the chosen models were trained on the 300 utterance dataset with one feature vector per window. For the majority of the data the neural network and support vector machine voice activity detectors produce similar responses; in some cases, however, the SVM version outperforms the neural network equivalent.
The following two figures contain a subset of the output of the Neural Network and SVM VAD implementations when the WUWII single sentence test file was passed as the input. As in the single utterance example above, there are instances where the SVM version of the voice activity detector outperforms the corresponding neural network detector.
The neural network response to a given sub-section of the input waveform is shown first. With the proper threshold value chosen, the network performs similarly to the support vector machine version for the first utterance. The third utterance, however, is dropped by the neural network while the SVM detector correctly labels it as voice active.



[Figure: waveform and 10-neuron neural network output for the 300 utterance, 1 feature window trained neural network segmentation on a WUWII sentence; amplitude vs. milliseconds (approximately 9200-10400 ms).]
Fig. 13. Neural Network Output on a WUWII Long Sentence Subsection

To obtain a more accurate VAD labeling, the neural network threshold must be set very low. While in this example the third utterance would still be excluded, the VAD_ON and VAD_OFF times for the fourth utterance could be made more accurate with a threshold value of -0.075. The challenge with a lower threshold is that a greater number of false triggers occurs, as can be observed in the noise leading the first utterance.
Support vector machine VAD classification alleviates many of these challenges by providing a balanced, more accurate result. SVM VAD is not without its own issues, as can be observed leading the final utterance, where the voice activity labels can be seen to be intermittent leading into the full spoken word.
[Figure: waveform and support vector machine VAD output for the 300 utterance, 1 feature window trained SVM segmentation on a WUWII sentence; amplitude vs. milliseconds (approximately 9200-10400 ms).]
Fig. 14. Support Vector Machines Output on a WUWII Long Sentence Subsection

The accuracy of the neural network implementation varies depending on the value chosen for the decision threshold. Regardless of the value chosen, though, it is clear that the SVM implementation provides a cleaner segmentation with fewer incorrect VAD_ON labels. When comparing the chosen models for each of the two methods using a sixty-utterance test set and Microsoft's VAD labels, the artificial neural network implementation scored an overlap of 96.45%. The SVM implementation scored an overlap of 97.00% on the same data.
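The overlap score is not formally defined in this report; one simple way such a figure could be computed is the percentage of frames on which a detector's labels agree with the Microsoft reference labels, as in this hedged MATLAB sketch (it assumes both label vectors hold one 0/1 value per frame and are the same length):

  % Hypothetical frame-agreement measure between detector labels and the
  % Microsoft reference labels.
  overlap = 100 * sum(det_labels == ms_labels) / numel(ms_labels);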

XI. HMM PARAMETER SEARCH

The accuracy of the Hidden Markov Model scoring algorithm is dependent upon the model parameters used in training. To find ideal parameters, a grid search could be done over all possible parameter combinations. With 6 parameter dimensions, a full grid search over all possible combinations was not practical given the available computational resources and time allotted. To circumvent these limitations, a limited search was devised which examined only a select subset of the potential combinations, chosen with practical evaluation and past results in mind.
A MATLAB script was devised to input a group of arrays with the test cases to be examined, generate the models using a chosen standardized set of utterance files, and test each model against a data set. Of the six potential model parameters, three were adjusted and tested: States, Mixtures, and Skip States. The remaining three, Dimensions, Silence States, and Iterations, were held constant at values deemed reasonable for the type of model being generated.
The following table contains the results from the search experiment. From these results, the optimal model parameters chosen as the best balance between accuracy and computational complexity were 39 dimensions, 30 states, 6 mixtures, 1 skip state, 2 silence states, and 2 training iterations.






Dims  States  Mixes  Skips  Sil.St  Iters  Corr.Accept  Total INV  INV %    Corr.Reject  Total OOV  OOV %
 39     20      2      0      2       2        350          379     92.35%      3803        3813     99.74%
 39     20      2      1      2       2        349          379     92.08%      3866        3878     99.69%
 39     20      4      0      2       2        361          379     95.25%      3811        3813     99.95%
 39     20      4      1      2       2        355          379     93.67%      3875        3878     99.92%
 39     20      6      0      2       2        368          379     97.10%      3810        3813     99.92%
 39     20      6      1      2       2        371          379     97.89%      3874        3878     99.90%
 39     25      2      0      2       2        363          379     95.78%      3774        3783     99.76%
 39     25      2      1      2       2        353          379     93.14%      3869        3878     99.77%
 39     25      4      0      2       2        360          379     94.99%      3781        3783     99.95%
 39     25      4      1      2       2        360          379     94.99%      3874        3878     99.90%
 39     25      6      0      2       2        370          379     97.63%      3780        3783     99.92%
 39     25      6      1      2       2        369          379     97.36%      3876        3878     99.95%
 39     30      2      0      2       2        356          378     94.18%      3772        3778     99.84%
 39     30      2      1      2       2        353          379     93.14%      3873        3878     99.87%
 39     30      4      0      2       2        362          378     95.77%      3776        3778     99.95%
 39     30      4      1      2       2        361          379     95.25%      3875        3878     99.92%
 39     30      6      0      2       2        371          378     98.15%      3775        3778     99.92%
 39     30      6      1      2       2        370          379     97.63%      3878        3878    100.00%
 39     35      2      0      2       2        358          378     94.71%      3755        3758     99.92%
 39     35      2      1      2       2        360          379     94.99%      3844        3849     99.87%
 39     35      4      0      2       2        364          378     96.30%      3757        3758     99.97%
 39     35      4      1      2       2        362          379     95.51%      3846        3849     99.92%
 39     35      6      0      2       2        365          378     96.56%      3758        3758    100.00%
 39     35      6      1      2       2        372          379     98.15%      3848        3849     99.97%
 39     40      2      0      2       2        358          377     94.96%      3697        3699     99.95%
 39     40      2      1      2       2        356          379     93.93%      3809        3813     99.90%
 39     40      4      0      2       2        366          377     97.08%      3697        3699     99.95%
 39     40      4      1      2       2        369          379     97.36%      3809        3813     99.90%
 39     40      6      0      2       2        367          377     97.35%      3698        3699     99.97%
 39     40      6      1      2       2        369          379     97.36%      3813        3813    100.00%
 39     45      2      0      2       2        359          377     95.23%      3627        3629     99.94%
 39     45      2      1      2       2        361          379     95.25%      3785        3788     99.92%
 39     45      4      0      2       2        366          377     97.08%      3629        3629    100.00%
 39     45      4      1      2       2        367          379     96.83%      3785        3788     99.92%
 39     45      6      0      2       2        367          377     97.35%      3629        3629    100.00%
 39     45      6      1      2       2        374          379     98.68%      3786        3788     99.95%
 39     50      2      0      2       2        360          377     95.49%      3563        3566     99.92%
 39     50      2      1      2       2        363          379     95.78%      3780        3783     99.92%
 39     50      4      0      2       2        363          377     96.29%      3561        3566     99.86%
 39     50      4      1      2       2        369          379     97.36%      3780        3783     99.92%
 39     50      6      0      2       2        368          377     97.61%      3565        3566     99.97%
 39     50      6      1      2       2        370          379     97.63%      3782        3783     99.97%
                Table 9. Hidden Markov Models – Optimized Model Parameter Search
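The search script itself is not reproduced in the report. A minimal MATLAB sketch of the kind of loop that could generate and train the configurations in Table 9, using the gen_model, hmm_train, and hmm_reco helpers described in Section XIII, might look as follows (model names and list paths are illustrative only):

  % Hypothetical parameter-search driver (not the original script).
  states   = [20 25 30 35 40 45 50];
  mixtures = [2 4 6];
  skips    = [0 1];
  for s = states
    for m = mixtures
      for k = skips
        name = sprintf('search_%ds_%dm_%dk.bhmm', s, m, k);
        gen_model(name, 39, s, m, k, 2);                 % 39 dimensions, 2 silence states
        hmm_train(name, 'seg_data\invoc.list', 2);       % 2 training iterations
        hmm_reco(name, 'seg_data\invoc.list',  sprintf('INV_%ds_%dm_%dk.csv', s, m, k));
        hmm_reco(name, 'seg_data\outvoc.list', sprintf('OOV_%ds_%dm_%dk.csv', s, m, k));
      end
    end
  end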


XII. INTRODUCTION TO E-WUW PROCEDURES, ERROR ANALYSIS, AND PITCH DETECTION

The already high standard set by the Wake-up-Word technology, both the Dynamic Time Warping and Triple Scoring Methods, begs for improvement. This undertaking took two forms for one section of the group: the first was error analysis, so as to better understand the few mistakes that the recognizer did make, and the second was the development of new features that might increase accuracy in the future.

The research team went through several methods to analyze errors made by the recognizer and to remove outliers and anomalies so that the training data would be the best possible, and to track down problems that might be caused by the extensive programming infrastructure already laid down at the beginning of the research problem. Error analysis and the removal of these outliers allowed for improved model building and the development of several tools that could be used to analyze the data that entered the recognizer. Error analysis was also necessary to prove that the Triple Scoring Method truly was superior to other recognition packages available for the application of Wake-up-Word.

To date, the current recognition package that makes up e-WUW deals solely with data that has had the majority of its pitch information removed. To gain accurate results for the method being used this was completely necessary. This is not to imply that the data being removed did not in some way carry information that might also be pertinent to Wake-Up Word technology. This data takes the form of the human range of pitch, and it was theorized by the research party that there might exist a connection between pitch and the use of Wake-up Words. So as to move forward and examine this possibility, it was required that some method of detecting pitch be implemented. This led to an implementation of the enhanced Super Resolution Fundamental Frequency Detection algorithm (eSRFD).

XIII. INSTRUCTIONS FOR THE IMPLEMENTATION OF TSM USING EXISTING TOOLS

The process by which we build our models and test them can be broken into three distinct phases. The first phase is the feature generation phase, in which programs are called to analyze sound files, in this case .ulaw, and extract different features. Features can be viewed as different characteristics of the sound. 39 different features are then stored in a FEF file, to be used for the second part of the process. The second and third distinct phases are the actual model training and


classification phases, where features are loaded into memory and the computer runs programs that build models; then, using those models, different test cases are classified as either in-vocabulary (INV) or out-of-vocabulary (OOV). Once the model is built we move into the last phase of our design process, which is testing the model: new data that was not included in the creation of the model is presented to the recognizer and its response is observed and recorded, based on which, knowing the truth, the performance of the model can be quantified.

A. Features
Feature generation is done through a command line program called run_genFEF.exe, programmed by Dr. Veton Kepuska. The purpose of this program is to run through a number of files and extract features from them. The program has the following syntax (this is the help text printed when run_genFEF.exe is typed with no parameters at the command line):
         Usage: run_genFEF
              Options:
                  <filename.conf>
              and/or
                  -s    sample rate    <[8]|16> in kHz
                  -f    sample type    <[ulaw], alaw, short, nist>
                  -h     header size   <[1024]>
                  -i    input data dir <d:/path or ../path>
                   -T     trans file     <c:/path/corpus.trans or ../path/corpus.trans>
                  OR
                  -L     list file       <c:/path/list.dat or ../path/list.dat>
                  -o     output FEF dir <o:/path or ../path
                  -M      model files   <m:/path/WUWmodel1.dtw m:/path/WUWmodel2.dtw ...
                  -G      generate FEFs <true, [false]> binary switch - if used is set to "true"
                  -m      monitor       <true, [false]> binary switch - if used is set to "true"

  The following is an example of the syntax used to call the function:

  run_genFEF -L <full path to the list file> -i <path to where the corpus is stored locally> -o <nonexistent output directory> -d <directory where feature files should be stored> -G

The above example requires some explanation in order for a user of the run_genFEF program to understand its usage. Before this program can run it will require a list file. This file can be typed manually in a text editor such as Notepad, or it can be created from the command prompt using commands (e.g., the Windows command window) that will be provided here. The list file needs to contain the paths relative to the directory that you input in parameter -i, and each entry must be separated by a newline.

The list file contains the paths to all the files you would like the run_genFEF.exe program to extract features from. The easiest way to create this file is to use the following command in the command prompt. Take the command prompt to the path that is one directory above, or preceding, the place where the .ulaw file directories are stored; the following command will output a file in the .list format with the paths to these directories.

  dir /A:-D /S /B > C:\Path of file\name of file.list

There are several parameters in this DIR command. The /A option assigns attributes to the search; in this case the attribute is -D, which excludes directories from the listing. The /S option displays files in the directory and all subdirectories. The /B option tells the DIR command to print in bare format, meaning it will not print any headings or summary information. The last part of the DIR command, "> C:\Path of file\name of file.list", tells the DIR command not to output to the screen but to a file in the specified path (if no path is specified the file will be output to whatever location the command is called from) with the file name specified in the command and the extension .list.

There are some other important properties of the run_genFEF.exe command called above. First, the parameter -G tells the program to generate the features when it runs. This flag must be set or the program will appear to do absolutely nothing and will not create any data for the later programs to use. The -o parameter does not do anything, but still must be set. The directory does not even need to exist; however, the program cannot run without the -o parameter being set. The parameter acts as a flag in some parts of the program, undoubtedly left over from its early development days. Below is an example of an actual call made by our team using the run_genFEF.exe program.

  run_genFEF -L C:\Amalthea\Corpra\WUWII_Corpus\lists\OOVUlaw.list -i C:\Amalthea\Corpra\WUWII_Corpus\calls -G -o fef_release -d C:\Amalthea\run_e-wuw\hmm\data\input\seg_data\WUWII\OOV
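The list file passed with -L simply contains one path per line, relative to the directory given with -i. For the call above it would therefore hold entries of the following form (these particular file names are hypothetical, shown only to illustrate the layout):

  call_0001\utt_001.ulaw
  call_0001\utt_002.ulaw
  call_0002\utt_001.ulaw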



The run_genFEF program can be intensive and take some time to run, though this is mostly determined by the size of the corpus it is analyzing (i.e., the number of files in the list and their size). It will output .fef files as well as .dat files into the specified folder. The .fef files contain the features for the segments analyzed by the run_genFEF.exe program. The .dat files contain the begin and end points, in msec, of the segments detected by the recognizer, more specifically determined by the Voice Activity Detector (VAD). This information describes the time frames that run_genFEF actually analyzed and generated features for. The .dat files are used mostly for debugging the program and for analysis of statistical anomalies found during the testing phase of the procedure. These files are not necessary for model building or for testing and can be deleted if a user will not be using them. The .fef files have no header; the values are stored in binary and need to be read in as 32-bit floating point values when they are interpreted in later code.
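Because the .fef files are headerless and hold raw 32-bit floats, they can be loaded directly; a hedged MATLAB sketch (assuming native byte order and 39 features per frame, as described above; the file name is illustrative):

  % Hypothetical reader for a headerless FEF file.
  fid  = fopen('example.fef', 'rb');
  vals = fread(fid, inf, 'float32');
  fclose(fid);
  frames = reshape(vals, 39, []).';   % one 39-dimensional feature vector per row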
B. Model Building

The model building process is where the core of our research has taken place and where the Triple Scoring Method (TSM) actually takes place. Several different programs have to be run in order to successfully build a model, and automation scripts utilizing configuration files have been created, but first a description of each program and how to call it will be provided.
Before any of these programs can be run the user must first build two list files. One list file should contain a path to all the features (just the .fef files, not the .dat) that are considered Out-of-Vocabulary (OOV - words or phrases that are not the Wake-up-Word), and one file should contain the features that are In-Vocabulary (INV - features for the Wake-up-Word). For certain corpora, such as the CCW17 corpus, only single words are stored and the words are repeated in a certain order within the file system, so a simple DIR command like the one used in the Feature generation section above, implemented with a wild card, can be used to create both lists. For some corpora, however, Wake-up-Words are mixed with regular vocabulary and phrases. For these corpora the transcription file provided with the corpus must be used to determine the correct data, usually using a parser of some sort to create a list of files that can be used as the list file.
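For a corpus like CCW17 the wild-card form of the earlier DIR command might look like the following, assuming, purely for illustration, that the Wake-up-Word appears in the feature file names (the exact pattern depends on how the corpus is actually laid out on disk):

  dir /A:-D /S /B *operator*.fef > C:\Path of file\invoc.list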
The first part of building a model is to train the Hidden Markov Models (HMMs); these are used to build a model based on the features. The program required to do this is a MATLAB script called gen_model.m. The gen_model script does not actually run the HMM training; it simply creates a blank model using the parameters provided and outputs it into a .bhmm file. The program can be run with the following syntax (note that MATLAB must be used to run a .m script):

  gen_model(<filename.bhmm>, <number of dimensions>, <number of states>, <number of mixtures>, <number of skip states>, <number of silent states>)

The different parameters are fairly self-explanatory from the example above. The first parameter is the filename of the blank model; it will be stored in the hmm/data/models directory, which is necessary to run the program. The rest of the parameters are for the HMM training and will define those properties of the training. This is where skip states, number of dimensions, etc. are defined for an HMM model. A real world example is as follows:

  % 39 dimensions, 40 states, 2 mixtures, 0 skips, 2 silence states
  gen_model('my_model.bhmm', 39, 40, 2, 0, 2);

This is the example code for a blank model called my_model.bhmm, with 39 dimensions, 40 states, 2 mixtures, no skip states, and 2 silent states.

After a blank model has been generated, HMM training needs to take place. This can be done in one of two ways. A MATLAB script called hmm_train.m exists that calls the necessary program from the command prompt to run the HMM training. The other option is to call hmm.exe manually without using hmm_train.m. The hmm_train.m script takes a few parameters, all of which are used when it calls the hmm.exe program. Below is an example of the hmm_train syntax:

  hmm_train(<model file>, <input list name>, <number of training iterations>)

The first parameter input into hmm_train is the name of the blank model that was created using the gen_model program; by default the program will look in the hmm/data/models directory for the file, and if the file is in another location a path relative to that location must be given. The second parameter is the list of your in-vocabulary files. The hmm_train.m script by default looks in hmm/data/input for the file given in the second parameter; again, any file in another path must be given relative to this location. The last parameter is the number of iterations that the user specifies for the HMM training to go through. A real world example of the use of the hmm_train.m script is given below.

  % HMM training, 2 iterations
  hmm_train('my_model.bhmm', 'seg_data\operator.list', 2)

This example accesses the my_model.bhmm file located at hmm\data\model\my_model.bhmm and a list file, operator.list, located at hmm\data\input\seg_data\operator.list, and tells the program to do two iterations of training.



The hmm.exe program can be called without using the hmm_train.m MATLAB script. Unlike run_genFEF.exe, this program has no help file, so syntactic information is fairly rigid, and additional parameters and their effects are hard to discern from the program. Below is a syntactic example:

  hmm.exe -M <name of model> -I <input list name> -h <length of header> -t <number of iterations> -s <name of output file>

These parameters are very similar to the parameters for the hmm_train.m script. Unlike hmm_train.m, hmm.exe will not use a nice directory structure. By default it will only look for files in the directory where it is placed; if you want to use a directory structure or place files in another directory, relative paths must be provided to the hmm.exe program so it knows where those directories are. The -h parameter sets the length of the header; the newest versions of the VAD produce a header length of zero, but older versions might still use a header of 1024 bytes. It is important to note that the hmm.exe program is designed to do several tasks, so as well as training it is also used in the classification stage. The parameters set also act as flags: in this case the -t parameter not only tells the program to perform two iterations of training, but also that the program should be running its training algorithms. Without the -t parameter the program will try to run classification, so users need to be sure when using this program that they use the correct syntax to specify what they want the program to do. This will finish all the necessary steps for training a model.
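By way of illustration, a training call following the syntax above might look like this (the parameter values and paths are illustrative only; -h would be 0 for feature files produced by the newest VAD and 1024 for older ones, and -s simply names the output file):

  hmm.exe -M ..\data\models\my_model.bhmm -I ..\data\input\seg_data\operator.list -h 1024 -t 2 -s train_out.csv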
C. Classification
Classification is the last step toward developing a working recognizer. This is where the computer will decide which scores are in-vocabulary (INV) and which scores are out-of-vocabulary (OOV). A final classification model using Support Vector Machine (SVM) technology will be produced to be used in testing. The SVM model will essentially determine the classification boundary where it decides between out-of-vocabulary and in-vocabulary words. This classification is based on the scoring received from the HMM(s) with the corresponding model. The first program necessary to run classification is a MATLAB script called hmm_reco.m. Hmm_reco.m is very similar to the hmm_train.m script, in that they both call the hmm.exe executable and pass parameters along to it. As in the training section of this text, both the script method and the direct method of calling hmm.exe will be presented here.

Hmm_reco only has a few parameters required to make it run. The syntax for calling and using hmm_reco.m is as follows:

  hmm_reco(<model name.bhmm>, <name/rel. path of input list>, <name/rel. path of output file>)

A real world example of calling hmm_reco.m is given below.

  % HMM recognition of INV and OOV data
  hmm_reco('my_model.bhmm', 'seg_data\pca\operator.list', 'outINV.csv');
  hmm_reco('my_model.bhmm', 'seg_data\pca\all.list', 'outOOV.csv');

Please note two things concerning the above real world example. First, note that there are two calls given: hmm_reco.m needs to be run twice, once for the in-vocabulary files and a second time for the out-of-vocabulary files, or vice versa. Also note that the last parameter, the output parameter, outputs Comma Separated Variable (CSV) files. These files can be read by MATLAB or can be viewed in an Excel spreadsheet. The file type must be specified in the output parameter; if the third parameter is not set the program will automatically output all results to the MATLAB screen.

Hmm.exe must be run in a similar fashion: it needs to be run twice, once each for the in-vocabulary and the out-of-vocabulary files, and it will also output CSV files. The big difference is that using hmm.exe removes the necessity of running MATLAB, and that hmm.exe requires a few more parameters to run. An example of the syntax necessary to run hmm.exe in classification mode is given below:

  hmm.exe -I <rel. path/name of input list> -M <rel. path/name of model> > <rel. path/name of output file>

Hmm.exe, as in training mode, will assume that all files are located in the directory where it is, so all files will need a relative path from that directory if they are not located there. The last part of the command, "> <rel. path/name of output file>", tells the program to take its output and put it in the file specified, again with a relative path from the hmm.exe directory. The output will be CSV, and the file extension needs to be specified in the parameters given to the program. A real world example is given below:

  hmm.exe -I ..\data\input\operator.list -M ..\data\models\my_model.bhmm > ..\data\output\outINV.csv

Note that the lack of the -t parameter in this case is what tells hmm.exe to run classification instead of training. Also remember that, like hmm_reco.m, hmm.exe needs to be run twice, once for the out-of-vocabulary words and once for the in-vocabulary words. The scores will be saved in different CSV files and can then be compared to each other using a program such as MATLAB or Excel to graph the results.

The final step of classification is making a Support Vector Machine (SVM) model. A MATLAB MEX-file is required to train the SVM model. MEX-files are files coded in other languages such as C++ and Fortran, and have been saved in a way that allows MATLAB to use them.



The MEX-file used to create an SVM model is the svmtrain.mexw32 script. The syntactic example is as follows:

  model = svmtrain(<training labels>, <training vectors>, <parameters>)

This svmtrain function is called and its output is saved as a MATLAB variable in the example above. The function requires three inputs to run. The first input to the function is the training labels. Labels are either one or negative one, and the length of the input should be the same as the length of the out-of-vocabulary and in-vocabulary files combined. Negative one specifies that a file is out of vocabulary and one specifies that a file is in vocabulary. The easiest way to construct the training labels matrix is to create two 1xN matrices, where N is either the number of in-vocabulary or out-of-vocabulary files. Use the ones() function found in MATLAB to fill the matrices with either ones or negative ones (hint: to fill with negative ones, multiply the ones function by -1) and then concatenate the matrices together. The second input is the training vectors and should be the scores for the in-vocabulary and out-of-vocabulary files. These matrices need to be read into MATLAB and then concatenated together in the same order as the in-vocabulary and out-of-vocabulary labels were for the labels input. The last input into the program is the parameters for the program. Typical parameters used are '-g 0.008 -c 15', though other values can be used.
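As a hedged illustration of the construction just described (the score file names are those produced by the earlier hmm_reco example; the exact layout of those CSV files is an assumption):

  % Hypothetical sketch: build the libsvm inputs from the INV and OOV score files.
  inv_scores = csvread('outINV.csv');      % scores for the in-vocabulary files
  oov_scores = csvread('outOOV.csv');      % scores for the out-of-vocabulary files
  train_vectors = [inv_scores; oov_scores];
  inv_labels = ones(1, size(inv_scores, 1));
  oov_labels = -1 * ones(1, size(oov_scores, 1));
  train_labels = [inv_labels, oov_labels].';   % libsvm expects one label per row
  model = svmtrain(train_labels, train_vectors, '-g 0.008 -c 15');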
                                                                     reader’s attention that a configuration file must be preceded
                                                                     by a number, and that the files being run need to be in
                                                                     sequential order starting at one. This means the first file
D. Testing
                                                                     should contain a 1 as the last character in its filename and the
                                                                     second file should have a two as its last character and so on
The last step of the procedure is to test the model against data     (see example below).
that was not used in training, to determine that models have
not been over trained and that the model is robust enough to           run_ewuw('test_config1.txt',1)
handle new data. The procedure for this is similar to the
procedure used to create models and run the classifier. First
features must be generated for the data set being tested. This          This example will run the configuration file
process is identical to the generation of features done above.       ‘test_config1.txt’ located in the same directory as the
After features have been generated hmm_reco is used to               automation script. It will run only one iteration so it will only
compare the model built against the new data. Hmm_reco is            run ‘test_config1.txt’.     The configuration file has the
used in the same way it was above, except the that the list file     following format.
should be a list of the features for the test data and there
should be a different output, or you’ll overwrite your old
scores. These scores can then be analyzed using the SVM
classifier, by running the program svmpredict.m in MATLAB.
Its syntax is as follows:

 svmpredict(<testing labels>, <testing scores>, <model>)

The first input to the function is the testing labels, which are made the same way the training labels are made for the svmtrain function used above in the classifier stage of model design. The second input is a matrix that contains all the scores for the test data, and the third input is the model used to score the test data. Svmpredict's output will be a percentage that tells how accurately the model was able to correctly score in-vocabulary data as in vocabulary and out-of-vocabulary data as out of vocabulary.
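A hedged sketch of that final step, reusing the conventions above (the test score file name and label counts are illustrative only):

  % Hypothetical test scoring with the SVM model produced by svmtrain.
  % n_inv_test / n_oov_test are the (assumed) numbers of INV and OOV test files.
  test_scores = csvread('outTEST.csv');
  test_labels = [ones(n_inv_test, 1); -1 * ones(n_oov_test, 1)];   % truth, built as for training
  [predicted, accuracy] = svmpredict(test_labels, test_scores, model);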


E. Automation Scripts

To expedite the process of running full tests, an automation script was made that calls the MATLAB functions used in the procedure above. The script incorporates a text based configuration file for the experimenter to edit. The script also allows for the batch testing of different words, allowing as many as ten different models to be built in one experiment. The script, run_ewuw.m, has a simple syntax and is especially easy to run after the configuration file has been set up correctly. Run_ewuw can be run from any directory it is placed in as long as it has the correct directories to the files it calls. Below is the syntactic example for running the run_ewuw.m script:

  run_ewuw(<rel. path/name of configuration file>, <number of runs>)

The first input into the function should be the path and name of the file relative to the directory where the automation script is running from. The second input is the number of tests that you will be running. It should be brought to the reader's attention that each configuration file must be numbered, and that the files being run need to be in sequential order: the first file should contain a 1 as the last character in its filename, the second file should have a two as its last character, and so on (see the example below).

  run_ewuw('test_config1.txt', 1)

This example will run the configuration file 'test_config1.txt' located in the same directory as the automation script. It will run only one iteration, so it will only run 'test_config1.txt'. The configuration file has the following format.



                 %switch set 1 to generate features, 0 to not generate features
                 gen_features 0

                 %************************************************************
                 %start feature variable section, these variables need only be
                 %correct when generating features (gen_features set to 1)
                 %************************************************************

                 %directory of the release files
                 release_dir C:\amalthea\code\run_e-wuw\Win32\Release

                 %directory of where the corpra calls are stored
                 corpra_dir C:\amalthea\Corpra\CCW17_CORPUS\calls\

                 corpra_list C:\amalthea\code\run_e-wuw\lists\CCW17_Corpus.list

                 segdata_dir C:\amalthea\code\hmm\data\input\seg_data\newold

                 %************************************************************
                 %End feature variable section
                 %************************************************************

                 list_dir C:\amalthea\code\run_e-wuw\lists\
                 output_dir C:\amalthea\code\hmm\data\output\models\vkvad
                 matlab_dir C:\amalthea\code\hmm\matlab
                 graph_dir C:\amalthea\automation
                 input_list_dir seg_data\vkvad\
                 operator_list invoc.list
                 all_list outvoc.list

                 n_dim 39
                 n_states 30
                 n_mixtures 6
                 n_skip 1
                 n_silence 2
                 iter 2

                 %excepname is flag to add additional naming parameters
                 excepname 1

                 title nameinschemehere
                                             Table 10. Example of a configuration file


The configuration file is not inherently complicated once it has been explained. The first section deals with whether or not a person would like to generate features. If gen_features is set to 1, run_ewuw will generate features and use the variables listed below it. If gen_features is set to 0, then only the part of the configuration file from list_dir down needs to be set. For generating features there are four variables: release_dir, which points to the directory containing run_genFEF.exe; corpra_dir, which contains the directory where the corpus's .ulaw files are stored;




corpra_list, which points to the list file used for generating features; and segdata_dir, which points to where the feature files should be output. The second part of the configuration file contains the information necessary to run all the training and classification for the model. Below is an annotated chart to show the purpose of each variable:

      list_dir           - points to the directory containing feature lists
      output_dir         - directory for the output of model files
      matlab_dir         - directory containing MATLAB scripts
      graph_dir          - Directory containing graphs
      input_list_dir     - points to the directory containing in vocabulary and out of vocabulary lists
      operator_list      - name of in vocabulary list
      all_list           - name of out of vocabulary list
      n_dim              - number of dimensions
      n_states           - number of states
      n_mixtures         - number of mixtures
      n_skip             - number of skip states
      n_silence          - number of silent states
      iter               - number of iterations
      excepname          - this variable acts as a flag; if set to one the script will accept the additional naming parameter
      title              - if excepname is set to one this variable will be added on to the name of the
                            output files
                         Table 11. A list of variables with corresponding explanation on how it is used.


It should be noted that this program cannot run straight through from generating features into model training and classification, because it does not have the necessary functionality to create the in-vocabulary and out-of-vocabulary list files. If those files have already been made then this script should work fine, but in most cases the script works best for running tests after features have been generated and the list files have been created.

XIV. ANALYSIS AND REMOVAL OF STATISTICAL ANOMALIES

Early analysis of the models built showed that while there might not be significant overlap of out-of-vocabulary and in-vocabulary scores, some cases appeared to have been significantly mis-scored. Graphs of scores showed a seemingly random distribution of in-vocabulary scores that had been scored as out-of-vocabulary words. These points were labeled as statistical anomalies, typically caused by Voice Activity Detector (VAD) error or, in some instances, by an unusual dialect caused by a foreign accent (e.g., Indian), a child's voice, or a combination of factors. A significant portion of time was spent analyzing and removing these anomalies through several methods to prevent model degradation. This analysis proved important in demonstrating the robustness of the Triple Scoring Method, and it assisted in the addition of features to the logic of the Voice Activity Detector, improving the quality of data given to the model trainer and therefore the accuracy of the trained models.

Analysis of statistical anomalies led to the realization of several problems, which were broken into four groups: bad data, VAD error, the incorrect handling of multiple utterances in one sound file, and strong non-American accents. The first words to show consistent problems were the words "voyager" and "onward". Analysis of the waveforms was at first misleading during the investigation. Early theories were that the recognizer was picking up on various parts of the waveform and that it was incorrectly scoring the words because they had somewhat similar waveforms/features.








                         Fig. 15 Example output of the DisplayUlaw tool, the word above is voyager




                                    Fig. 16. Example of DisplayUlaw output for the word "operator"

Both waveforms have similar properties, especially in that they show strong initial energy followed
by a gap where the energy drops to almost zero near the beginning of the word. Because of the
misleading nature of the waveforms (the recognizer works from the features extracted from the
waveforms, not from the raw waveforms themselves), direct analysis of waveforms was soon abandoned
in favor of direct audio confirmation and first-person analysis of the actual files.

Analysis of the statistical anomalies took several other forms as well, the most basic of which was
making sure that transcriptions of the corpus had been done correctly. This usually involved
listening to a file and ensuring that the wake-up word was in fact said in the sound file. A
variety of tools were developed to aid in the auditory analysis of files, most of which were
combined into one tool called DisplayUlaw, depicted below.




                                          Fig. 17. Example of the DisplayUlaw tool





The DisplayUlaw tool allowed analysis of individual features and let the user hear the sounds
associated with a specific feature. Previous tools had only allowed analysis of an entire .ulaw
file, which made analysis difficult because all scoring was done on the individual feature files,
of which a single .ulaw file might have several. DisplayUlaw, besides allowing analysis of specific
features, also drew the VAD triggering on top of the waveform of a .ulaw file. This allowed
analysts to look for potential VAD triggering errors.

Another tool provided at the beginning of the research program was the run_ewuw tool, which gave a
variety of information concerning the sound being analyzed, including VAD triggering, spectrogram
readings, waveform information, and cepstral information. The run_ewuw tool also had the advantage
of being able to analyze a single file, including generating its own features.

Bad data was the most basic of the problems. On several occasions people were incorrectly
transcribed as saying the wake-up word when in fact they had said another word. Transcription
errors were also found when a person did not say the wake-up word in the correct context; this is
understandable given the relative subjectiveness of determining what is and is not in the context
of a wake-up word. In instances where a transcription was called into doubt, researchers used their
own judgment to determine whether the transcription file warranted changing or whether the file had
to be removed altogether. Other instances of bad data included clipping, which occurred when people
were too close or too loud for their recording device, and cases where background noise hampered
the scoring because of its unusual nature. The current Wake-up-Word technology was designed to
adjust to background noise; however, in some instances a noise might occur during the utterance of
the wake-up-word phrase, hampering the scoring method and the noise reduction techniques used by
the current technology. Some instances were directly related to misspoken words; for example, one
sound file was found upon analysis to contain a distinctive 'F' sound at the beginning of the
wake-up word, in this case forming the word 'foperator'. Where transcription errors were found, the
transcriptions were simply fixed.

One of the most significant results of this analysis concerned the VAD. Throughout the analysis of
anomalies, many features were found to have been incorrectly picked up by the VAD: in some
instances the VAD triggered too early or too late and cut words off, and in more instances noise
was kept at the beginning or end of a word. These problems caused many of the anomalies, as only
half of a wake-up word might be recognized or, in some cases, a wake-up word might be twice as long
as it should be, mostly filled with random background noise. Another common problem was the VAD
triggering in the middle of a wake-up word and essentially cutting the word in half. The VAD was
also found to have triggered on beeps and clicks that were part of the background noise in the
sound file. Many of these problems were fixed by adjusting the VAD's logic and its internal
thresholds. Because of the many problems this analysis found with the current VAD, research was
begun into other ways of obtaining VAD triggering, including using Artificial Neural Networks and
Support Vector Machines to try to develop an improved system for detecting the start and end times
of a person's speech, though that application and development are outside the scope of this
discussion (for more information concerning VAD alternatives see Brandon Schmitt's contributions to
this technical report).




   Fig. 18. Example of the VAD triggering on a sound that is not a word, in this case a click right before the word was said
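
The VAD's actual logic is not reproduced in this report. As a rough illustration of the kind of
energy-threshold triggering whose internal thresholds were being adjusted, the Python sketch below
marks speech regions from short-time energy; the frame length, thresholds, and hangover count are
illustrative assumptions rather than the values used by the real detector.

    import numpy as np

    def simple_vad(samples, frame_len=160, on_thresh=0.02, off_thresh=0.01, hangover=5):
        """Toy energy-based VAD: returns (start_frame, end_frame) pairs of detected speech.

        Thresholds are relative to the loudest frame; the hangover keeps the detector
        on through short dips so that a word is not cut in half by a brief pause."""
        samples = np.asarray(samples, dtype=float)
        n_frames = len(samples) // frame_len
        if n_frames == 0:
            return []
        energy = np.array([np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)
                           for i in range(n_frames)])
        energy /= (energy.max() + 1e-12)          # normalize so thresholds are relative

        regions, active, start, quiet = [], False, 0, 0
        for i, e in enumerate(energy):
            if not active and e > on_thresh:      # rising edge: speech (or a click) begins
                active, start, quiet = True, i, 0
            elif active and e < off_thresh:
                quiet += 1
                if quiet > hangover:              # sustained silence: close the region
                    regions.append((start, i - quiet))
                    active = False
            elif active:
                quiet = 0
        if active:
            regions.append((start, n_frames - 1))
        return regions

A short click like the one in Fig. 18 satisfies a pure energy rule just as well as speech does,
which is one reason threshold adjustments alone were not sufficient and ANN/SVM-based endpoint
detection was investigated as an alternative.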

The last major problem noticed in the outlier files was what was referred to as utterance problems.
Utterance problems occurred when the VAD incorrectly triggered, either on noise or on some
background sound, and created more than one feature file for a recording that in fact contained
only one word. For out-of-vocabulary words these extra utterances made little difference in the
results; for the in-vocabulary words, however, they were devastating. While the method had been
designed to handle more than one utterance in a file, the actual list files (mostly made with DIR
commands at the Windows Command Prompt) would list both utterances from a file as in-vocabulary.
This led to many out-of-vocabulary sounds being introduced into the in-vocabulary data. A tool
called findutter was developed to remove the statistical anomalies caused by the utterance problem.
This tool went through the score file made by the hmm_reco program and identified all files with
more than one utterance (denoted by files ending with anything other than 00). The tool was applied
after model training and classification had taken place, so the model developed for that group of
files was already in place. This model was used to determine which files were outliers; even though
the model contains corrupt data, it is assumed to be fairly accurate and can act as a baseline for
the removal of the anomalies. Once these files were removed, the in-vocabulary list file was
rewritten, and model training and classification were run again with the new in-vocabulary list.
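
To make the procedure concrete, the sketch below shows one way a findutter-style filter could be
written in Python. The 'filename score' layout of the score file, the file extension handling, and
the exact '00' suffix convention are assumptions based only on the description above; the real tool
and its file formats may differ.

    def remove_extra_utterances(score_lines, inv_list):
        """Drop in-vocabulary entries whose base name does not end in '00', i.e. files
        that were segmented into more than one utterance by the VAD (assumed convention)."""
        suspect = set()
        for line in score_lines:
            if not line.strip():
                continue
            name = line.split()[0]            # assumed 'filename score ...' layout
            stem = name.rsplit('.', 1)[0]     # strip an extension if one is present
            if not stem.endswith('00'):
                suspect.add(name)
        return [name for name in inv_list if name not in suspect]

The surviving names would then be written back out as the new in-vocabulary list before model
training and classification are run a second time, as described above.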

Other anomalies were often caused by foreign (non-American) accents. Non-native speakers of English
tended not to follow the speech patterns of native English speakers, and words were often
misspoken, for example by not speaking the entire word or by mumbling in a hard-to-comprehend
manner. Much of our data set contained people with strong Indian accents. These speakers tended to
place a strong stress on the leading 'O' sound but lacked the stress that American English speakers
place on the 'Op' sound that makes up the first syllable. Both of these sounds were typically
followed by a strong drop in energy for a short period of time before the rest of the word,
'erator', was formed; but since the two distinct portions of energy contained slightly different
emphasis and phonemes, many Indian speakers were mislabeled as out-of-vocabulary words. Such data,
and data that was felt to have too strong an accent, was removed from the training data so as not
to have an ill effect on the models. This was done because the models being developed were
primarily designed to be used as speaker-independent models for American English speakers, so any
data construed as harmful toward that end in the training phase was removed, allowing the creation
of a model that was more accurate for that dialect. The effect of the Indian dialect could largely
have been counteracted if more data from Indian speakers had been available, allowing good models
for Indian speakers to be built.


                     Fig. 19. Scores before outlier removal

                     Fig. 20. Scores after outlier removal

 Table 12.1. Results prior to outlier removal (classification accuracy)
                      All                     Correct Rejection       Correct Acceptance
 Score 1 & Score 3    96.129%  (4172/4340)    99.1062% (3881/3916)    68.6321% (291/424)
 Score 1 & Score 2    97.788%  (4244/4340)    99.617%  (3901/3916)    80.8962% (343/424)
 Score 1, 2 & 3       98.2719% (4265/4340)    99.8212% (3909/3916)    83.9623% (356/424)

 Table 12.2. Results after outlier removal (classification accuracy)
                      All                     Correct Rejection       Correct Acceptance
 Score 1 & Score 3    99.0707% (4051/4089)    99.8114% (3704/3711)    91.7989% (347/378)
 Score 1 & Score 2    99.6087% (4073/4089)    99.9731% (3710/3711)    96.0317% (363/378)
 Score 1, 2 & 3       99.731%  (4078/4089)    100%     (3711/3711)    97.0899% (367/378)

This analysis proved to be vital to the improvement of the wake-up-word method being developed. It
also helped show the robustness of the system being implemented, as many of the anomalies were
sounds that did not form the wake-up word and therefore should have been listed as out-of-vocabulary
files. This shows the implementation's ability to handle imperfect modeling situations. In
conclusion, the removal of statistical anomalies resulted in a large improvement in the percentage
of correct classifications.

       XV. IMPLEMENTATION OF THE ENHANCED SUPER
         RESOLUTION PITCH DETECTION ALGORITHM

A. Introduction

   The enhanced Super Resolution Fundamental Frequency Detection (eSRFD) algorithm was implemented
so as to gain a further enhancement of the Triple Scoring Method of e-WUW. The eSRFD method was
chosen because of its performance in comparisons against several other pitch detection algorithms
[8]. It was hypothesized by this research group that Wake-up-Words (WUWs) have a different pitch
trajectory associated with them than the same words spoken in contexts other than as WUWs. It was
hoped that by adding pitch detection, Wake-Up Word recognition could become even more accurate and
robust. The method of pitch detection chosen relies solely on the periodic nature of the signal,
leaving noise analysis out [7]. Many other implementations of pitch detection algorithms had
problems with doubling and halving errors, errors caused by the easy mathematical division of one
periodic wave into a wave of twice or half its period. Surrounding noise and the application of
filtering also frequently distorted the periodic signal. By studying the periodic nature of a
signal, pitch could be detected with a fair amount of accuracy.

B. Literature Review

Before an algorithm was implemented, a review of the literature was performed to help determine the
best method for determining the pitch of the human voice. The most referenced and influential paper
in this endeavor was introduced by Peter Veprek and Michael Scordilis, of Panasonic Laboratories
and the University of Miami respectively; their paper contained a comparison of five different
pitch determination techniques [8]. The researchers compared simple inverse filter tracking,
spectrum decimation/accumulation, comb filter energy maximization, optimal temporal similarity, and
the dyadic wavelet transform. Of these five techniques, optimal temporal similarity was determined
to be the superior method, judged by testing each technique on equivalent speech data and recording
incorrect classification of voiced/unvoiced regions, pitch period insertions/deletions, and overall
inaccuracy. Because this method was superior to the other four modern methods, research was then
focused on the Super Resolution Fundamental Frequency Detector (SRFD) algorithm, a predecessor of
the eSRFD algorithm and an algorithm of the optimal temporal similarity type. Many methods are
based on either some form of periodic analysis or noise analysis; in this case the algorithm
completely removes the magnitude of the sound from consideration and relies solely on the analysis
of periodicity. A dissertation which included the implementation of the eSRFD (essentially the SRFD
formula with additional empirically deduced logic) compared it to several other pitch detection
algorithms and found it to be significantly better, with only one of the other algorithms having
comparable (though still worse) performance [5]. The literature on the eSRFD algorithm also implied
that it was not computationally intensive and that it handled noise fairly well.

C. Algorithm

As stated in the introduction, the pitch algorithm used in this study was the enhanced Super
Resolution Fundamental Frequency Detector (eSRFD) [5]. The algorithm relies on two cross-correlation
functions for scoring together with several empirically deduced weights, and it processes three
frames, each containing a varying number of samples:

    x_n = \{ x(i) = s(i - n) \mid i \in 1, \dots, n \}
    y_n = \{ y(i) = s(i)     \mid i \in 1, \dots, n \}
    z_n = \{ z(i) = s(i + n) \mid i \in 1, \dots, n \}

As the number of samples n shrinks, the frame represents larger and larger candidate frequencies.
The values from each frame are then scored using a normalized cross-correlation of the first and
second frames:

    p_{x,y}(n) = \frac{\sum_{j=1}^{\lfloor n/L \rfloor} x(jL)\, y(jL)}
                      {\sqrt{\sum_{j=1}^{\lfloor n/L \rfloor} x(jL)^2 \; \sum_{j=1}^{\lfloor n/L \rfloor} y(jL)^2}},
    \qquad p_{x,y}(n) = r_{x,y}(n) \ \text{if } L = 1,

    \{\, n = N_{\min} + iL,\ i \in 0, 1, \dots;\ N_{\min} \le n \le N_{\max} \,\}

Candidates with a significantly high score, in this case above a threshold between 0.75 and 0.88
depending on the previous frame, are then given another score by computing the cross-correlation
between the first and third frames. Candidates that are not above the threshold are labeled as
unvoiced and are not given a pitch. This method suffers from many of the same problems as other
pitch detection algorithms, notably doubling and halving errors. To reduce the number of errors,
the algorithm automatically doubles the first correlation value if it is close to the voiced value
preceding it. This is one of the weights that was added to the SRFD algorithm to fix the doubling
and halving errors that plagued it.



It does, however, have the side effect of producing unwanted 'voiced' regions. After the first two
cross-correlations are done, all candidates whose second correlation score is above the threshold
are kept and analyzed further. If no candidates have a second correlation score above the
threshold, further analysis is run only on candidates whose first correlation score exceeded the
threshold. The last value calculated is the q(n_m) coefficient, a cross-correlation value
determined by correlating two frames of the same size placed a variable distance apart:

    q(n_m) = \frac{\sum_{j=1}^{n_M} s(j)\, s(j + n_M + n_m)}
                  {\sqrt{\sum_{j=1}^{n_M} s(j)^2 \; \sum_{j=1}^{n_M} s(j + n_M + n_m)^2}}

It is assumed that the candidate with the lowest frequency is the fundamental frequency, and the
lowest frequency is therefore initially taken as the optimal value. If another frequency obtains a
q(n_m) value greater than 0.77 times that of the optimal value, it is then said to be the optimal
value. After this calculation is performed, the frame size determined to be optimal can be
converted back into a frequency.
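
To make the scoring concrete, the following Python sketch evaluates the normalized cross-correlation
p_{x,y}(n) over a range of candidate periods and keeps the lowest-frequency candidate that exceeds
the voicing threshold, as described above. It is a minimal sketch of the first pass only: the
decimation factor L, the fixed 0.75 threshold, and the search range are illustrative assumptions,
and the z_n correlation, the q(n_m) refinement, and the doubling weight are omitted.

    import numpy as np

    def first_pass_pitch(s, fs, f_min=50.0, f_max=400.0, L=4, threshold=0.75):
        """Score candidate periods n with p_xy(n) and return the pitch (Hz) of the
        lowest-frequency candidate above the threshold, or None if unvoiced."""
        s = np.asarray(s, dtype=float)
        n_min, n_max = int(fs / f_max), int(fs / f_min)
        voiced_periods = []
        for n in range(n_min, n_max + 1, L):
            if 2 * n > len(s):
                break
            x = s[0:n:L]                      # decimated x_n frame (preceding period)
            y = s[n:2 * n:L]                  # decimated y_n frame (reference period)
            denom = np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
            if denom > 0 and np.sum(x * y) / denom > threshold:
                voiced_periods.append(n)
        if not voiced_periods:
            return None                       # labeled unvoiced, no pitch assigned
        return fs / max(voiced_periods)       # lowest frequency taken as the fundamental

For a sustained vowel sampled at 8 kHz this returns a value near the speaker's fundamental
frequency; the q(n_m) comparison described above would then arbitrate between this value and
candidates at roughly half or double its period.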




D. Pre-processing Techniques

Several pre-processing techniques were implemented to ensure low noise and to improve the quality
of the results. The first is a simple low-pass filter. Since the human voice has a fairly small
fundamental-frequency range, it is expected that no voiced values have a frequency above 400 Hz.
Low-pass filtering has the advantage of removing noise, but the disadvantage that the method used
creates harmful artifacts: the low-pass filter, as implemented in MATLAB, tends to take noise and
give it a very periodic appearance. This can be counteracted by using a filter that removes all
sounds below a certain magnitude. Because the noise has such a low magnitude, this can usually be
done without damaging the quality of the voiced regions. The data used for development was recorded
in an environment that was much cleaner than the environments typically encountered in real-world
speech recognition applications. Because of this, real-world files were obtained using a desktop
microphone in an environment that simulated real-world noise. These environments initially proved
particularly hard for the algorithm to handle, and an aggressive method of filtering was applied,
as can be seen below in a comparison of filtered and unfiltered figures.


                              Fig. 21. Example of waveform before filtering

                        Fig. 22. A sound waveform after aggressive noise removal.
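
A rough sketch of the two-stage pre-processing described above is given below, using a Butterworth
low-pass filter followed by a simple magnitude gate. The filter order and the gate level are
illustrative assumptions; only the 400 Hz cutoff comes from the text, and the original
pre-processing was done in MATLAB rather than Python.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def preprocess(samples, fs, cutoff_hz=400.0, gate_ratio=0.05, order=4):
        """Low-pass filter the signal, then zero out low-magnitude samples so that
        filtered background noise is not mistaken for a periodic (voiced) signal."""
        b, a = butter(order, cutoff_hz / (fs / 2.0), btype='low')
        filtered = filtfilt(b, a, np.asarray(samples, dtype=float))
        gate = gate_ratio * np.max(np.abs(filtered))   # threshold relative to peak level
        filtered[np.abs(filtered) < gate] = 0.0        # crude magnitude gate; causes clipping
        return filtered

As the figures suggest, the gate is what introduces the clipping discussed next; its aggressiveness
trades artifact suppression against distortion of the voiced regions.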



The filtered signal has a problem that occurs in much signal analysis: when it is filtered
aggressively, as in the above example, a fair amount of clipping occurs. Despite this clipping,
much of the periodicity of the human speech is maintained, and the effect on the final results
appears negligible.

E. Results

So as to ensure that our results were comparable to those of the original implementers of the
eSRFD, we obtained results from the implementation performed by Paul Bagshaw [5], who developed the
enhanced version of the SRFD algorithm. As the graphs below show, the implementation done during
this research is comparable to the results from the original implementation:
                                                                   implementation:




Fig. 33. The top graph shows the pitch output together with the waveform; the middle graph is the
                 FIT implementation's output, and the bottom graph is the output from the original eSRFD implementation.

F. Future Work

   In the future it is planned to have a C++ implementation of the pitch detector designed at FIT.
This will allow future e-WUW programs to read in pitch and analyze the pitch contour. If wake-up
words do in fact have a different pitch contour, then pitch detection can be used to increase the
overall accuracy of the method, perhaps as a new score or by some method of weighting words.

G. Conclusion

   In conclusion, the FIT implementation of the eSRFD algorithm has yet to mature enough to give
definitive results. Despite the immaturity of our technology, the eSRFD has the ability to add
increased accuracy to the Triple Scoring Method by mimicking and modeling the pitch contours of
human speech.





                                                                            XVI. RESULTS


 Corpora      Measure               HTK        Microsoft SDK 5.1   e-WUW      Relative Error Rate Reduction
                                                                              e-WUW vs HTK    e-WUW vs SDK
 InVoc.       Correct Acceptance    81.68%     86.98%              99.30%     2517.14%        1760.00%
              False Rejection       18.32%     13.02%              0.70%      2517.14%        1760.00%
 PhoneBook    Correct Rejection     81.09%     87.56%              98.78%     1449.75%        919.67%
              False Acceptance      18.91%     12.44%              1.22%      1449.75%        919.67%
 CallHome     Correct Rejection     93.91%     99.44%              99.96%     15166.15%       1303.78%
              False Acceptance      6.09%      0.56%               0.04%      15166.15%       1303.78%
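
The "Relative Error Rate Reduction" column appears consistent with comparing each baseline's error
rate against the e-WUW error rate, as the short sketch below illustrates; this formula is our
reading of the table rather than one stated explicitly in the text, and small discrepancies are
presumably due to rounding of the displayed percentages.

    def relative_error_reduction(baseline_error_pct, ewuw_error_pct):
        """Relative error rate reduction (in percent) of e-WUW versus a baseline recognizer."""
        return (baseline_error_pct - ewuw_error_pct) / ewuw_error_pct * 100.0

    # In-vocabulary false rejection, e-WUW vs HTK: (18.32 - 0.70) / 0.70 * 100 = 2517.14%
    print(relative_error_reduction(18.32, 0.70))
    # CallHome false acceptance, e-WUW vs SDK: (0.56 - 0.04) / 0.04 * 100 = 1300% (table: 1303.78%)
    print(relative_error_reduction(0.56, 0.04))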

Fig. 34. Score distribution for CCW17+WUWII in-vocabulary data versus PhoneBook out-of-vocabulary
data (percent vs. score, with OOV and INV percentage curves). Operating point: acceptance error
0.348%, rejection error 1.222%, equal error rate 0.8%.

Fig. 35. Score distribution for CCW17+WUWII in-vocabulary data versus Callhome out-of-vocabulary
data (percent vs. score, with OOV and INV percentage curves). Operating point: acceptance error
0.0584%, rejection error 1.396%, equal error rate 1.047%.
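
The operating points and equal error rates annotated in Figs. 34 and 35 come from sweeping a
decision threshold over the in-vocabulary and out-of-vocabulary score distributions. The sketch
below shows one common way such an equal error rate can be estimated from two score lists; it is a
generic illustration, not the code used to produce the figures.

    import numpy as np

    def equal_error_rate(inv_scores, oov_scores):
        """Estimate the equal error rate (percent): scores >= threshold are accepted
        as the wake-up word, and the threshold is swept over all observed scores."""
        inv, oov = np.asarray(inv_scores, float), np.asarray(oov_scores, float)
        best_gap, eer = np.inf, 100.0
        for t in np.unique(np.concatenate([inv, oov])):
            frr = np.mean(inv < t) * 100.0      # in-vocabulary utterances falsely rejected
            far = np.mean(oov >= t) * 100.0     # out-of-vocabulary utterances falsely accepted
            if abs(frr - far) < best_gap:
                best_gap, eer = abs(frr - far), (frr + far) / 2.0
        return eer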


A. Discussion of Results

   The above results show a significant improvement over the results of the Microsoft SDK and the
HTK speech recognition packages. While neither package is as uniquely designed for this
application, the results still show a momentous increase in accuracy. E-WUW outperforms both
recognition packages in all of the categories examined for this application.

                          XVII. SUMMARY OF RESULTS

In summary, the e-WUW recognition package has been shown to be a superior system when compared to
other recognizers at the task of single word recognition. As an example of an academic, freely
available recognizer, HTK was used to represent a solid baseline, under the assumption that it is
the most mature of the available technologies. HTK, using parameters the group felt were the best
that could be achieved with their knowledge of the system, was able to obtain equal error rates of
7.2% using the Callhome corpus and 18.3% using the Phonebook corpus. Both of these values are
significantly higher than the values obtained by the e-WUW package, which had an equal error rate
of 0.6% for the Phonebook corpus and 1.047% for the Callhome corpus. Readers should be aware that
this comparison took place using identical list files, so as to give HTK every advantage that e-WUW
received. These results represent an astounding 15166.15% relative improvement over the HTK
package. The second comparison took place using the Microsoft Speech Recognition package, which
claims to be the most popular commercial package available. In these comparisons Microsoft obtained
87.56% for the Phonebook corpus and 99.44% for the Callhome corpus in the category of correct
rejection. This compares to e-WUW's correct rejection scores of 98.78% and 99.96%. Correct rejection
represents the number of words that are not in-vocabulary that the system correctly recognized as
out-of-vocabulary utterances. In this category the results for the e-WUW package show a 1303.78%
improvement over the Microsoft Speech Recognition software.

   It should be stressed that every attempt was made to compare fairly against the HTK and
Microsoft packages; no attempt was made to skew results in favor of the e-WUW method. The
researchers do realize that the methods discussed in this paper have the distinct advantage of
being designed specifically for this task. It is not implied that these results show an inadequacy
of either package for the tasks they were designed for, only the vast improvement the e-WUW package
makes in the field of Wake-Up-Word recognition.

                            XVIII. REFERENCES

[1] Shafranovich, Y. October 2005. Common Format and MIME Type for Comma-Separated Values (CSV)
Files. http://www.faqs.org/rfcs/rfc4180.html (accessed July 13, 2007).

[2] Gillick, L. and Cox, S.J., "Some statistical issues in the comparison of speech recognition
algorithms," Acoustics, Speech, and Signal Processing, vol. 1, pp. 532-535, May 1989.

[3] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector
Classification. Taipei 106, Taiwan: National Taiwan University.
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf (accessed July 13, 2007).

[4] Young, Steve, Gunnar Evermann, and Mark Gales, eds. 2006. The HTK Book. Cambridge University
Engineering Department.

[5] Bagshaw, Paul. Automatic Prosodic Analysis for Computer Aided Pronunciation Teaching. PhD
Dissertation, Edinburgh: The University of Edinburgh, 1994.

[6] Medan, Yoav, Eyal Yair, and Dan Chazan. "Super Resolution Pitch Determination of Speech
Signals." IEEE Transactions on Signal Processing 39, no. 1 (1991): 40-48.

[7] Seneff, Stephanie. "Real-Time Harmonic Pitch Detector." IEEE Transactions on Acoustics, Speech,
and Signal Processing ASSP-26, no. 4 (1978): 358-365.

[8] Veprek, Peter, and Michael Scordilis. "Analysis, enhancement and evaluation of five pitch
determination techniques." Speech Communication, 2001: 249-270.



