

Neural Network Based Isolated Word Recognition Improvement by Syllable Detection

Ján Olajec, Roman Jarina
Department of Telecommunications, Faculty of Electrical Engineering,
University of Žilina, Univerzitná 1, 010 26 Žilina, Slovak Republic

        The paper deals with a possible improvement of the recognition results of an Automatic
Speech Recognition (ASR) system. The system uses 2-D cepstral analysis for speech feature
extraction and a neural network (NN) for classification. In the first part of the article, the ASR
system is briefly described. In the second part, the proposed improvement, based on syllable
detection, and experiments on the ASR system are presented. The syllable detection is based
on the temporal contour of short-time energy; the number of syllables in an utterance is
detected by an additional NN.

Keywords: ASR, neural networks, perceptron, Slovak digits, speech recognition, syllable
detection, 2-D cepstrum

Current ASR systems fall into two primary groups: systems for recognition of continuous
speech and systems for recognition of isolated words. This article focuses on ASR systems
for isolated word recognition. Such systems are simpler to design and develop than systems
that work with continuous speech, which is why they are mostly used in current embedded
devices, such as mobile phones, PDAs, etc. In these devices, they can be used in applications
such as dialling a phone number by voice or controlling a menu in a software application.
Isolated word recognition systems can also be very helpful for the handicapped (turning a
light or radio on/off, etc.). If such a system is to be used in practice, it has to be easy to build
and computationally inexpensive.
        The oldest ASR applications used the DTW algorithm. Systems working with HMMs
and NNs achieve better word recognition results, but they are computationally demanding.
Some restrictions on the HMM or NN (for example: fewer feature observations, fewer HMM
states, a simplified NN architecture) can decrease computational and memory requirements
[1, 2]. The performance of simple isolated word recognition methods can be further improved
by counting the syllables in the words being recognized: the output classes are divided into
groups according to the number of syllables in the word belonging to each class. This
procedure is depicted in Figure 1.

“3TDCM – NN” Model of Isolated Word Recognition
In this paper, an ASR system for recognition of isolated words is described. Feature
extraction from the acoustic signal is performed by 2-D cepstral analysis (Two-Dimensional
Cepstrum = TDC) [3]. Dealing with the variable time length of speech units is still a problem
for ASR models. At present, most advanced ASR systems model the time length of speech
units by a finite-state HMM [4]. Here we use an alternative method, in which variations in
the time length of words are eliminated by describing an entire word by only three
overlapping 2-D cepstral matrices (3TDCM) of constant size [5].
        The block scheme of the proposed ASR system is depicted in Figure 2. The recognizer
is composed of: the ASR front-end, which pre-processes the acoustic signal and extracts the
feature vector for the NN; and the NN, which classifies feature patterns into classes. The
number of classes is identical to the number of words being recognized.


Figure 1. a) Classification into N classes. b) Classification into sections according to the
number of syllables.

Figure 2. Block diagram of the ASR: s(n) → ASR pre-processing → one feature pattern per
word → NN classification.

        In the ASR front-end, the speech signal s(n) is converted into cepstral vectors. The
cepstral vectors are then transformed into three two-dimensional cepstral matrices (3TDCM)
[5]. The advantages of this method are ease of computation and a small number of elements
in the feature pattern compared to conventional methods such as the short-time cepstrum with
its time derivatives (known as delta and delta-delta cepstrum). The TDC is obtained by
applying a 2-D cosine transform to a sequence of log filter-bank energy vectors. The same
result can also be obtained by applying a 1-D cosine transform to a block of MFCC vectors
along the time dimension. A detailed procedure for TDC computation is given in [5]. The
feature vector for each pattern is fed into the NN for classification. The NN is a multilayer
perceptron trained with backpropagation (BP) [6].
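The TDC computation described above can be sketched as follows. This is a minimal illustration, not the implementation from [5]: the block layout (three 50 %-overlapping blocks) and the number of retained time coefficients are assumptions made here for concreteness.

```python
import numpy as np
from scipy.fftpack import dct

def three_tdcm(mfcc, n_time_coeffs=4):
    """Map a variable-length MFCC sequence (frames x coefficients) onto
    three overlapping blocks, then apply a 1-D DCT along the time axis of
    each block, so every word yields a feature pattern of constant size."""
    n = mfcc.shape[0]
    half = n // 2
    # three blocks of equal length with 50 % overlap covering the utterance
    starts = [0, n // 4, half]
    feats = []
    for s in starts:
        block = mfcc[s:s + half]
        # DCT over time; keep only the lowest few time coefficients
        c = dct(block, axis=0, norm='ortho')[:n_time_coeffs]
        feats.append(c.ravel())
    return np.concatenate(feats)

# example: a 40-frame utterance with 12 MFCCs per frame
x = np.random.default_rng(0).standard_normal((40, 12))
f = three_tdcm(x)  # fixed-size pattern regardless of utterance length
```

The point of the sketch is the fixed output size: a 40-frame and a 60-frame word both map to a vector of 3 blocks × 4 time coefficients × 12 cepstral coefficients.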

Syllable detection
In Figure 3, the ASR system is extended with a syllable detection block. This block detects
the number of syllables in the word at the input of the ASR system. Information about the
number of syllables is supplied to the classifier and divides the classes into subgroups. In the
case of Slovak digits, there are the following two groups: 1) monosyllabic words, and 2)
disyllabic words.
        The syllable detection block consists of a neural net (NN) and a block ES that pre-
processes the input signal s(n), as shown in Figure 4. The task of the NN is to classify
patterns into classes according to the number of detected syllables. These classes represent
the number of syllables in the recognized words, so monosyllables and disyllables are
separated. The features are based on the energy contour of the speech signal, which is
calculated in the block ES (energy of signal).
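The ES stage and syllable counting can be sketched as follows. Note this is an illustration only: the paper feeds the energy contour to a second NN, whereas the sketch uses a fixed threshold rule, and the frame length, hop, and threshold values are assumptions.

```python
import numpy as np

def syllable_count_estimate(s, frame=200, hop=100, rel_thresh=0.3):
    """Estimate the number of syllables in s(n) from its short-time
    energy contour: frame-wise sum of s(n)^2, then count contiguous
    regions where the normalized energy exceeds a threshold."""
    n_frames = 1 + max(0, len(s) - frame) // hop
    e = np.array([np.sum(s[i * hop:i * hop + frame] ** 2)
                  for i in range(n_frames)])
    e = e / e.max()          # normalize the energy contour
    peaks, above = 0, False
    for v in e:
        if v >= rel_thresh and not above:
            peaks += 1       # rising edge: a new energy burst begins
            above = True
        elif v < rel_thresh:
            above = False
    return peaks
```

At fs = 8 kHz the assumed frame of 200 samples corresponds to 25 ms, a common choice for short-time energy analysis; the NN in the paper replaces the hand-set threshold with a learned decision between one and two syllables.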

Figure 3. Block diagram of the ASR model with syllable detection: s(n) → ASR pre-
processing → one feature pattern per word → sub-classification into classes, guided by the
detected number of syllables.

Figure 4. Block diagram of syllable detection: s(n) → ES ([s(n)]²) → NN classification →
number of syllables.

Speech Database description

        The database contains twelve isolated words. They represent the Slovak digits {jeden,
jedna, dva, dve, tri, štyri, päť, šesť, sedem, osem, deväť, nula}. They were recorded with
sampling frequency fs = 8 kHz and 8 bit/sample resolution. The training part contains
different speakers than the testing part. The speech corpus is summarized in Table 1.

                       Table 1. Training and test speech database (fs = 8 kHz, 8 bit)

  Database    No. of patterns per word    No. of speakers    Total no. of patterns
  Training        40 × 4 = 160                  40              160 × 12 = 1920
  Testing         21 × 4 = 84                   21               84 × 12 = 1008

Neural network setup and training
        The perceptron is the main element of a NN [6]. The perceptron has a sigmoidal non-
linear logistic function. The architecture of the NN is defined by [h1 h2 o], where h1, h2, and
o are the numbers of neurons in the first hidden layer, the second hidden layer, and the output
layer, respectively. The number of neurons in the output layer equals the number of classes.
Such an architecture is referred to as a 3-layer MLP (multilayer perceptron), or a 4-layer MLP
if the input layer is also taken into account. The number of neurons is set according to [2].
The NN was trained with the Matlab algorithm "learngdm" (gradient descent with momentum
weight and bias learning function).
        Two NNs are used in the proposed system. The first neural network (NNW) classifies
into twelve classes ("jeden", "jedna", "dva", … "deväť", "nula"). Note that there are only 10
semantic classes, so during testing the NNW outputs representing the same digit are merged.
The second neural network (NNS) classifies the speech into two classes, which represent the
number of syllables in the word.
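A forward pass through such an [h1 h2 o] logistic MLP can be sketched as follows. The input dimension (144) and the random weights are placeholders for illustration only, since the trained weights and the exact input size are not stated here.

```python
import numpy as np

def sigmoid(x):
    """Logistic activation used in every layer of the perceptron."""
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, weights, biases):
    """Forward pass of a multilayer perceptron: each layer computes
    sigmoid(W @ a + b), matching the [h1 h2 o] architecture."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# NNW-style network [96 67 12]: two hidden layers, 12 output neurons
rng = np.random.default_rng(0)
sizes = [144, 96, 67, 12]          # 144-dim input is an assumption
weights = [rng.standard_normal((o, i)) * 0.1
           for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]
y = mlp_forward(rng.standard_normal(144), weights, biases)
```

The NNS variant [70 30 2] differs only in the layer sizes; training (gradient descent with momentum, as in Matlab's "learngdm") would adjust the weights via backpropagation of the output error.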

                Table 2. Division of classes into monosyllabic and disyllabic subgroups

  Monosyllabic:  dva, dve, tri, päť, šesť
  Disyllabic:    jeden, jedna, štyri, sedem, osem, deväť, nula

The ASR system was tested in Matlab 7. The neural network for syllable detection (NNS,
architecture [70 30 2]) was trained to an accuracy of 96.23 % and then received no further
training. The NNW (architecture [96 67 12]) was trained gradually with a decreasing maximal
training error. The training was divided into 30 periods, and after each period the NNW was
tested on the test set of the database. Recognition results over the course of training are
shown in Table 3. The table displays word recognition rates with syllable detection.
        Obviously, the ASR system would work with maximal recognition performance if the
NNS made no errors in syllable detection (RRS = 100 %). This is the upper limit of
performance of such an ASR system with syllable detection. The recognition accuracy results
are given in Table 3 and Figure 5.

Table 3. Word recognition rates for ASR with one NN (NNW) and ASR with syllable detection (NNW
+ NNS). goal – NNW training goal; NNW – neural network classifying into 10 classes {Slovak digits};
NNS – neural network classifying into 2 classes {monosyllables, disyllables} with recognition rate
                          RRS = 96 % or RRS = 100 % (theoretical limit)

Period               1      2      3      4      5      6      7      8      9     10     11     12     13
Goal for NNW     0.361  0.215  0.109  0.068  0.040  0.031  0.024  0.020  0.017  0.013  0.011  0.010  0.007
Only NNW         16.67   8.63  14.68  47.12  67.76  74.80  82.34  85.52  86.71  86.90  86.51  88.79  89.19
NNW+NNS, 96 %    25.00  16.77  27.78  58.63  73.71  81.35  86.71  88.39  89.58  88.29  89.09  90.77  90.18
NNW+NNS, 100 %   26.19  16.77  29.37  61.61  75.79  84.82  89.38  92.06  93.06  91.37  92.66  94.05  93.75

Period              14     15     16     17     18     19     20     21     22     23     24     25     26
Goal for NNW    0.0062 0.0052 0.0049 0.0037 0.0035 0.0026 0.0022 0.0019 0.0015 0.0013 0.0011 0.0010 0.0008
Only NNW         90.18  89.29  90.97  90.77  90.58  90.87  91.47  91.96  91.17  91.87  91.57  91.07  91.37
NNW+NNS, 96 %    90.77  90.58  90.87  91.37  90.48  91.67  91.77  92.46  91.87  91.87  92.06  91.87  92.26
NNW+NNS, 100 %   94.54  94.15  94.54  95.14  94.15  95.44  95.54  96.13  95.54  95.63  95.83  95.63  95.83

Figure 5. Recognition rates from Table 3 as a function of training period: ASR with NNW
into 12 classes; ASR with syllable detection, NNS: RRS = 100 %; ASR with syllable
detection, NNS: RRS = 96 %.


Figure 6. Detail of the recognition rates [%] from Table 3 (training periods 8–26): NNW into
12 classes; NNS: RRS = 100 % (theoretical limit); NNS: RRS = 96 %.

Figure 7. Recognition performance contributions when NNS (syllable detection) is used, per
training period: delta real = RR(NNW+NNS, RRS = 96 %) − RR(NNW);
delta 100 = RR(NNW+NNS, RRS = 100 %) − RR(NNW).
In this paper, an experiment on a neural network based ASR system with syllable detection is
presented. The results of experiments on Slovak digit recognition are shown in Table 3 and
Figures 5–7. The best classification result with the NN was reached after 21 training periods
(Table 3). The neural network that classifies words into 12 classes has a recognition accuracy
of RR = 91.96 %. (Note that there are only 10 semantic classes, so during testing the NNW
outputs representing the same digit are merged.) If syllable detection with detection accuracy
RRS = 96.23 % is added to the system, the word recognition accuracy increases to
RRW+S = 92.46 %.
        It is also shown that if the syllable detection block had 100 % accuracy, the
recognition rate of the proposed system would increase further, up to 96.13 %. This means
the maximal increment of the recognition rate would be about 4.17 % if perfect syllable
detection were incorporated into this ASR system.
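These increments follow directly from the period-21 values in Table 3:

```python
# Recognition-rate increments at the best training period (21) in Table 3
rr_nnw   = 91.96   # NNW alone
rr_real  = 92.46   # NNW + NNS with RRs = 96.23 %
rr_limit = 96.13   # NNW + NNS with RRs = 100 % (theoretical limit)

delta_real = round(rr_real - rr_nnw, 2)   # gain from the real detector
delta_100  = round(rr_limit - rr_nnw, 2)  # maximal attainable gain
```

So the real detector contributes 0.50 percentage points at this period, while a perfect detector would contribute the 4.17 points quoted above.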
        Interestingly, during training the ASR system with syllable detection may perform
worse than the system without it in some training periods; this is seen in the 16th and 18th
training periods (Figure 5).

[1] Jarina, R., Kuba, M.: Speech Recognition Using Hidden Markov Model with Low
    Redundancy in the Observation Space, Komunikacie, vol. 4, 2004, pp. 15-19.
[2] Olajec, J., Jarina R.: An experiment in isolated digit recognition by neural network,
    Transcom 2005, Proceedings, Section 3, June 2005, Zilina, Slovak Republic, ISBN 80-
    8070-415-5, pp 187-190.
[3] Ariki, Y., Mizuta, S., Nagata, M., Sakai, T.: Spoken-word recognition using dynamic
    features analyzed by two-dimensional cepstrum, IEE Proceedings, vol.136, 1989.
[4] O’Shaughnessy, D.: Interacting with Computers by Voice: Automatic Speech Recognition
    and Synthesis, Proceedings of the IEEE, vol. 91, no. 9, 2003, pp. 1272-1305.
[5] Jarina, R.: Kepstrálno-spektrálny model pre rozpoznávanie rečových signálov (Cepstral-
    spectral model for speech signal recognition, in Slovak), doctoral thesis, Department of
    Telecommunications, University of Žilina, Nov. 1999.
[6] Haykin, S.: Neural Networks: A Comprehensive Foundation, McMaster University,
    Hamilton, Ontario, Canada, 1994, ISBN 0-02-352761-7.
