Neural Network Based Isolated Word Recognition Improvement by
Syllable Detection
Ján Olajec, Roman Jarina
Department of Telecommunications, Faculty of Electrical Engineering
University of Žilina, Univerzitná 1, 010 26 Žilina, Slovak Republic,
E-mail: olajec@fel.utc.sk, jarina@fel.utc.sk
Abstract
This paper examines a way to improve the recognition results of an Automatic
Speech Recognition (ASR) system. The system uses 2-D cepstral analysis for speech feature
extraction and a neural network (NN) for classification. The first part of the article briefly
describes the ASR system. The second part describes the proposed improvement, based on
syllable detection, and the experiments on the ASR system. Syllable detection is
based on the temporal contour of short-time energy. The number of syllables in an utterance is
detected by an additional NN.
Keywords: ASR, neural networks, perceptron, Slovak digits, speech recognition, syllable
detection, 2-D cepstrum
Introduction
Current ASR systems fall into two primary groups: systems for recognition of
continuous speech and systems for recognition of isolated words. This article focuses on
an ASR system for isolated word recognition. Such systems are simpler to design and develop
than systems that work with continuous speech, which is why they are mostly used in
current embedded devices such as mobile phones and PDAs. In these devices, they can be
used for applications such as dialling a phone number by voice or controlling menus in software
applications. Isolated word recognition systems can also be very helpful for handicapped people
(turning a light or radio on and off, etc.). If such a system is to be used in practice, it has to
be easy to build and computationally inexpensive.
The oldest ASR applications used the DTW (dynamic time warping) algorithm. Systems
working with HMMs (hidden Markov models) and NNs achieve better word recognition results,
but they are computationally demanding. Some restrictions in the HMM or NN (for example,
fewer feature observations, fewer HMM states, or a simplified NN architecture) can decrease
the computational and memory requirements [1, 2]. The performance of simple isolated word
recognition methods can be further improved by counting the syllables in the words being
recognized. The output classes are then divided into groups according to the number of
syllables in the word belonging to the given class. This procedure is depicted in Figure 1.
“3TDCM – NN” Model of Isolated Word Recognition
This paper describes an ASR system for recognition of isolated words. Feature
extraction from the acoustic signal is performed by 2-D cepstral analysis (Two-Dimensional
Cepstrum = TDC) [3]. Dealing with the variable time length of speech units is still a problem
for ASR models. At present, most advanced ASR systems model the time length of speech units
with a finite-state HMM [4]. Here we use an alternative method, in which variations in the time
length of words are eliminated by describing an entire word by only three overlapping 2-D
cepstral matrices (3TDCM) of constant size [5].
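The exact segmentation used for the 3TDCM is defined in [5] and is not reproduced here. As a rough illustration only, the sketch below cuts a variable-length sequence of frames into three equally long, overlapping segments that together span the whole word. The function name, the 50 % overlap, and the rounding of segment boundaries are assumptions made for this example.

```python
import numpy as np

def three_overlapping_segments(frames, overlap=0.5):
    """Split a (T, D) frame sequence into three equally long, overlapping segments.

    `overlap` is the assumed fraction of a segment shared with its neighbour.
    """
    n_frames = frames.shape[0]
    # Three segments of length L with overlap r cover L * (3 - 2r) frames in total.
    seg_len = min(n_frames, int(np.ceil(n_frames / (3 - 2 * overlap))))
    # Place the first segment at the start and the last one at the end of the word.
    hop = (n_frames - seg_len) / 2
    starts = [int(round(k * hop)) for k in range(3)]
    return [frames[s:s + seg_len] for s in starts]

# Example: a word with 37 frames of 16 log filter-bank energies each (synthetic data).
word = np.arange(37 * 16, dtype=float).reshape(37, 16)
segments = three_overlapping_segments(word)
```

Each segment still has a length that depends on the word duration; a constant feature size is obtained only after the cepstral transform, for instance by keeping a fixed number of low-order coefficients.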
The block scheme of the proposed ASR system is depicted in Figure 2. The recognizer is
composed of an ASR front-end, which pre-processes the acoustic signal and extracts the
features for the NN (the feature vector), and the NN, which classifies feature patterns into
classes. The number of classes is identical to the number of words being recognized.
Figure 1. a) Classification into N classes. b) Classification into subgroups (by number of syllables).
Figure 2. Block diagram of the ASR system: s(n) -> ASR pre-processing -> features per word -> NN classification -> classes.
In the ASR front-end, the speech signal s(n) is converted into cepstral vectors. The cepstral
vectors are then transformed into three two-dimensional cepstral matrices (3TDCM) [5].
The advantages of this method are ease of computation and a small number of elements in the
feature pattern compared to conventional methods such as the short-time cepstrum with time
derivatives (known as delta and delta-delta cepstrum). The TDC is obtained by applying a 2-D
cosine transform to sequences of log filter-bank energy vectors. The same result can also be
obtained by applying a 1-D cosine transform to a block of MFCC vectors along the time
dimension. A detailed procedure for the TDC computation is given in [5]. The feature vector of
each pattern is fed into the NN for classification. The NN is a multilayer perceptron
trained with backpropagation (BP) [6].
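The equivalence of the two routes to the TDC can be checked numerically. The following is a minimal sketch, not the procedure from [5]: it uses an unnormalized DCT-II as the cosine transform, synthetic log filter-bank energies, and an arbitrary truncation to 4 x 8 low-order coefficients. All of these choices, and the function names, are assumptions for illustration.

```python
import numpy as np

def dct_matrix(n):
    """Unnormalized DCT-II matrix C with C[k, m] = cos(pi * k * (2m + 1) / (2n))."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n))

def tdc(log_energies, n_time=4, n_quef=8):
    """2-D cosine transform of a (frames, bands) block, truncated to a fixed size."""
    n_frames, n_bands = log_energies.shape
    full = dct_matrix(n_frames) @ log_energies @ dct_matrix(n_bands).T
    return full[:n_time, :n_quef]

rng = np.random.default_rng(0)
# Synthetic block: 20 frames of 16 log filter-bank energies.
log_e = np.log(rng.uniform(0.1, 1.0, size=(20, 16)))

# Route 1: 2-D cosine transform applied directly to the log energies.
tdc_direct = tdc(log_e)

# Route 2: frame-wise cepstra first (cosine transform across bands),
# then a 1-D cosine transform along the time dimension.
cepstra = log_e @ dct_matrix(16).T
tdc_via_cepstra = (dct_matrix(20) @ cepstra)[:4, :8]

print(np.allclose(tdc_direct, tdc_via_cepstra))
```

Because the 2-D transform is separable, both routes give the same matrix. Truncating to a fixed number of low-order coefficients is one common way to get a feature pattern whose size does not depend on the number of frames.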
Syllable detection
In Figure 3, the ASR system is extended by a syllable detection block. The syllable
detection block detects the number of syllables in the word at the input of the ASR system.
Information about the number of syllables is supplied to the classifier, where it
divides the classes into subgroups. In the case of Slovak digits, there are
two groups: 1) monosyllabic words, and 2) disyllabic words.
The syllable detection block consists of a pre-processing block ES (energy of signal),
applied to the input signal s(n), and a neural network NN, as shown in Figure 4. The task of
the NN is to classify patterns into classes according to the number of detected syllables.
These classes represent the number of syllables in the recognized words, so that
monosyllables and disyllables are separated. The features are based on the energy contour
of the speech signal, which is calculated in the ES block.
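A short-time energy contour of the kind computed in the ES block can be sketched as follows. The paper does not give the frame length, the hop, or the way the contour is turned into a fixed-size NN input, so the 20 ms frames, 10 ms hop, resampling to 32 points, and peak normalization are assumptions, as are the function names.

```python
import numpy as np

def energy_contour(signal, frame_len=160, hop=80):
    """Short-time energy: sum of squared samples in each frame."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.array([
        np.sum(signal[i * hop:i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])

def fixed_length_contour(contour, n_points=32):
    """Resample the contour to n_points by linear interpolation and scale its peak to 1."""
    x_old = np.linspace(0.0, 1.0, len(contour))
    x_new = np.linspace(0.0, 1.0, n_points)
    resampled = np.interp(x_new, x_old, contour)
    peak = resampled.max()
    return resampled / peak if peak > 0 else resampled

# Synthetic "disyllabic" signal at fs = 8 kHz: two energy bursts separated by a dip.
fs = 8000
t = np.arange(fs) / fs
envelope = np.exp(-((t - 0.3) / 0.08) ** 2) + np.exp(-((t - 0.7) / 0.08) ** 2)
signal = envelope * np.sin(2 * np.pi * 200 * t)

contour = energy_contour(signal)
features = fixed_length_contour(contour)
```

For this synthetic signal the contour shows two peaks with a clear dip between them, which is the kind of cue a classifier can use to separate monosyllables from disyllables.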
Figure 3. Block diagram of ASR model with syllable detection: s(n) -> ASR pre-processing -> features per word -> NN sub-classification -> classes; the syllable detection block selects subgroup 1 or 2.
Figure 4. Block diagram of syllable detection: s(n) -> ES, [s(n)]^2 -> NN classification -> No. of syllables.
Speech Database description
The database contains twelve isolated words representing Slovak digits {jeden,
jedna, dva, dve, tri, štyri, päť, šesť, sedem, osem, deväť, nula}. They are recorded with a
sampling frequency fs = 8 kHz and a resolution of 8 bits per sample. The training part contains
different speakers than the testing part. The speech corpus is summarized in Table 1.
Table 1. Training and test speech database (fs = 8 kHz, 8 bit)
Database   No. of speakers   No. of patterns per word   Total No. of patterns
Training   40                40 x 4 = 160               160 x 12 = 1920
Testing    21                21 x 4 = 84                84 x 12 = 1008
Neural network setup and training
The perceptron is the basic element of an NN [6]. Each perceptron uses the sigmoidal
(logistic) non-linear activation function. The architecture of the NN is defined by [h1 h2 o],
where h1, h2, and o are the numbers of neurons in the first hidden layer, the second hidden
layer, and the output layer, respectively. The number of neurons in the output layer
corresponds to the number of classes. Such an architecture is referred to as a 3-layer MLP
(multilayer perceptron), or a 4-layer MLP if the input layer is also counted. The
number of neurons is set according to [2]. The NN was originally trained with the Matlab
algorithm “learngdm” (gradient descent with momentum weight and bias learning function).
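The [h1 h2 o] architecture and the idea of gradient descent with momentum can be sketched in a few lines. This is an illustration under stated assumptions, not a port of the Matlab code: the input size of 60 is a hypothetical placeholder (the paper does not state it), the weight initialization is arbitrary, and the update rule is the generic momentum form, without claiming to match the exact sign and scaling conventions of “learngdm”.

```python
import numpy as np

def sigmoid(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-x))

def init_mlp(n_inputs, layers, rng):
    """Random weights and zero biases for an MLP with layer sizes [h1 h2 o]."""
    sizes = [n_inputs] + list(layers)
    return [(rng.normal(0.0, 0.1, size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Propagate one feature vector through all layers with sigmoid units."""
    for weights, bias in params:
        x = sigmoid(x @ weights + bias)
    return x

def momentum_step(weights, grad, prev_delta, lr=0.1, mc=0.9):
    """Generic gradient descent with momentum: blend the previous step with the new gradient."""
    delta = mc * prev_delta - (1.0 - mc) * lr * grad
    return weights + delta, delta

rng = np.random.default_rng(0)
N_INPUTS = 60  # hypothetical feature-vector length
nnw = init_mlp(N_INPUTS, [96, 67, 12], rng)
outputs = forward(nnw, rng.normal(size=N_INPUTS))

# Momentum update demonstrated on a toy error E(w) = 0.5 * ||w - target||^2,
# whose gradient is simply (w - target).
target = np.array([1.0, -2.0])
w = np.zeros(2)
delta = np.zeros(2)
for _ in range(2000):
    w, delta = momentum_step(w, w - target, delta)
```

In a full training loop the gradient would come from backpropagation through `forward`; the toy error is used here only to keep the example short.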
Two NNs are used in the proposed system. The first neural network (NNW) classifies words
into twelve classes (“jeden”, “jedna”, “dva”, … “deväť”, “nula”). Note that there are only 10
semantic classes, so during testing the NNW outputs that represent the same digit are merged.
The second neural network (NNS) classifies the speech into two classes, which represent the
number of syllables in the word.
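The merging of the twelve word classes into ten digits can be illustrated as follows. The paper does not say how the merge is implemented; this sketch assumes the simplest option, taking the winning word class and mapping it to its digit, and the names `WORDS`, `WORD_TO_DIGIT`, `recognized_digit`, and `digit_accuracy` are hypothetical.

```python
WORDS = ["jeden", "jedna", "dva", "dve", "tri", "štyri",
         "päť", "šesť", "sedem", "osem", "deväť", "nula"]

# "jeden"/"jedna" both mean one and "dva"/"dve" both mean two.
WORD_TO_DIGIT = {"jeden": 1, "jedna": 1, "dva": 2, "dve": 2, "tri": 3, "štyri": 4,
                 "päť": 5, "šesť": 6, "sedem": 7, "osem": 8, "deväť": 9, "nula": 0}

def recognized_digit(scores):
    """Map the highest-scoring of the 12 NNW outputs to one of the 10 digits."""
    best = max(range(len(WORDS)), key=lambda i: scores[i])
    return WORD_TO_DIGIT[WORDS[best]]

def digit_accuracy(all_scores, true_words):
    """Fraction of test patterns whose recognized digit matches the spoken digit."""
    hits = sum(recognized_digit(s) == WORD_TO_DIGIT[w]
               for s, w in zip(all_scores, true_words))
    return hits / len(true_words)

# A pattern spoken as "jeden" but classified as "jedna" still counts as correct.
scores = [0.30, 0.60] + [0.01] * 10
print(recognized_digit(scores), digit_accuracy([scores], ["jeden"]))
```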
Table 2. Division of classes into monosyllable and disyllable subgroups
Class          jeden  jedna  dva  dve  tri  štyri  päť  šesť  sedem  osem  deväť  nula
Monosyllable                 ●    ●    ●           ●    ●
Disyllable     ●      ●                     ●                 ●      ●     ●      ●
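One way to use the syllable count in the classifier is to restrict the NNW decision to the subgroup indicated by the NNS. The paper does not specify the exact mechanism, so the masking shown here is an assumption, and the names `SYLLABLES` and `classify_with_syllables` are hypothetical.

```python
WORDS = ["jeden", "jedna", "dva", "dve", "tri", "štyri",
         "päť", "šesť", "sedem", "osem", "deväť", "nula"]

# Number of syllables per word (1 = monosyllable, 2 = disyllable).
SYLLABLES = {"jeden": 2, "jedna": 2, "dva": 1, "dve": 1, "tri": 1, "štyri": 2,
             "päť": 1, "šesť": 1, "sedem": 2, "osem": 2, "deväť": 2, "nula": 2}

def classify_with_syllables(word_scores, n_syllables):
    """Pick the best-scoring word among those with the detected number of syllables."""
    candidates = [i for i, w in enumerate(WORDS) if SYLLABLES[w] == n_syllables]
    best = max(candidates, key=lambda i: word_scores[i])
    return WORDS[best]

# NNW slightly prefers "tri" over "štyri"; the syllable count resolves the confusion.
scores = [0.05] * 12
scores[WORDS.index("tri")] = 0.60
scores[WORDS.index("štyri")] = 0.55
print(classify_with_syllables(scores, n_syllables=2))
```

The same masking also shows why a wrong syllable count is costly: if the NNS reports the wrong subgroup, the correct word is excluded from the candidates and cannot be recognized.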
Classification
The ASR system is tested in Matlab 7. The neural network for syllable detection
(NNS, architecture [70 30 2]) is trained to an accuracy of 96.23 % and is not trained any
further. The NNW (architecture [96 67 12]) is trained gradually, with a decreasing maximal
training error (training goal). The training is divided into 30 parts (periods). After each of
these parts, the NNW was tested on the test set of the database. Recognition results during
the training are shown in Table 3, which also lists the word recognition rates with syllable
detection.
It is obvious that the ASR system would reach its maximal recognition performance
if the NNS made no errors in syllable detection (RRS = 100 %). This is the upper limit
of the performance of such an ASR system with syllable detection. The recognition
accuracy results are given in Table 3 and in Figure 5.
Table 3. Word recognition rates [%] for ASR with one NN (NNW) and ASR with syllable detection (NNW
+ NNS). Goal – NNW training goal; NNW – neural network classifying into 10 classes {Slovak digits};
NNS – neural network classifying into 2 classes {monosyllables, disyllables} with recognition rate RRS
= 96 % or RRS = 100 % (theoretical limit)
No. of period   Goal for NNW   Only NNW   NNW + NNS (RRS = 96 %)   NNW + NNS (RRS = 100 %)
1               0.361          16.67      25.00                    26.19
2               0.215          8.63       16.77                    16.77
3               0.109          14.68      27.78                    29.37
4               0.068          47.12      58.63                    61.61
5               0.040          67.76      73.71                    75.79
6               0.031          74.80      81.35                    84.82
7               0.024          82.34      86.71                    89.38
8               0.020          85.52      88.39                    92.06
9               0.017          86.71      89.58                    93.06
10              0.013          86.90      88.29                    91.37
11              0.011          86.51      89.09                    92.66
12              0.010          88.79      90.77                    94.05
13              0.007          89.19      90.18                    93.75
14              0.0062         90.18      90.77                    94.54
15              0.0052         89.29      90.58                    94.15
16              0.0049         90.97      90.87                    94.54
17              0.0037         90.77      91.37                    95.14
18              0.0035         90.58      90.48                    94.15
19              0.0026         90.87      91.67                    95.44
20              0.0022         91.47      91.77                    95.54
21              0.0019         91.96      92.46                    96.13
22              0.0015         91.17      91.87                    95.54
23              0.0013         91.87      91.87                    95.63
24              0.0011         91.57      92.06                    95.83
25              0.0010         91.07      91.87                    95.63
26              0.0008         91.37      92.26                    95.83
Figure 5. Recognition rate from Table 3. Axes: RR [%] versus No. of period. Curves: ASR, NNW into 12 classes; ASR with syllable detection, NNS: RRS = 100 %; ASR with syllable detection, NNS: RRS = 96 %.
Figure 6. Detail of recognition rate from Table 3 (No. of period: 8 - 26). Axes: RR [%] versus No. of period. Curves: NNW into 12 classes; NNS: RRS = 100 % (theoretical limit); NNS: RRS = 96 %.
Figure 7. Recognition performance contributions if NNS (syllable detection) is used. Axes: delta [%] versus No. of period.
delta real = RR(NNW + NNS, RRS = 96 %) - RR(NNW); delta 100 = RR(NNW + NNS, RRS = 100 %) - RR(NNW)
Conclusion
This paper presents an experiment on a neural network based ASR system with syllable
detection. The results of the experiments on Slovak digit recognition are shown in Table 3 and
Figures 5 - 7. The best classification result with the NN was reached after 21 training periods
(Table 3). The neural network that classifies words into 12 classes has a recognition
accuracy of RR = 91.96 %. (Note that there are only 10 semantic classes, so during testing the
NNW outputs that represent the same digit are merged.) If syllable detection with a detection
accuracy of RRS = 96.23 % is added to the system, the word recognition accuracy increases
to RRW+S = 92.46 %.
It is also shown that if the syllable detection block had 100 % accuracy, the
recognition rate of the proposed system would increase further, up to 96.13 %. This means
that the maximal increment of the recognition rate obtained by incorporating syllable
detection into this ASR system would be about 4.17 percentage points (96.13 % - 91.96 %).
It is interesting that during training, the ASR system with syllable detection may
perform worse than the system without syllable detection in some training periods. This
is seen in the 16th and 18th training periods (Figure 5).
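The figures quoted in this section can be re-derived from the recognition rates listed in Table 3. The short script below does this arithmetic; the values are copied from the table, while the variable names are chosen for this example only.

```python
# Recognition rates [%] from Table 3, training periods 1 to 26.
only_nnw = [16.67, 8.63, 14.68, 47.12, 67.76, 74.80, 82.34, 85.52, 86.71,
            86.90, 86.51, 88.79, 89.19, 90.18, 89.29, 90.97, 90.77, 90.58,
            90.87, 91.47, 91.96, 91.17, 91.87, 91.57, 91.07, 91.37]
with_nns_96 = [25.00, 16.77, 27.78, 58.63, 73.71, 81.35, 86.71, 88.39, 89.58,
               88.29, 89.09, 90.77, 90.18, 90.77, 90.58, 90.87, 91.37, 90.48,
               91.67, 91.77, 92.46, 91.87, 91.87, 92.06, 91.87, 92.26]
with_nns_100 = [26.19, 16.77, 29.37, 61.61, 75.79, 84.82, 89.38, 92.06, 93.06,
                91.37, 92.66, 94.05, 93.75, 94.54, 94.15, 94.54, 95.14, 94.15,
                95.44, 95.54, 96.13, 95.54, 95.63, 95.83, 95.63, 95.83]

# Period with the best result of the NNW alone (periods are numbered from 1).
best = max(range(len(only_nnw)), key=lambda i: only_nnw[i])
best_period = best + 1

# Gains in percentage points at that period (delta real and delta 100 in Figure 7).
gain_real = round(with_nns_96[best] - only_nnw[best], 2)
gain_ideal = round(with_nns_100[best] - only_nnw[best], 2)

# Periods in which the system with the real syllable detector is worse than NNW alone.
worse_periods = [i + 1 for i in range(len(only_nnw)) if with_nns_96[i] < only_nnw[i]]

print(best_period, gain_real, gain_ideal, worse_periods)
```

With the real syllable detector the gain at the best period is therefore much smaller than the theoretical limit, which indicates how strongly the overall result depends on the accuracy of the NNS.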
References
[1] Jarina, R., Kuba, M.: Speech Recognition using Hidden Markov Model with Low
Redundancy in the Observation Space, Komunikacie, vol. 4, 2004, pp. 15-19.
[2] Olajec, J., Jarina, R.: An experiment in isolated digit recognition by neural network,
Transcom 2005, Proceedings, Section 3, June 2005, Žilina, Slovak Republic,
ISBN 80-8070-415-5, pp. 187-190.
[3] Ariki, Y., Mizuta, S., Nagata, M., Sakai, T.: Spoken-word recognition using dynamic
features analyzed by two-dimensional cepstrum, IEE Proceedings, vol. 136, 1989.
[4] O’Shaughnessy, D.: Interacting with computers by voice: Automatic Speech Recognition
and Synthesis, Proceedings of the IEEE, vol. 91, no. 9, 2003, pp. 1272-1305.
[5] Jarina, R.: Kepstrálno spektrálny model pre rozpoznávanie rečových signálov, Doctoral
thesis, Department of Telecommunications, University of Žilina, Nov. 1999.
[6] Haykin, S.: Neural Networks: A Comprehensive Foundation, McMaster University,
Hamilton, Ontario, Canada, 1994, ISBN 0-02-352761-7.