1. INTRODUCTION:
The objective of this thesis is to research and develop prosodic features for
discriminating proper names used in alerting (e.g., “John, can I have that book?”)
from referential context (e.g., “I saw John yesterday”). Prosodic measurements
based on pitch and energy are analyzed to introduce new prosodic-based features
to the Wake-Up-Word Speech Recognition System (Këpuska V. C., 2006). During
the process of finding and analyzing the prosodic features, an innovative data
collection method was designed and developed.
In a conventional automatic speech recognition system, the users are required to
physically activate the recognition system by clicking a button or by manually
starting the application. Using the Wake-Up-Word Speech Recognition System, a
person can activate a system by using their voice only. The Wake-Up-Word
Speech Recognition System will eventually further improve the way people use
speech recognition by enabling speech only interfaces.
In the Wake-Up-Word Speech Recognition System, a word or phrase is used as a
“Wake-Up-Word” (WUW) indicating to the system that the user requires its
attention (e.g., alerting context). Any user can activate the system by uttering a
WUW (e.g., “Operator”), that will enable the application to accept the command
that follows (e.g., “Next slide please”). The non-Wake-Up-Words (non-WUWs)
include the WUWs uttered in referential context, other words, sounds, and noise.
Since the WUW may also occur within a referential context, and therefore
indicating that the user does not need attention from the system, it is important
for the system to be able to discriminate accurately between the two. The
following examples further demonstrate the use of the word “Operator” in those
two contexts:
Example sentence 1: “Operator, please go to the next slide.” (alerting context)
Example sentence 2: “We are using the word operator as the WUW.” (referential context)
The above cases indicate different user intentions. In the first example, the word
“operator” is used to alert the system and get its attention. In the
second example, the same word, “operator”, is used in a
“referential context”. The current Wake-Up-Word Speech Recognition system
uses only the pre- and post-WUW silence as a prosodic feature to
differentiate the alerting and referential contexts.
In this thesis, pitch and energy-based prosodic features are investigated. The
problem of general prosodic analysis is introduced in Section 1.1. In Chapter 2, the
use of pitch as a prosodic feature is described. In general, pitch represents the
intonation of the speech, and the intonation is used to convey linguistic and
paralinguistic information of that speech (Lehiste, 1970). The definition and
characteristics of pitch will be covered in Section 2.1. In Section 2.2, a pitch
estimation method known as Enhanced Super Resolution Fundamental Frequency
Determinator or eSRFD (Bagshaw, 1994) is introduced. Finally, in Section 2.3,
multiple pitch-based features are derived from the pitch measurements and evaluated to
find the feature that best discriminates WUWs used in alerting contexts from those used in
referential contexts.
In Chapter 3, an additional prosodic feature based on energy measurement is
described. The definition of prominence, an important prosodic feature based on
energy and pitch, and its characteristics will be covered in Section 3.1. In Section
3.2, a description of energy computation is presented. Finally, in Section 3.3, a
derivation of multiple energy features from the energy measurement is presented
and analyzed.
In Chapter 4, an innovative approach to speech data collection is presented.
After a number of prosodic analysis experiments were conducted using the WUWII Corpus
(Tudor, 2007), it was deemed necessary to validate the results obtained using
a different data set. Since, to our knowledge, no specialized speech database is
available, an idea from Dr. R. Wallace was adopted: to collect the data from
movies. We designed a system that extracts speech from the audio channel and,
if necessary, video information from recorded media (e.g., DVD) of movies and/or
TV series. This system is currently under development by Dr. Këpuska’s VoiceKey
Group.
The problem definition and system introduction will be explained in Section 4.1,
followed by the system design in Section 4.2.
1.1 PROSODIC ANALYSIS
The word prosody refers to the intonation and rhythmic aspect of a language
(Merriam-Webster Dictionary). Its etymology comes from ancient Greek, where it
was used in singing with instrumental music. In later times, the word was used for
the “science of versification” and the “laws of meter” (William J. Hardcastle, 1997),
governing the modulation of the human voice in reading poetry aloud. In modern
phonetics, the word prosody most often refers to those properties of speech
that cannot be derived from the segmental sequence of phonemes underlying
human utterances.
Human speech cannot be fully characterized as the expression of phonemes,
syllables, or words. For example, we can notice that segments or
syllables are shortened or lengthened in normal speech, apparently in accordance
with some pattern. We can also hear that pitch moves up and down in some non-
random way, providing speech with recognizable melody. In addition, one can
hear that some syllables or words are made to sound more prominent than others.
Based on the phonological aspect, prosody can be classified into prosodic
structure, tune, and prominence which can be described as follows:
1. Prosodic structure refers to the noticeable breaks or disjunctures between
words in sentences, which can also be interpreted as the duration of the
silence between words as a person speaks. This factor has been considered
in the current Wake-Up-Word Speech Recognition system, which requires a
minimal silence period both before and after the WUW. The
silence period just before the WUW is usually longer than the average
silence period of non-WUWs or other parts of the sentence.
2. Tune refers to the intonational melody of an utterance (Jurafsky & Martin)
which can be quantified by pitch measurement also known as fundamental
frequency of the speech. The details on the pitch characteristic, pitch
estimation algorithm, and the usage of pitch features are presented and
explained in Chapter 2.
3. Prominence includes the measurement of stress and accent in
speech. Prominence is measured in our experiments using the energy of
the sound signal. The details of energy computation, feature derivation
based on energy, and the experimental results are presented in Chapter 3.
2. PITCH FEATURES
In this chapter, the intonation melody of an utterance, computed using pitch
measurements, is described. The pitch characteristics and a comparison of various
pitch estimation algorithms (Bagshaw, 1994) are covered in Section 2.1. Based on
the comparison results of multiple fundamental frequency determination
algorithms (FDAs), the Enhanced Super Resolution Fundamental Frequency
Determinator (eSRFD) (Bagshaw, 1994) is selected as the algorithm of choice to
perform the pitch estimation. The details of the eSRFD algorithm are covered in
Section 2.2. Derivation of multiple pitch-based features and their performance
evaluations are covered in Section 2.3.
2.1 PITCH AND PITCH ESTIMATION METHODS
Intonation is one of the prosodic features that may contain the information
needed to discriminate between the referential context and the alerting
context. The intonation of speech is strictly interpreted as “the ensemble of pitch
variations in the course of an utterance” (Hart, 1975). Unlike tonal languages such
as Mandarin Chinese, in which lexical forms are characterized by different
levels or patterns of pitch on a particular phoneme, the intonational
languages such as English, German, the Romance languages, and Japanese use
pitch syntactically. In addition, intonation patterns in the intonational
languages span groups of words, which are called intonation groups.
Intonation groups of words are usually uttered in one single breath. The pitch
measurement in the intonational languages reveals the emotion of a person
and/or the intention of his/her speech. For example, consider the following
sentence:
Can you pass me the phone?
The pattern of continuously rising pitch in the last three words in the above
sentence indicates a request.
Strictly speaking, pitch is defined as the fundamental frequency or fundamental
repetition rate of a sound. The typical pitch range is 60-200 Hz for adult males
and 200-400 Hz for adult females and children. Contraction of the vocal folds in
humans produces a relatively high pitch and, conversely, expanded vocal folds
produce a lower pitch. This explains why a person’s voice rises in pitch
when he/she gets nervous or surprised. The fact that human males usually have a lower
voice pitch than females and children can also be explained: males
usually have longer and larger vocal folds.
After years of development of pitch estimation algorithms, pitch estimation
methods can be classified into the following three categories:
1. Frequency-domain methods, such as CFD (Cepstrum-based F0 determinator)
and HPS (Harmonic product spectrum), use a frequency domain
representation of the speech signal to find the fundamental frequency.
2. Time-domain methods, such as FBFT (Feature-based F0 tracker)
(Phillips, 1985), which uses perceptually motivated features, and PP (Parallel
Processing), produce fundamental frequency estimates by
analyzing the waveform in the time domain.
3. Cross-correlation methods, such as IFTA (Integrated F0 tracking algorithm)
and SRFD (Super resolution F0 determinator), use a waveform similarity
metric based on a normalized cross-correlation coefficient.
The eSRFD (Enhanced Super Resolution Fundamental Frequency
Determinator) method (Bagshaw, 1994) was chosen to extract the pitch measurement for
the Wake-Up-Word because of its high overall accuracy. According to Bagshaw’s
experiments, the eSRFD algorithm achieves a combined voiced and unvoiced
error rate below 17% and a low-gross fundamental frequency error rate
of 2.1% and 4.2% for males and females, respectively. Figure 2-1 and Figure 2-2
show the error rate comparison charts between eSRFD and other FDAs for male
and female voices, respectively.
[Bar chart of error rates for CFD, HPS, FBFT, PP, IFTA, SRFD, and eSRFD; series: Gross Error Low, Gross Error High, Voiced, Unvoiced]
Figure 2-1 FDA Evaluation Chart: Male Speech. Reproduced from (Bagshaw, 1994)
In Figure 2-1 and Figure 2-2, the purple bars indicate the low-gross F0 error, which
refers to the halving error, where the pitch has been wrongly estimated at
about half of the actual pitch. The green bars represent the high-gross F0
error, which refers to the doubling error, where the pitch has been wrongly
estimated at about twice the actual pitch. The voiced error,
represented by red bars, refers to unvoiced frames that have been
misidentified as voiced by the FDA. Finally, the blue bars show the unvoiced
error, which refers to voiced frames that have been misidentified as unvoiced.
[Bar chart of error rates for CFD, HPS, FBFT, PP, IFTA, SRFD, and eSRFD; series: Gross Error Low, Gross Error High, Voiced, Unvoiced]
Figure 2-2 FDA Evaluation Chart: Female Speech. Reproduced from (Bagshaw, 1994)
Figure 2-1 and Figure 2-2 refer to male and female fundamental frequency
evaluation charts. They show that the eSRFD algorithm achieves the lowest overall
error rate. This result was confirmed in a more recent study (Veprek & Scordilis,
2002). Consequently, eSRFD was chosen to be the FDA used in the present project.
2.2 ESRFD PITCH ESTIMATION ALGORITHM
The eSRFD (Bagshaw, 1994) is the advanced version of SRFD (Medan, 1991). The
program flow chart of the eSRFD FDA is illustrated in Figure 2-3.
The theory behind the SRFD (Medan, 1991) algorithm is to use a normalized cross-
correlation coefficient to quantify the degree of similarity between two adjacent,
non-overlapping sections of speech. In eSRFD, a frame is divided into three
consecutive sections instead of two as in the original SRFD algorithm.
First, the sample waveform is passed through a low-pass filter to
remove signal noise. The sample utterance is then divided into non-overlapping
frames of 6.5 ms length ($t_{interval} = 6.5$ ms). Each frame contains a set
of samples $s_N = \{s(i) \mid i = -N_{max}, \ldots, N - N_{max}\}$, which is divided into three
consecutive segments, each containing the same (variable) number
of samples, n. The segmentation is defined by Equation 2-1 and is
further described in Figure 2-4 below.
\[
\begin{aligned}
x_n &= \{x(i) = s(i - n) \mid i = 1, \ldots, n\} \\
y_n &= \{y(i) = s(i) \mid i = 1, \ldots, n\} \\
z_n &= \{z(i) = s(i + n) \mid i = 1, \ldots, n\}
\end{aligned}
\]
Equation 2-1
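The segmentation can be illustrated with a minimal Python sketch. It assumes Equation 2-1 is read as x(i) = s(i - n), y(i) = s(i), and z(i) = s(i + n); the function name `split_segments` and the `center` argument (the list position that plays the role of s(0)) are hypothetical and not part of eSRFD itself.

```python
def split_segments(s, center, n):
    """Return the (x, y, z) segments of length n around list index `center`.

    `center` is the position in `s` treated as s(0) (an assumption of this
    sketch). Returns None when any segment would fall outside the signal.
    """
    first = center + 1 - n      # index of x(1) = s(1 - n)
    last = center + 2 * n       # index of z(n) = s(2n)
    if n < 1 or first < 0 or last >= len(s):
        return None
    x = [s[center + i - n] for i in range(1, n + 1)]
    y = [s[center + i] for i in range(1, n + 1)]
    z = [s[center + i + n] for i in range(1, n + 1)]
    return x, y, z
```

Returning None mirrors the case where a segment is undefined near the beginning or end of the speech file.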
Figure 2-3 eSRFD Flow chart
Figure 2-4 Analysis segments of eSRFD FDA
In eSRFD, each frame is processed by a silence detector, which labels the frame as
unvoiced or silent if the sum of the absolute values of xmin, xmax, ymin, ymax, zmin,
and zmax is smaller than a preset value (e.g., a 50 dB signal-to-noise level); otherwise,
the frame is labeled as voiced.
No fundamental frequency is searched for if the frame is marked as an unvoiced
frame. In those cases where at least one of the segments xn, yn, or zn is not
defined, which usually happens at the beginning and the end of
the speech file, the frame is labeled as unvoiced and no FDA is
applied to it.
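The silence test described above can be sketched as follows. This is only an illustration: the function name `is_silent` is hypothetical, and the threshold is passed in raw amplitude units, since the mapping from the signal-to-noise level in the text to an amplitude value is not specified here.

```python
def is_silent(x, y, z, threshold):
    """Label a frame silent/unvoiced when the summed absolute extremes are small.

    Sums |min| and |max| of each of the three segments and compares the total
    with `threshold` (an amplitude value chosen by the caller; an assumption).
    """
    total = sum(abs(v) for seg in (x, y, z) for v in (min(seg), max(seg)))
    return total < threshold
```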
If the frame is not labeled as unvoiced, then candidate values for
the fundamental period are searched from values of n within the range Nmin to
Nmax by using the normalized cross-correlation coefficient Px,y(n) as described by
Equation 2-2.
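\[
P_{x,y}(n) = \frac{\sum_{j=1}^{\lfloor n/L \rfloor} x(jL)\, y(jL)}{\sqrt{\sum_{j=1}^{\lfloor n/L \rfloor} x(jL)^2 \cdot \sum_{j=1}^{\lfloor n/L \rfloor} y(jL)^2}},
\qquad \{n = N_{min} + iL \mid i = 0, 1, \ldots;\ N_{min} \le n \le N_{max}\}
\]
Equation 2-2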
In Equation 2-2, the decimation factor L is used to lower the computational load of
the algorithm. Smaller L values allow higher resolution but also increase
the computational load of the FDA. Larger L values produce faster computation
with a lower resolution search. The L value is set to 1 here, since the purpose of this
research is to find, as accurately as possible, the relationship between pitch
measurements and WUWs. Computational speed is therefore considered
secondary and is not taken into account. However, the value of L will be
reconsidered when this algorithm is integrated into the WUW Speech Recognition
System.
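A minimal sketch of a normalized cross-correlation with a decimation factor is given below, to make the role of L concrete. It assumes the usual normalization by the square root of the product of the two segment energies; the function name `normalized_xcorr` is hypothetical.

```python
import math


def normalized_xcorr(a, b, L=1):
    """Normalized cross-correlation of two equal-length segments.

    Only every L-th sample is used (samples jL for j = 1..floor(n/L), with
    1-based indexing as in the text), so a larger L means fewer products.
    Returns 0.0 when either decimated segment has zero energy.
    """
    n = len(a)
    idx = range(L - 1, (n // L) * L, L)   # 0-based positions of samples jL
    num = sum(a[k] * b[k] for k in idx)
    den = math.sqrt(sum(a[k] ** 2 for k in idx) * sum(b[k] ** 2 for k in idx))
    return num / den if den else 0.0
```

With L = 1 every sample contributes, which matches the setting used in this work.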
Figure 2-5 Analysis segments for Px,y(n) in the eSRFD
The candidate values of the fundamental period of a frame are found by locating
peaks in the normalized cross-correlation result of Px,y(n). If this value exceeds a
specified threshold, Tsrfd, then the frame is further considered to be a voiced
candidate. This threshold is adaptive and depends on the voicing classification
of the previous frame and three preset parameters. The definition of Tsrfd is
given in Equation 2-3. If the previous frame is unvoiced or silent, then Tsrfd is
set equal to 0.88. If the previous frame is voiced, then Tsrfd is equal to
the larger of 0.75 and 0.85 times P’x,y, the value of Px,y for the previous
frame. The threshold is lowered in this case because the present frame is more
likely to be voiced if the previous frame is also voiced.
\[
T_{srfd} =
\begin{cases}
0.88 & \text{if the previous frame is unvoiced or silent} \\
\max[0.75,\ 0.85\,P'_{x,y}(n'_0)] & \text{if the previous frame is voiced}
\end{cases}
\]
Equation 2-3
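The adaptive threshold can be written as a small function. This sketch follows the prose description above (0.88 after an unvoiced or silent frame; the larger of 0.75 and 0.85 times the previous coefficient after a voiced frame). The function and argument names are hypothetical.

```python
def srfd_threshold(prev_voiced, prev_pxy=0.0):
    """Return Tsrfd given the voicing decision of the previous frame.

    `prev_pxy` is the Px,y value of the previous frame and is only used when
    that frame was voiced.
    """
    if not prev_voiced:
        return 0.88
    return max(0.75, 0.85 * prev_pxy)
```

Because Px,y cannot exceed 1, the threshold after a voiced frame never exceeds 0.85, which is lower than the 0.88 used after an unvoiced or silent frame.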
In case no candidates for the fundamental period are found in the frame, then the
frame is reclassified as ‘unvoiced’ and no further processing will be applied to the
unvoiced frame. On the other hand, if the frame is classified as voiced, then the
following process will be used to find the optimal candidate as described next.
After getting the first normalized cross-correlation coefficient Px,y, the second
normalized cross-correlation coefficient Py,z, will be calculated for the voiced
frame. The normalized cross-correlation coefficient Py,z is described by Equation
2-4.
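\[
P_{y,z}(n) = \frac{\sum_{j=1}^{\lfloor n/L \rfloor} y(jL)\, z(jL)}{\sqrt{\sum_{j=1}^{\lfloor n/L \rfloor} y(jL)^2 \cdot \sum_{j=1}^{\lfloor n/L \rfloor} z(jL)^2}},
\qquad \{n = N_{min} + iL \mid i = 0, 1, \ldots;\ N_{min} \le n \le N_{max}\}
\]
Equation 2-4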
After the second normalized cross-correlation, a score is given to each
candidate. If a candidate pitch value of a frame has both Px,y and Py,z values
larger than Tsrfd, then a score of 2 is given to the candidate. If only Px,y is
above Tsrfd, then a score of 1 is assigned to the candidate. A higher score
indicates a higher likelihood that the candidate represents the fundamental
period of the frame. After candidate scores are given, if there are one or more
candidates with a score of 2, then all candidates with a score of 1 in that frame are
removed from the candidate list. If there is only one candidate with a score of 2,
then that candidate is assumed to be the best estimate of the fundamental
period of that particular frame. If there are multiple candidates with a score of 1
but no candidates with a score of 2, then an optimal fundamental period is sought from
the remaining candidates.
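The scoring and pruning rules above can be sketched as follows. The data layout (a list of `(n, p_xy, p_yz)` tuples) and the function name `score_candidates` are assumptions made for illustration only.

```python
def score_candidates(candidates, threshold):
    """Score fundamental-period candidates and prune the weaker ones.

    `candidates` is a list of (n, p_xy, p_yz) tuples. A candidate scores 2 when
    both coefficients exceed `threshold` and 1 when only p_xy does; candidates
    whose p_xy does not exceed the threshold are not candidates at all.
    If any candidate scores 2, every candidate with a score of 1 is dropped.
    """
    scored = []
    for n, p_xy, p_yz in candidates:
        if p_xy > threshold:
            scored.append((n, 2 if p_yz > threshold else 1))
    if any(score == 2 for _, score in scored):
        scored = [(n, score) for n, score in scored if score == 2]
    return scored
```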
In the case of multiple candidates with a score of 1 but no candidates with a score of 2,
the candidates are sorted in ascending order of fundamental period. The last
candidate in the list, which has the largest fundamental period, is denoted nM,
and the m-th candidate is denoted nm.
Figure 2-6 Analysis segments for q(nm) in the eSRFD
Then the third normalized cross-correlation coefficient, q(nm), between two sections
of length nM spaced nm apart, is calculated for each candidate. The section nM in a
frame is illustrated in Figure 2-6, and Equation 2-5 describes the normalized
cross-correlation coefficient q(nm) used in this case.
\[
q(n_m) = \frac{\sum_{j=1}^{n_M} s(j)\, s(j + n_M + n_m)}{\sqrt{\sum_{j=1}^{n_M} s(j)^2 \cdot \sum_{j=1}^{n_M} s(j + n_M + n_m)^2}}
\]
Equation 2-5
After the third normalized cross-correlation coefficient is computed, the
first candidate on the list is assumed to be the optimal one. If the
q(nm) of a following candidate, multiplied by 0.77, is larger than the q(nm) of
the current optimal candidate, then that candidate becomes the new optimal candidate. We
apply the same rule throughout the entire list of candidates, resulting in the
optimal candidate value.
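The selection rule can be sketched in a few lines. The sketch assumes the q(nm) values have already been computed and that the candidates are supplied in ascending order of fundamental period; the function name `select_optimal` and the tuple layout are hypothetical.

```python
def select_optimal(candidates_q, weight=0.77):
    """Pick the optimal candidate from (period, q) pairs sorted by period.

    The first candidate starts as the optimum; a later candidate replaces it
    only if its q value, scaled by `weight`, still exceeds the current best q.
    """
    best_period, best_q = candidates_q[0]
    for period, q in candidates_q[1:]:
        if q * weight > best_q:
            best_period, best_q = period, q
    return best_period
```

The 0.77 weighting favors candidates earlier in the list, that is, shorter fundamental periods, unless a longer period is clearly better supported.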
In the case where only one candidate has a score of 1 and no candidate has a
score of 2, the likelihood that the candidate is the true fundamental
period of the frame is low. In such a case, if both the previous frame and the next
frame are silent, then the current frame is an isolated frame and is reclassified as a
silent frame. If either the previous or the next frame is a voiced frame, then we
assume the candidate of the current frame is the optimal one and it defines the
fundamental period of the current frame.
The above algorithm has a high likelihood of misidentifying a voiced frame as an
unvoiced or silent frame. In order to counteract this imbalance, biasing is applied
when all of the following three conditions are satisfied:
1. The two previous frames were voiced frames.
2. The fundamental period of the previous frame is not temporarily on hold.
3. The fundamental frequency of the previous frame is less than 7/4 times
the fundamental frequency of its next voiced frame and greater than 5/8
times that of the next frame.
After obtaining the fundamental frequency, and in order to further minimize the
occurrence of doubling or halving errors, the pitch contour is passed through a
median filter.
The median filter has a default length of 7, but the length decreases to 5 or 3
when there are fewer than 7 consecutive voiced frames. Figure 2-7 is an example
of doubling errors being corrected by the median filter. In Figure 2-7, the top row
shows the pitch measurement generated by the eSRFD FDA and the bottom row
shows the corrected measurement after the median filter. As we can see
from the figure, the two points marked as doubling errors were corrected by the
median filter.
Figure 2-7 Median filter example
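A minimal sketch of such a variable-length median filter is shown below. Several details are assumptions of this sketch rather than part of the description above: unvoiced frames are encoded as a pitch of 0, windows are truncated at the edges of a voiced run, runs shorter than 3 frames are left unfiltered, and `statistics.median_low` is used so that the output is always an observed pitch value.

```python
from statistics import median_low


def median_filter_pitch(pitch):
    """Median-filter each run of voiced frames (pitch > 0) in a pitch contour."""
    out = list(pitch)
    i = 0
    while i < len(pitch):
        if pitch[i] <= 0:            # unvoiced frame: leave untouched
            i += 1
            continue
        j = i
        while j < len(pitch) and pitch[j] > 0:
            j += 1
        run = pitch[i:j]
        # Largest window among 7, 5, 3 that fits in the run; 1 means no filtering.
        size = next((w for w in (7, 5, 3) if w <= len(run)), 1)
        half = size // 2
        for k in range(len(run)):
            window = run[max(0, k - half):k + half + 1]
            out[i + k] = median_low(window)
        i = j
    return out
```

For example, in the contour `[0, 100, 102, 200, 101, 99, 0]` the isolated value 200 (roughly double its neighbors) is replaced by 101.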
We applied the above pitch estimation method to the WUWII (Wake-Up-Word II)
corpus. The WUWII corpus contains 3410 sample utterances, and each utterance
contains at least one of the five different WUWs. The five WUWs are
‘Wildfire’, ‘Operator’, ‘ThinkEngine’, ‘Onword’ and ‘Voyager’. Figure 2-8 displays a
sample utterance containing the following sentence, where the word “Wildfire” is
the WUW of the sentence.
“Hi. You know, I have this cool wildfire service and, you
know, I'm gonna try to invoke it right now. Wildfire”
Figure 2-8 Example, WUWII00073_009.ulaw
In Figure 2-8, the first row shows the waveform of the speech, the second row
shows the pitch estimation from eSRFD FDA, the third shows the pitch estimation
after the median filter, and the last row shows the audio spectrogram of the
speech. The WUW of this sentence is ‘Wildfire’ which is the section delineated
between two red lines.
2.3 PITCH-BASED FEATURES
The pattern of the fundamental frequency contour of an utterance
represents the intonation of the speech. To the best of our knowledge, the
problem of discriminating between words used in an alerting context and
words used in a referential context has not been addressed before. To accomplish
this, a specialized speech data corpus containing WUWs is necessary. In this
project, the corpus named WUWII (Këpuska V.) was chosen. The WUWII corpus
contains 3410 sample utterances, and each utterance contains at least
one of the five different WUWs. The five WUWs are ‘Wildfire’, ‘Operator’,
‘ThinkEngine’, ‘Onword’ and ‘Voyager’.
Our hypothesis is that the intonation rises when the WUW is spoken; thus, there
should be an increase in the average pitch and/or maximum pitch of the
WUW sections compared to the non-WUW sections of the utterance.
Based on this hypothesis, the average pitch and maximum pitch of the
WUWs are considered, and twelve pitch-based features are derived and listed in
Table 2-1. The features are expressed as the relative change between A and B,
which is defined in Equation 2-6 as:
Relative Change between A and B = (A-B)/B.
Equation 2-6 Relative Change
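As an illustration of Equation 2-6, the sketch below computes the relative change and applies it to the APW_AP1SBW feature listed in Table 2-1. The helper names, the encoding of unvoiced frames as 0, and the choice to average over voiced frames only are assumptions of this sketch.

```python
def relative_change(a, b):
    """Equation 2-6: relative change between A and B."""
    return (a - b) / b


def mean_voiced(pitch):
    """Average pitch over voiced frames only (pitch > 0); None if there are none."""
    voiced = [p for p in pitch if p > 0]
    return sum(voiced) / len(voiced) if voiced else None


def apw_ap1sbw(wuw_pitch, prev_section_pitch):
    """Average pitch of the WUW relative to the section just before it."""
    a = mean_voiced(wuw_pitch)
    b = mean_voiced(prev_section_pitch)
    if a is None or b is None:
        return None                  # feature is not valid for this sample
    return relative_change(a, b)
```

A positive value indicates that the average pitch of the WUW is higher than that of the preceding section, which is what the hypothesis predicts.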
Feature Name | Feature Definition
--- | ---
APW_AP1SBW | The relative change of the average pitch of the WUW to the average pitch of the previous section just before the WUW.
AP1sSW_AP1SBW | The relative change of the average pitch of the first section of the WUW to the average pitch of the previous section just before the WUW.
APW_APAll | The relative change of the average pitch of the WUW to the average pitch of the entire speech sample excluding the WUW sections.
AP1sSW_APAll | The relative change of the average pitch of the first section of the WUW to the average pitch of the entire speech sample excluding the WUW sections.
APW_APAllBW | The relative change of the average pitch of the WUW to the average pitch of the entire speech sample before the WUW.
AP1sSW_APAllBW | The relative change of the average pitch of the first section of the WUW to the average pitch of the entire speech sample before the WUW.
MaxPW_MaxP1SBW | The relative change of the maximum pitch in the WUW sections to the maximum pitch in the previous section just before the WUW.
MaxP1sSW_MaxPAllBW | The relative change of the maximum pitch in the first section of the WUW to the maximum pitch of the previous section just before the WUW.
MaxPW_MaxPAll | The relative change of the maximum pitch of the WUW to the maximum pitch of the entire speech sample excluding the WUW sections.
MaxP1sSW_MaxPAll | The relative change of the maximum pitch of the first section of the WUW to the maximum pitch of the entire speech sample excluding the WUW sections.
MaxP1sSW_MaxPAllBW | The relative change of the maximum pitch in the first section of the WUW to the maximum pitch of the entire speech sample before the WUW.
MaxPW_MaxPAllBW | The relative change of the maximum pitch in the WUW sections to the maximum pitch of the entire speech sample before the WUW.
Table 2-1 Pitch Features definition
The pitch-based features have been calculated for the combination of all five
WUWs and for each of the five WUWs individually. The detailed
performance results are shown in Appendix A. In this section, the results of the
pitch-based features are shown and explained using the combination of all five WUWs.
This is presented in Table 2-2 below.
WUW: All WUWs Data
Features | Valid | Pt > 0 | % > 0 | Pt = 0 | % = 0 | Pt < 0 | % < 0
--- | --- | --- | --- | --- | --- | --- | ---
AEW_AE1SBW | 1479 | 1164 | 79 | 0 | 0 | 315 | 21
AE1sSW_AE1SBW | 1479 | 1283 | 84 | 1 | 0 | 240 | 16
AEW_AEAll | 2175 | 1059 | 49 | 9 | 9 | 1116 | 51
AE1sSW_AEAll | 2175 | 1155 | 53 | 2 | 0 | 1018 | 47
AEW_AEAllBW | 1969 | 1427 | 72 | 0 | 0 | 542 | 28
AE1sSW_AEAllBW | 1969 | 1562 | 79 | 3 | 0 | 404 | 21
MaxEW_MaxE1SBW | 1479 | 1244 | 84 | 20 | 1 | 215 | 15
MaxE1sSW_MaxEAllBW | 1479 | 1221 | 83 | 13 | 1 | 245 | 17
MaxEW_MaxEAll | 2175 | 1373 | 63 | 13 | 1 | 245 | 17
MaxE1sSW_MaxEAll | 2175 | 1336 | 61 | 25 | 1 | 814 | 37
MaxE1sSW_MaxEAllBW | 1969 | 1209 | 61 | 16 | 1 | 744 | 38
MaxEW_MaxEAllBW | 1969 | 1562 | 60 | 3 | 1 | 404 | 39
Table 5-2 Energy Features Result of All WUWs
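The columns of the table can be reproduced from per-sample relative changes with a small tally, sketched below. The function name and the dictionary keys are hypothetical, and percentages are assumed to be rounded to the nearest integer.

```python
def tally_relative_changes(changes):
    """Count positive, zero, and negative relative changes and their percentages."""
    valid = len(changes)
    positive = sum(1 for c in changes if c > 0)
    zero = sum(1 for c in changes if c == 0)
    negative = valid - positive - zero

    def pct(count):
        # Percentage of valid samples, rounded to the nearest integer.
        return round(100 * count / valid) if valid else 0

    return {
        "valid": valid,
        "pt_pos": positive, "pct_pos": pct(positive),
        "pt_zero": zero, "pct_zero": pct(zero),
        "pt_neg": negative, "pct_neg": pct(negative),
    }
```

As a check against the first row of the table, 1164 positive changes out of 1479 valid samples is about 79%, and 315 negative changes is about 21%.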
One can see from Table 5-2 that there are several energy-based features with positive
relative changes above 80%. In addition, some individual WUWs achieve multiple
energy-based features having positive relative change of 90% or more which is
covered in section 3.3 and detailed in Appendix B. These results provide firm
evidence that there are significant increases for the energy measurement when
WUWs are spoken. These results confirm that the prominence of WUWs is more
significant than the prominence of non-WUWs. Therefore, we can conclude that
energy-based features can be used to discriminate between WUWs and non-
WUWs. A future improvement would be to quantify the level of change comparing
WUWs to non-WUWs.
6. FUTURE WORK
Two potential solutions are being considered to address the insufficient
accuracy reported in this work for pitch-based features. They are outlined as follows:
1. Build a specialized corpus which contains the same words in both
WUWs and non-WUWs. The speech sentences in the current corpus,
WUWII, only contain WUWs but no non-WUWs. A new speech data
collection system is presented in Chapter 4, which will allow creation of
a database from the collected data that includes both WUWs and non-
WUWs.
2. Use different approaches in defining pitch-based features. For
example, when using average and maximum pitch measurements of
the WUW, how the pitch pattern changes should also be considered.
Finally, the new data collection system, which collects both WUWs and non-WUWs,
has been designed and partially implemented. Work on this data collection system
will be continued by the VoiceKey Group at Florida Institute of Technology. The
ultimate goal of this speech data collection project is to build a suitable specialized
corpus of data samples in order to find prosodic features that reliably
discriminate between WUWs and non-WUWs.