AUTOMATIC PRONUNCIATION VERIFICATION OF ENGLISH LETTER-NAMES
FOR EARLY LITERACY ASSESSMENT OF PRELITERATE CHILDREN
Matthew Black, Joseph Tepperman, Abe Kazemzadeh, Sungbok Lee, and Shrikanth Narayanan
Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, CA
{matthepb,tepperma,kazemzad,sungbokl}@usc.edu, shri@sipi.usc.edu
ABSTRACT

Children need to master reading letter-names and letter-sounds before reading phrases and sentences. Pronunciation assessment of letter-names and letter-sounds read aloud is an important component of preliterate children’s education, and automating this process can have several advantages. The goal of this work was to automatically verify letter-names spoken by kindergarteners and first graders in realistic classroom noise conditions. We applied the same techniques developed in our previous work on automatic letter-sound verification by comparing and optimizing different acoustic models, dictionaries, and decoding grammars. Our final system was unbiased with respect to the child’s grade, age, and native language and achieved 93.1% agreement (0.813 kappa agreement) with human evaluators, who agreed among themselves 95.4% of the time (0.891 kappa).

Index Terms— Children’s speech, pronunciation verification, automatic reading assessment, letter-names

1. INTRODUCTION

Children’s future reading proficiency and their ability to learn effectively through reading have been shown to be correlated with the mastery of reading the names of the letters (letter-names) and producing the sounds of the letters (letter-sounds) at an early age [1]. Assessing children’s skills in these reading tasks is an important element of early education to confirm that the children are learning.

Automatic assessment of letter-sounds and letter-names can have several advantages. The personalized assessment required to properly score a child’s reading level takes one-on-one time, which a teacher may not always be able to provide. Furthermore, an automatic system may remove some of the personal biases inherent in the judgment of the child’s reading level and standardize the grading process. In addition, automatic systems can provide teachers with a fine-grained analysis of the child’s pronunciation, offering them insight for future instructional planning.

This paper concentrates on automatically verifying letter-names spoken by preliterate children, complementing our previous work addressing the letter-sound task [2]. Please note that the letter-name verification task does not reduce to letter-name recognition. That is, we are not interested in specifying which letter-name the child said, but rather whether the letter-name pronunciation was read acceptably. In most letter-name recognition research (an application that arises, for example, when a person spells aloud an out-of-vocabulary word), the intended letter is not known ahead of time, but the assumption is that it is spoken correctly [3-6]. For this paper, we know what letter-name the child was prompted to say. The difficulty is robustly detecting the innumerable ways a child could produce an unacceptable pronunciation, while not penalizing a child for acceptable pronunciation variations (such as nonnative accent).

There are numerous engineering challenges in automatic letter-name verification for children. Children’s speech has high variability within and between speakers [7], and the data used in this research was collected in noisy classrooms from children with multiple language backgrounds. These conditions make it difficult to train representative acoustic models. Furthermore, many of the letter-names are acoustically similar (e.g., /eh m/ and /eh n/), and almost all of them share at least a common phoneme (e.g., /b iy/, /s iy/, /d iy/, /iy/, /jh iy/, /p iy/, /t iy/, /v iy/, and /z iy/). In addition, there is no word or letter context for this isolated letter-name reading task, so we cannot train language models, as is typically done in letter-name recognition tasks when the speaker is spelling real words [4].

We experimented with different acoustic models, dictionaries, and decoding grammars with the goal of attaining automatic letter-name verification with accuracy that neared human agreement. Section 2 describes the data we analyzed. Section 3 briefly describes our verification method, which builds upon our previous work on automatic letter-sound verification [2]. Section 4 shows the experimental results, with a discussion following in Section 5, including an in-depth error analysis and comparison to the letter-sound task and results. We conclude in Section 6.

2. CORPUS

We used data from the Technology-based assessment of language and literacy (Tball) Project [8,9]. The Tball corpus [10] was recorded in kindergarten to second grade classrooms in the greater Los Angeles area. Typical
background noise included speech from other children and the teacher. The corpus contains both native English and Spanish speakers; thus, we can expect certain pronunciation trends, as described in [11]. All 26 English alphabet characters were tested for the letter-name reading task. One lowercase letter was displayed on a computer screen for a maximum of five seconds before the next letter was shown. These transition times were automatically recorded and used to segment the files into single letter-name utterances.

We manually verified (accept/reject) 3508 letter-name utterances, of which 25.1% were rejected. 23.4% of these rejected utterances were due to the child saying nothing. 8.27% of all the utterances were marked as having at least one disfluency (fillers, repetitions, and/or repairs). Table 1 shows performance across various demographics that were provided for some of the children. Using the manual annotations, we created a test set with 780 files (30 files per letter-name) and a train set with the remaining 2728 files (approximately 105 files per letter-name). The data were partitioned so that the proportion of acceptable to unacceptable pronunciations was the same between the train and test set for each letter-name. To compute human agreement statistics, three trained native speakers verified the same 260 files (10 files per letter-name), randomly selected from the test set. Mean pairwise evaluator agreement was 95.38%, with kappa agreement of 0.8914.

were trained on 12 hours of isolated word-reading data (without letter-names), also recorded for the Tball Project. A background model was trained on silent and background noise portions of the utterances, and a single generic phone-level “garbage” model was trained on all speech segments. Five sets of acoustic models were iteratively trained directly on the letter-name train set, as described in [2]. All feature extraction and model training was done with HTK [12].

3.2. Dictionaries

A recognition dictionary that included all the acceptable letter-name pronunciations served as a baseline dictionary. This dictionary was not ideal since it did not take into account the fact that we knew what letter-name the child was prompted to say. For this reason, we also constructed five additional dictionaries that included unacceptable letter-name pronunciations from foreseeable categorical errors (Table 2). We then produced 32 sets of verification dictionaries through all 2^5 combinations of the five categories (none, LS, PE, SI, …, LS-PE, LS-SI, …, all). Each dictionary set contained a dictionary for each letter with acceptable letter-name pronunciations and appropriate unacceptable ones. We refer to the verification dictionary set that did not include any unacceptable pronunciations as the “none” set, and the one that included all the unacceptable pronunciations as the “all” set.
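
As a rough illustration of how the 32 dictionary sets arise, the sketch below enumerates every subset of the five category labels from Table 2 and merges acceptable and unacceptable pronunciations for one letter. The function names, the "accept"/"reject" tags, and the toy entries for the letter "v" are assumptions made for this example only; they are not the dictionary format or tooling used in our experiments.

```python
from itertools import combinations

# Category labels as listed in Table 2.
CATEGORIES = ["LS", "PE", "SI", "SP", "SPLN"]


def dictionary_set_names(categories):
    """Return one name per subset of the categories (2**n names in total)."""
    names = []
    for size in range(len(categories) + 1):
        for subset in combinations(categories, size):
            if size == 0:
                names.append("none")          # no unacceptable pronunciations
            elif size == len(categories):
                names.append("all")           # every category included
            else:
                names.append("-".join(subset))
    return names


def build_letter_dictionary(acceptable, unacceptable_by_category, subset):
    """Hypothetical helper: tag acceptable entries, then add the
    unacceptable entries of each selected category."""
    entries = [(pron, "accept") for pron in acceptable]
    for category in subset:
        for pron in unacceptable_by_category.get(category, []):
            entries.append((pron, "reject"))
    return entries


names = dictionary_set_names(CATEGORIES)
print(len(names))  # 32

# Toy data for the letter "v": the LS examples come from Table 2,
# the acceptable entry /v iy/ is the usual letter-name.
v_dict = build_letter_dictionary(["v iy"], {"LS": ["v", "v ah"]}, ["LS", "PE"])
print(v_dict)
```

Selecting a subset with no entries for a given letter (here PE for "v") simply adds nothing, so each letter’s dictionary contains only the unacceptable pronunciations that apply to it.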

Demographic                 Number   % Accepted
Gender            Female    1820     72.36
                  Male      1582     77.81
Grade             K         3012     75.13
                  1st       420      70.48
Age               5         1897     78.12
                  6         556      76.80
Native Language   Spanish   1203     72.98
                  English   1151     82.71

Table 1. Children performance (based on manual verification) across various demographics. Bold numbers indicate the difference in proportion is statistically significant (p

Label   Description             # of Entries   Examples
LS      English letter-sounds   45             v: /v/, /v ah/
PE      Perceptual confusions   43             m-n, f-s, c-z
SI      Sight confusions        21             b-d, p-q, o-c
SP      Spanish confusions      14             j: /hh ey/
SPLN    Spanish letter-names    28             d: /d ey/

Table 2. Description of the five unacceptable pronunciation categories, with the corresponding number of entries and examples.

3.3. Grammars

0.1).
However, the mean SNR for utterances in which the system erred (disagreed with the manual verification) was significantly lower than the mean SNR for utterances in which the system was correct (p<0.01). This implies that noise did not affect human evaluator agreement but adversely affected automatic verification performance.

$$ \mathrm{SNR} = 10 \log_{10} \left( \frac{\frac{1}{S} \sum_{s=1}^{S} A_s^2}{\frac{1}{N} \sum_{n=1}^{N} A_n^2} \right) \tag{1} $$

Partition of test data       # in test data   SNR Statistics [dB]
                                              mean     std. dev.
Inter-evaluator   agree      193              9.335    3.712
                  disagree   33               8.632    3.292
System            correct    648              9.623    3.533
                  error      42               7.810    3.796

Table 6. SNR statistics comparing the effect of noise on inter-evaluator agreement and system performance. Bold numbers mean the difference in means is statistically significant (p<0.01).

5.2. Comparison between letter-names and letter-sounds

According to the manual verification, children performed significantly better on the letter-name task (74.9% accepted) than the letter-sound task (72.2% accepted), with p<0.05. This is probably because all letter-names have a one-to-one mapping for their pronunciations, while many of the letter-sounds have alternative pronunciations depending on word context. The letter-sounds are also shorter and less natural to pronounce aloud, which may have been a factor in the letter-sounds having twice as many disfluencies (16.9%), a significant difference with p<0.05. Human agreement statistics for both tasks were nearly identical.

We found the same trends in our automatic verification performance for both the letter-name and letter-sound tasks, in that the baseline models were worse than models trained on in-domain data, with grammar 2 and the letter-specific dictionary providing the best results. English letter-name substitutions and alternative pronunciations were the most common categorical errors for the letter-sound task, with sight confusions and Spanish letter-name errors dominating the letter-name task. Overall, we attained higher verification accuracy on the letter-name task (93.08% accuracy), compared to the letter-sound task (87.95% accuracy), with p<0.01. We feel this difference is mostly due to the acoustic models. Whereas HMMs using MFCC features model letter-name phonemes well, they seem to be less suited for the more noise-like letter-sounds. Future research on letter-sound specific features will hopefully help bridge this gap.

6. CONCLUSION

We showed that we could accurately verify letter-name pronunciations through acoustic modeling at the phoneme-level. We achieved the best results using a dictionary optimized for each letter separately. Our final automatic system agreed with humans 93.1% of the time (0.813 kappa), nearing inter-evaluator agreement of 95.4% (0.891 kappa), and was unbiased with respect to the child’s grade, age, and native language. This system also performed significantly better than the one we previously developed to verify the more difficult letter-sounds [2]. In the future, we want to improve system performance in the presence of noise through improved acoustic modeling and/or by automatically detecting when there is too much background noise to reliably verify the utterance.

7. ACKNOWLEDGEMENTS

This work was supported in part by the National Science Foundation. Special thanks to Matthew Tan and Isaac Rottman for their help in transcribing the letter-name data.

8. REFERENCES

[1] National Reading Panel, “Teaching children to read: an evidence-based assessment of the scientific research literature on reading and its implications for reading instruction,” NICHD, NIH Publication 00-4769, Washington, DC, 2000.
[2] M. Black, J. Tepperman, A. Kazemzadeh, S. Lee, and S. Narayanan, “Pronunciation verification of English letter-sounds in preliterate children,” Proc. Interspeech, Antwerp, Belgium, 2007.
[3] M. Fanty and R.A. Cole, “Spoken letter recognition,” Advances in Neural Information Processing Systems 3, San Mateo, CA: Morgan Kaufmann, 1991.
[4] H. Hild and A. Waibel, “Recognition of spelled names over the telephone,” Proc. ICSLP, Philadelphia, PA, 1996.
[5] P.C. Loizou and A.S. Spanias, “High performance alphabet recognition,” IEEE Trans. Speech and Audio Processing, 4(6):430-445, 1996.
[6] M.E. Munich and Q. Lin, “Explicit modeling of common acoustic features for character recognition,” Proc. EUSIPCO, Vienna, Austria, 2004.
[7] S. Lee, A. Potamianos, and S. Narayanan, “Acoustics of children's speech: developmental changes of temporal and spectral parameters,” J. of Acoust. Soc. Am., 105:1455-1468, Mar. 1999.
[8] Tball. http://diana.icsl.ucla.edu/Tball/assess_frame.html.
[9] A. Alwan et al., “A system for technology based assessment of language and literacy in young children: the role of multiple information sources,” Proc. MMSP, Greece, 2007.
[10] A. Kazemzadeh, H. You, M. Iseli, B. Jones, X. Cui, M. Heritage, P. Price, E. Anderson, S. Narayanan, and A. Alwan, “Tball data collection: the making of a young children's speech corpus,” Proc. Eurospeech, Lisbon, Portugal, 2005.
[11] H. You, A. Alwan, A. Kazemzadeh, and S. Narayanan, “Pronunciation variations of Spanish-accented English spoken by young children,” Proc. Eurospeech, Lisbon, Portugal, 2005.
[12] Cambridge University, HTK 3.2, htk.eng.cam.ac.uk.