Evaluation of speech recognition

					        Performance-intensity functions for normal-hearing adults and children using CASPA

                                            DO NOT CITE

     Ryan McCreery1), Rindy Ito, Merry Spratford, Dawna Lewis, Brenda Hoover, and Patricia G.


                               Boys Town National Research Hospital

                                         555 North 30th Street

                                       Omaha, Nebraska 68131


          In order to fully understand the capabilities of the peripheral auditory system to process

speech, the influence of cognitive and linguistic factors in speech recognition tasks should be

minimized. Attempts to limit the variability in speech recognition related to developmental

factors include using lists of stimuli that are more likely to be in a child’s lexicon, such as

Phonetically Balanced Kindergarten word lists (Haskins, 1949) or using a closed-set response

task, such as the Word Intelligibility by Picture Identification test (Ross & Lerman, 1970).

Even when test materials are specifically developed for children, individual differences in lexical

knowledge have been shown to influence speech recognition (Sanderson-Leepa & Rintelmann,

1976). Additionally, the deleterious effects of noise on speech perception are greater for young

children than for adults (Nittrouer & Boothroyd, 1990; Johnson, 2000). Some researchers have

hypothesized that these age-related differences in performance are the result of a combination of

sensory, cognitive and linguistic factors (Elliot, 1979). Age-related differences in performance

confound interpretation of clinical speech recognition results with children.

 Ryan McCreery, Research Audiologist, Boys Town National Research Hospital, Omaha, NE.
       Several studies have attempted to quantify the contribution of linguistic content to the

speech recognition of children. Using speech materials that varied in syntactic and semantic

predictability, Nittrouer and Boothroyd (1990) evaluated speech recognition of consonant-vowel-

consonant (CVC) syllables and sentences in children ages 4 years 6 months to 6 years 6 months

as well as for a group of young adults and older adults. For syllables, half of the stimuli were

commonly occurring CVC words, while the other half were CVC nonsense syllables. Three

types of sentences were used in this study. Zero probability sentences were both syntactically

and semantically anomalous (e.g., girls white car blink). Low predictability sentences were

syntactically correct, but semantically anomalous (e.g., duck eats old tape), and high

predictability sentences had both normal syntactic structure and meaning (e.g., most birds can

fly). By comparing performance across stimuli with differing levels of semantic and syntactic

predictability, the relative contributions of each to speech recognition for children were evaluated.


       The results indicated that the best speech recognition for both adults and children

occurred when both syntactic and semantic cues were available. However, when comparing

performance for the low- and high-predictability sentences, children showed less improvement

for meaningful sentences than young adults. The researchers concluded that the performance

differences observed across these two groups were related to the fact that young children had

limited knowledge of semantic constraints. Performance on syntactically-preserved low-

predictability sentences was similar for adults and children, suggesting that children are able to

utilize syntactic information in short sentences to facilitate speech recognition when semantic

information is limited. In addition, although the overall word recognition scores for children

were poorer than those of the adults, when results were scored by phonemes, the differences between

children and young adults were not significant. These results suggest that the impact of lexical

knowledge on speech recognition may be minimized when children’s responses are scored by phonemes.


       Mackersie and colleagues (2001) have suggested that performance intensity (PI) functions

provide a more comprehensive picture of speech recognition than the typical clinical practice of

assessing performance at a single level. PI functions can be obtained in quiet or in the presence

of competing noise and have a variety of clinical applications in situations where a measure of

speech recognition at a single intensity level is inadequate. Some of the situations where PI

functions have been suggested for clinical use include: (1) a comparison of aided and unaided

speech recognition results with hearing aids and/or implantable devices, (2) evaluation of the

effect of hearing-aid settings on speech recognition, and (3) the dynamic range of speech

perception for individual listeners (Mackersie, Boothroyd, & Minniear, 2001). PI functions also

aid in the detection of retrocochlear hearing loss by testing for rollover effects at high intensities

(Jerger & Jerger, 1971). However, because the variability of speech recognition scores is influenced by

both the number of items correct and the number of stimuli (Thornton & Raffin, 1978),

obtaining a PI function with an adequate number of stimuli at each intensity requires a greater

time commitment than the currently used monosyllabic word list, single intensity-level approach.
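As an illustration of how rollover can be read from a PI function, the sketch below computes a rollover index as (PBmax - PBmin) / PBmax, where PBmin is the lowest score obtained at any level above the one that produced the maximum score. The function name and the example scores are hypothetical and are not data from this study.

```python
def rollover_index(levels_db, scores_pct):
    """Return (PBmax - PBmin) / PBmax for a PI function.

    PBmin is the lowest score at presentation levels above the level
    at which the maximum score (PBmax) was first reached.
    """
    # Sort the points by presentation level so "above" is well defined.
    points = sorted(zip(levels_db, scores_pct))
    scores = [score for _, score in points]
    pb_max = max(scores)
    if pb_max == 0:
        return 0.0  # no measurable recognition, so no rollover to report
    i_max = scores.index(pb_max)
    above = scores[i_max + 1:]
    pb_min = min(above) if above else pb_max
    return (pb_max - pb_min) / pb_max


# Hypothetical PI function: scores fall after peaking at 60 dB.
example = rollover_index([40, 50, 60, 70, 80], [40, 72, 90, 80, 54])
```

A function that rises monotonically to its maximum yields an index of 0, so larger values indicate a stronger drop in performance at high levels.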


       Although it has been well documented that PI functions for speech can provide a more

comprehensive picture of speech recognition ability than the current clinical practice of word

recognition at a single intensity level (Donaldson & Allen, 2003; Sherbecoe & Studebaker,

2002), time constraints have precluded the widespread clinical use of PI functions. According to

a survey of 276 audiologists, 98% normally include speech recognition tasks in routine hearing
evaluations (Martin, Armstrong, & Champlin, 1994). The most common speech recognition

tasks are the Speech Reception Threshold (SRT) and monosyllabic word recognition at a single

intensity in quiet. Given the time limitations in typical clinical settings, word recognition is

frequently only measured at a single intensity (typically 40 dB above the SRT). While it is often

assumed that this represents the maximum word recognition score (i.e., PB max), Kamm,

Morgan & Dirks (1983) observed that PB max estimates obtained at 40 dB sensation level only

reflected maximum word recognition in 60% of their subjects.

       To obtain valid PI functions in less time, Boothroyd (1999) developed the Computer

Aided Speech Perception Assessment (CASPA). In this computer-based speech recognition test,

lists of 10 CVC words are presented over a range of intensities and the tester enters subject

responses. The software automatically scores results in terms of words, phonemes, consonants,

and vowels correct, and generates separate PI functions for each analysis. CASPA can be used

to obtain a PI function in less than five minutes. Data can be obtained under earphones or in the

sound-field and in quiet or with competing noise. Phonemic scoring offers several advantages

compared to scores based on the number of words correct (Markides, 1978; Gelfand, 1998).

Specifically, scoring speech recognition tasks phonemically increases the number of data points

in CVC words by a factor of three, which decreases variability and improves interpretation of

small differences in performance (Gelfand, 2003). While several studies have evaluated the use

of phonemic scoring with adults, few have evaluated the effects of this approach with children.

Some studies have suggested that phonemic scoring might reduce the influence of differences in

linguistic knowledge on speech recognition for young children (Gelfand, 1998), but this

hypothesis has not been directly evaluated in previous studies.
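The effect of tripling the number of scored items can be sketched with a simple binomial model, in the spirit of Thornton and Raffin (1978). The example assumes that every item is scored independently with the same probability of being correct. Phonemes within a word are not fully independent, so the real reduction in variability is likely to be somewhat smaller than shown. The function name and the 70% example score are illustrative.

```python
import math


def binomial_sd_pct(p_correct, n_items):
    """Standard deviation, in percentage points, of a percent-correct score
    under a binomial model with n_items independently scored items."""
    return 100.0 * math.sqrt(p_correct * (1.0 - p_correct) / n_items)


sd_words = binomial_sd_pct(0.70, 10)     # one list scored as 10 words
sd_phonemes = binomial_sd_pct(0.70, 30)  # same list scored as 30 phonemes
ratio = sd_words / sd_phonemes           # equals sqrt(3) for any p_correct
```

Under these assumptions, phoneme scoring reduces the standard deviation of a single-list score by a factor of about 1.7.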
       Normative PI functions have been developed for normal-hearing adults under three

conditions: binaural sound-field presentation in quiet, monaural headphone presentation in quiet

and monaural headphone presentation in noise. At present, normative PI functions using the

CASPA software are not available for children. Since CASPA word lists were developed for

adults, some test items may not be within the lexicon of young children (e.g., vice, wedge, teak,

siege, poach, rove, laze). Given the potential utility of CASPA as a speech recognition test for

children, any age-related differences in performance on the task would influence interpretation of

results. The goals of the current investigation were to (1) quantify potential differences in PI

functions in noise between children and adults with normal hearing, (2) determine whether CASPA is a

clinically feasible tool for the audiological assessment of young children, (3) identify the age at

which normal-hearing children achieve speech recognition comparable to adults, and (4) compare

speech recognition for the various scoring methods available in CASPA.



Forty-eight normal-hearing children between the ages of 5 years, 0 months and 12 years, 7

months with no history of speech and language concerns participated in this study. Each child

was assigned to one of four age groups: 5-6 year-olds, 7-8 year-olds, 9-10 year-olds and 11-12

year-olds. Each group consisted of 12 children (6 male: 6 female). The Bankson-Bernthal

Quick Screening of Phonology (Bankson & Bernthal, 1990) was administered to exclude

children with significant speech production errors. The adult group consisted of 12

normal-hearing adults (6 male: 6 female) between the ages of 24 and 37 years. The native language of

all participants was English. Immediately prior to participation in the study, hearing was

screened at 15 dB HL bilaterally at octave frequencies from 250 Hz through 8000 Hz using a
Grason-Stadler GSI-61 clinical audiometer. Participants who did not pass the screening at all

frequencies were excluded from the study.


       The CASPA (Boothroyd, 1999) consists of 20 lists with 10 isophonemic CVC words

per list. Each list contains one instance of the same set of 30 phonemes. The CASPA software

allows the experimenter to control the presentation level and presents the carrier phrase (Please

say the word ________ ) prior to each stimulus. In the version of CASPA utilized for this study

(Version 3.3), the stimuli are spoken by a female.


       The CASPA software was installed on a personal computer that controlled the stimulus

level and recorded each subject’s responses. Stimuli were generated by a Sound Max Digital

PCI sound card and were routed to a Sony SMS-1P powered monitor speaker.


       Participants were tested in a sound-treated booth. Results were obtained in a single

session of 30 to 45 minutes duration. Participants were seated at a calibrated distance

(approximately 2 feet) from the loudspeaker and instructed to repeat the word presented through

the speaker. All subjects were encouraged to guess if they were not sure what they heard.

Additionally, because some of the words were likely to be outside the lexicon of some children

in the study, children were told that some of the words would be real and others would be “made

up” and that they should repeat each word as heard, even if it did not sound like a word that they

knew. This procedure was used to deter children from responding with a real word that sounds

similar to the target word (e.g., responding with “room” for the word “womb”).
       Speech-shaped noise was presented continuously at 55 dB SPL for all stimulus

presentations. The level of speech varied systematically for each list,

resulting in the signal-to-noise ratios (SNR) shown in Table 1. The specific word lists presented

to each subject were varied randomly by the software, but the order of SNR was held constant

across subjects.

Table 1. Speech level and resulting SNR for each list (noise fixed at 55 dB SPL).

Resulting SNR (dB)            0           +10         -5          +5         -10         +15
Speech Level (dB SPL)         55          65          50          60         45          70
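Because the noise level was fixed, each speech level follows directly from the target SNR. A minimal sketch, using the 55 dB SPL noise level stated above; the constant, function, and variable names are illustrative.

```python
NOISE_LEVEL_DB_SPL = 55  # speech-shaped noise level stated in the text


def speech_level_for_snr(snr_db, noise_db=NOISE_LEVEL_DB_SPL):
    """Speech level (dB SPL) that yields the requested SNR in fixed noise."""
    return noise_db + snr_db


# SNR conditions in the order listed in Table 1.
snr_order = [0, 10, -5, 5, -10, 15]
speech_levels = [speech_level_for_snr(snr) for snr in snr_order]
```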

An examiner entered the participants’ responses into the CASPA software via the keyboard.

Data analysis and scoring of responses also occur via the software. Specifically, results are

scored by percentage of words, phonemes, consonants (initial, final, and total) and vowels

correct. Feedback as to the correctness of the response was not provided to the subjects. For

some of the younger participants, an adult sat with the child during the evaluation to help

maintain the child’s attention on the task. All participants were paid for their participation, and

children were also provided with a toy and a book.
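The scoring categories described above can be sketched as follows. This is not the CASPA implementation: the phoneme transcriptions, the function name, and the rule that a word counts as correct only when all three of its phonemes are correct are assumptions made for illustration.

```python
def score_cvc_list(targets, responses):
    """Score CVC responses against targets.

    Each item is an (initial consonant, vowel, final consonant) tuple of
    phoneme symbols. Returns percent correct for each scoring category.
    """
    n = len(targets)
    initial = vowel = final = words = 0
    for target, response in zip(targets, responses):
        hits = [t == r for t, r in zip(target, response)]
        initial += hits[0]
        vowel += hits[1]
        final += hits[2]
        words += all(hits)  # assumed rule: whole word needs all 3 phonemes
    return {
        "words": 100.0 * words / n,
        "phonemes": 100.0 * (initial + vowel + final) / (3 * n),
        "consonants": 100.0 * (initial + final) / (2 * n),
        "initial_consonants": 100.0 * initial / n,
        "final_consonants": 100.0 * final / n,
        "vowels": 100.0 * vowel / n,
    }


# Hypothetical two-item example with made-up transcriptions:
# "womb" repeated as "room", and "wedge" repeated correctly.
targets = [("w", "u", "m"), ("w", "E", "dZ")]
responses = [("r", "u", "m"), ("w", "E", "dZ")]
scores = score_cvc_list(targets, responses)
```

In this example a single initial-consonant error costs half of the word score but only one sixth of the phoneme score, which is the kind of difference between scoring methods that the analyses below examine.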


All percentage correct speech recognition results were converted to Rationalized Arcsine Units

(RAU; Studebaker, 1985) for statistical analyses. A three-way mixed factorial ANOVA was

conducted with SNR (-10, -5, 0, +5 dB) and Scoring Method (word, phoneme) as within-subjects

variables and Age Group as a between-subjects variable. Although figures show results for vowel

and consonant scoring, only word and phoneme scores were included in the ANOVA since

comparison of these scoring methods is often used in clinical applications of CASPA. In

addition, data at the +10 dB and +15 dB SNR conditions were excluded from the ANOVA due to
consistently high performance. Post-hoc comparisons were made using Bonferroni adjustment

for multiple comparisons (p < 0.05).
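For reference, the rationalized arcsine transform can be sketched as below, using the formula commonly attributed to Studebaker (1985): theta = arcsin(sqrt(X / (N + 1))) + arcsin(sqrt((X + 1) / (N + 1))) and RAU = (146 / pi) * theta - 23, where X is the number of items correct out of N. The function name and the choice of N in the example (30 phonemes per list) are assumptions; the text does not state which item counts were used in the conversion.

```python
import math


def rau(num_correct, num_items):
    """Rationalized arcsine units for num_correct out of num_items."""
    theta = (math.asin(math.sqrt(num_correct / (num_items + 1)))
             + math.asin(math.sqrt((num_correct + 1) / (num_items + 1))))
    return (146.0 / math.pi) * theta - 23.0


mid_score = rau(15, 30)   # 50% correct maps to approximately 50 RAU
floor_score = rau(0, 30)  # 0% correct maps below 0 RAU
ceil_score = rau(30, 30)  # 100% correct maps above 100 RAU
```

The transform leaves mid-range scores nearly unchanged and stretches scores near 0% and 100%, which makes the variance more uniform across the range before the ANOVA is applied.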

        The three-way interaction for SNR by Scoring Method by Age Group was not

statistically significant (F(12,165) = 0.701, p = 0.749; ηp² = 0.049). The two-way interaction

between SNR and Age Group also was not significant (F(12,165) = 1.437, p = 0.154; ηp² =

0.095). The two-way interaction between SNR and Scoring Method was significant (F(3,165) =

40.914, p < 0.001; ηp² = 0.427). Post hoc comparisons using Bonferroni correction indicated

that mean differences in performance between words and phonemes were significant at SNR

conditions from -10 to +5 dB (p < 0.05). The two-way interaction between Scoring Method and

Age Group also was statistically significant (F(4,55) = 4.297, p = 0.004; ηp² = 0.155). A post

hoc analysis using Bonferroni correction revealed that mean differences between word and

phoneme scores were significant for all age groups (p < 0.05). The main effect of SNR was

statistically significant (F(3,165) = 172.610, p < 0.001; ηp² = 0.758), with a trend for increasing

performance as SNR increased. The main effect of Scoring Method was also significant

(F(1,55) = 869.185, p < 0.001; ηp² = 0.94), with phoneme scoring yielding higher speech

recognition than word scoring. The main effect of Age Group was also significant (F(4,55) =

6.357, p < 0.001; ηp² = 0.316). Post hoc analysis indicated a significant difference between the

5-6 year-old group and the adult group, and between the 7-8 year-old group and the adult group. All other age

group differences in speech recognition were not significant. A separate one-way ANOVA

comparing initial and final consonant errors was not statistically significant (F(1,59) = 0.006,

p = 0.939; ηp² = 0.018), suggesting that there was no tendency for subjects to commit errors related

to initial or final consonant position.
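The partial eta-squared effect sizes reported above are related to the F statistics through the standard identity (F x df_effect) / (F x df_effect + df_error). A minimal sketch, using the F value reported above for the effect of SNR as the example; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared computed from an F statistic and its
    effect and error degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Effect of SNR as reported in the text: F(3,165) = 172.610
eta_snr = partial_eta_squared(172.610, 3, 165)
```

Rounded to three decimals, this gives 0.758, matching the value reported for SNR.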
       Figure 1 shows speech recognition (in percentage) as a function of SNR for the four

different scoring methods. The filled squares and hatched area represent the mean and 95%

confidence intervals for adult listeners, respectively. The various open symbols represent

children’s mean performance by age group. Performance for all age groups was highest for

vowels, followed by phonemes (consonants and vowels combined), consonants, and words.

Performance increased as both age and SNR increased with differences in scoring method

creating the significant interactions.

       Figure 2 shows individual percentage correct data for phonemes as a function of SNR for

the children across the four age groups. Figure 3 shows similar data for words. As in Fig. 1, the

hatched area shows the 95% confidence intervals for the adult group. Performance for words

was more varied for each age group than performance for phonemes. There was a trend for
decreasing variability for words correct as SNR and age group increased.
       Figure 4 compares performance for phoneme versus word scoring by age group. The

differences in speech recognition for words were significant only between the adult group and

the 5-6 year-olds and 7-8 year-olds. Differences in speech recognition for phoneme scoring were

not significant across age groups.


       The main finding of the current study was that the CASPA can be reliably used

to assess speech recognition in children as young as 5 years old. Scoring CASPA results by the

number of words correct resulted in performance that was poorer for all age groups compared to

scoring by phonemes correct. When the effects of scoring method were evaluated across age

groups, there were no significant differences between any age groups when phoneme scoring
was utilized. Differences in speech recognition for word scoring followed the pattern of the

interaction between scoring method and age, where speech recognition was only significantly

different between adults and the two youngest age groups. Speech recognition increased as SNR increased.


       Overall, speech recognition in noise for adults in the present study was similar to that

reported for a previous study using CASPA (Mackersie et al., 2001). The mean percentage of

phonemes correct in the previous study for adults at 0 dB SNR was 74%, while the percentage

correct for adults at 0 dB SNR in the current study was 79%. Mackersie and colleagues

did not evaluate the other SNRs used in the current study, so a comparison at those

SNRs is not possible.

       On average, speech recognition in noise was poorer for young children than for adults.

However, when results were analyzed by age, significant differences were only observed

between the adults and the two youngest groups of children. These results suggest that children

reach adult levels of speech recognition by 9 or 10 years of age on the CASPA, regardless of

the scoring method utilized. Previous studies have attempted to evaluate differences in speech

recognition as a function of age, but results varied considerably depending on the age of subjects,

the task, and stimuli. Fallon and colleagues (2000) used SPIN sentences in multi-talker babble to

evaluate speech perception differences in noise and found no differences between adults and

three groups of children ages 5 years, 9 years and 11 years. The pattern of similar performance

for children in their study was consistent regardless of whether the stimuli were high-

predictability or low-predictability sentences. The availability of syntactic information in their

sentence stimuli may have improved performance for the children in their study, but differences

in the SNR and stimuli make direct comparisons of performance across studies difficult.
       In contrast, Nittrouer and Boothroyd (1990) found that, at SNRs of 0 dB and 3 dB,

children between 4 ½ years and 6 ½ years had poorer performance for individual words and

nonsense syllables than a group of young adults. When the current results for the 5-6 year-old

group at 0 dB SNR are compared to results obtained at the same SNR for children of similar age

in Nittrouer and Boothroyd’s study, phoneme recognition is similar for these age groups (64% in

the current study vs. 68% in the previous study). Therefore, the results of the current study

suggest phoneme-correct performance similar to that observed in a previous study of speech recognition in children.


       Differences in performance were observed between word and phoneme scoring in the

current study. When CASPA results are scored based on the number of words correct, children

in the 5-6 year-old and 7-8 year-old groups had significantly poorer scores than the adult group,

suggesting that some of the words in the CASPA lists may be outside the lexicon of young

children. Performance-intensity functions for phonemes did not vary between adults and

children, which suggests that CASPA can be used to assess speech recognition and obtain

reliable performance-intensity functions in children as young as 5 years old, if phonemic scoring

is used. It is important to note that the instructions given to children in the current study were

intended to invoke faithful repetition of words, regardless of familiarity for individual children.

Scoring results phonemically also resulted in reduced variability among subjects as a function of

age. When results were scored by the number of phonemes correct, scores for adults and

children were more similar, with most children performing within the 95% confidence

intervals for adult performance shown in Figure 2. This suggests that phoneme scoring can

reduce age-related differences in performance on this task. Differences between adults and

young children (4 ½ to 6 ½ years) for word scores, but not for phoneme scores, were also

reported by Nittrouer and Boothroyd (1990). They suggested that phoneme scoring may

minimize performance differences associated with the use of lexical knowledge and thus reflect

more elemental speech recognition skills than word scoring.

       CASPA has several significant advantages over traditional speech recognition tests that

make it appealing to use with children. Obtaining a PI function with CASPA can be completed

in less than 10 minutes, and provides information about speech recognition over a range of

intensities. Characterizing performance at different input levels can be useful for optimizing

compression characteristics for hearing aid use, as well as comparing aided vs. unaided speech

recognition. Similar data for children with hearing loss and cochlear implants are needed to

determine the age at which phonemic scoring can be used to limit age-related variability in

speech recognition.


This study was supported by grants from the National Institute on Deafness and Other

Communication Disorders (R01 DC04300, P30 DC-4662 and T35 DC008757).