Shhhhh…

Why do people whisper?
• People whisper when they’re telling others a scandalous secret…
• People whisper when they’re asked to speak softly so as not to disturb others (remember your elementary school librarian?)…
• People whisper when they’re too weak to speak normally…
• People whisper when… (can you think of other reasons?)
It seems that whispering is the most effective and efficient form of vocal communication when only people within a very short range of the speaker should hear the speech.

So exactly what is whispering, or whisper speech?
• Is it just a softer, less intense version of regular speech?
• Why is it harder to understand whisper speech, even when it is spoken right next to your ear?
• Would it be easier or more difficult to build a speech recognizer for whisper speech?
• Can different voices be recognized in whisper speech?
• How is a word stressed, or emphasized, in whisper speech?
In an attempt to answer these questions, we have…

The Experiment
Four subjects are asked to:
• Speak 10 medium-length sentences (5 to 12 words) as naturally and as clearly as possible.
• Repeat the 10 sentences, but this time in whisper speech.
The first 9 sentences cover each of the phonemes in American English at least once. The 10th sentence is repeated three times (in both regular speech and whisper speech), each time with a different word stressed.

The Subjects
This experiment was made possible by four dear volunteers. They are:
• A young female native English speaker from the Midwest
• A young male native English speaker from eastern Canada
• A young male native English speaker from the Southeast
• A young male native English speaker from Texas
These four subjects should give us a good idea of the differences between regular and whisper speech in North American English.
Without further delay, let us look at the results…

General Appearance of the Spectrograms
Spectrogram of the phrase “stole my house” in regular speech
Spectrogram of the phrase “stole my house” in whisper speech
At first glance, the spectrograms of whisper speech look like a string of fricative noises. Whisper speech is clearly much less intense than regular speech, which would explain why it takes less energy to whisper than to speak. Now let’s take a closer look at what happens to each type of phoneme when we whisper, starting with vowels…

Vowels – What’s all that hissing noise?
“I lied a lot on Saturday” in whisper speech
“Chang is not a China man” in whisper speech
Regular “stole my house” again, but this time notice the HH
Whisper “stole my house” again; can you tell where HH starts and stops?
A closer look at the vowels shows us something interesting: they all look like HH’s! HH is a very “transparent” phoneme: it does not warp the vowels around it. In fact, vowels seem to “pass through” HH, because we can still make out their formants. So it seems that all the whisper vowels are just HH’s with different vowels passing through. Can you guess what the word “is” would sound like in whisper speech? Did you notice something peculiar about the formants?
“The boy will eat oat, pit, or soot…”
“…but only in small doses.”
A second look shows us that the low f1 of vowels seems to disappear entirely, which is also an attribute of HH’s. Fortunately, we can infer a low f1 on a whisper spectrogram from its absence, and f2 and f3 are good enough indicators of labial, velar, and dental phonemes. But how about voicing?
Isn’t f1 going down usually an indicator of voicing? Let’s look at the voicing for…

Fricatives and Stops – Why we don’t say “bzzzd…”
“The fish thief stole my house”
“Can I pay tickets with tacos and pork?”
The whisper fricatives and stops seem to be relatively easy to spot in the spectrogram, just as in regular speech. Now let’s take a look at the voiced fricatives and stops…
“The very vexed zebra” in regular speech
“The very vexed zebra” in whisper speech
“Beat the good dog, boy!” in whisper speech
What happened? The voiced fricatives and stops look just like their unvoiced counterparts: it seems they have lost their voicing! So how do we hear things like “dog” and “zebra”? We rely on high-level knowledge. If we play the whispered voiced consonant by itself, we can hear that the unvoiced version is what was actually pronounced!
Fricatives and stops are relatively easy to spot in a whisper spectrogram, but voiced and unvoiced ones are hard to tell apart, which is exactly the opposite of…

Nasals – Barely there
“Chang is not a China man”
It seems nasals follow suit with the other phonemes: no voice bars and no low f1 formants. Additionally, nasals are so faint that they almost look like pauses. However, we can see from the spectrogram that it isn’t difficult to identify which nasal it is: the formants go up for N, go down for M, and show a velar pinch for NG. What about liquids and glides? They actually behave pretty well in whisper speech; identifying them is usually easier.

Liquids and Glides
“Look, you wet your red leather boots!”

Try this at home!
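The missing voice bar can also be checked numerically. The following is a minimal sketch, not the procedure used in this experiment: it measures the fraction of a frame’s spectral energy at or below a cutoff, which is where a voice bar would sit. The 300 Hz cutoff, the 8 kHz sample rate, and both synthetic frames (a 120 Hz harmonic tone standing in for a voiced sound, white noise standing in for a whispered one) are assumptions made for illustration; real recordings would replace them.

```python
import math
import random


def low_band_energy_ratio(frame, sample_rate, cutoff_hz=300.0):
    """Fraction of spectral energy at or below cutoff_hz (DC bin excluded)."""
    n = len(frame)
    low = 0.0
    total = 0.0
    for k in range(1, n // 2):
        # Plain DFT of bin k: slow, but needs no external libraries.
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        power = re * re + im * im
        total += power
        if k * sample_rate / n <= cutoff_hz:
            low += power
    return low / total if total > 0 else 0.0


SAMPLE_RATE = 8000  # assumed sample rate in Hz
FRAME_LEN = 256     # 32 ms analysis frame at 8 kHz

random.seed(0)  # make the noise frame reproducible

# Hypothetical voiced frame: harmonics of 120 Hz with 1/h amplitudes.
voiced = [
    sum(math.sin(2 * math.pi * 120 * h * i / SAMPLE_RATE) / h for h in range(1, 11))
    for i in range(FRAME_LEN)
]

# Hypothetical whispered frame: white noise, i.e. no periodic source at all.
whispered = [random.gauss(0.0, 1.0) for _ in range(FRAME_LEN)]

voiced_ratio = low_band_energy_ratio(voiced, SAMPLE_RATE)
whisper_ratio = low_band_energy_ratio(whispered, SAMPLE_RATE)
print(f"low-band energy ratio, voiced stand-in:    {voiced_ratio:.2f}")
print(f"low-band energy ratio, whispered stand-in: {whisper_ratio:.2f}")
```

On these stand-in signals the voiced frame keeps most of its energy in the low band, while the noise frame spreads it across the whole spectrum. Real whisper is shaped noise rather than white noise, so the numbers would differ on actual recordings.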
Now that we have gone through the different types of phonemes, we can compile our results:
• Vowels resemble HH’s
• Voiced fricatives and stops lose their voicing
• Nasals become faint but can be differentiated
• Liquids and glides do not change much
• Much high-level knowledge is required to recognize whisper speech
We can do a little test to demonstrate this…

F0 and Pitch
What sort of f0 and pitch does whisper speech have? (Can you guess?) First, we can try having the Emu Labeler do the pitch analysis for us…
Pitch analysis for “Somebody set up us the bomb!” (stress on “us”)
It seems that the Emu Labeler has failed us (not too surprisingly). But that’s alright; we can still do it ourselves. Let’s turn the broadband spectrograms into narrowband spectrograms…
“Somebody set up us the bomb!” (stress on “us”) Bandwidth=70
“Somebody set up us the bomb!” (stress on “us”) Bandwidth=40
“Somebody set up us the bomb!” (stress on “us”) Bandwidth=20
Even as we make the bandwidth smaller and smaller, we still cannot make out the f0. But since pitch is so important for stressing and emphasizing parts of speech, how is that done in whisper speech?
“Somebody set up us the bomb!” (stress on “somebody”)
“Somebody set up us the bomb!” (stress on “us”)
“Somebody set up us the bomb!” (stress on “bomb”)
As you may have expected, because they cannot change the pitch, speakers use the other two methods, more energy and longer duration, to emphasize what they want to stress in whisper speech. Try singing in a whisper… can you do it?

One Last Thought – Variability in Whisper Speech
One thing we notice throughout the experiment is that many characteristics of regular speech are lost in whisper speech.
On the other hand, some variability factors such as age, regional accent, and emotion may also be reduced to some extent in whisper speech.
Which speaker whispered the sentence at the bottom?
Speaker A: “Chang is not a China man.” in whisper
Speaker B: “They treasured the very vexed zebra.” in whisper
Now can you tell?
Speaker A: “Chang is not a China man.” in regular speech
Speaker B: “They treasured the very vexed zebra.” in regular speech
It seems that whisper speech forces the speech to lose some of its variability. Can you guess anything about the speaker from this speech? (sex, age, nationality, region, the person?)
“The fish thief stole my house.” in whisper speech
“The fish thief stole my house.” in regular speech

Conclusion
• Whisper speech introduces more ambiguity into speech; therefore, the recognition of whisper speech requires much high-level knowledge.
• There are no detectable pitch dynamics in whisper speech.
• Whisper speech seems to reduce some variability in speech.
Would we ever need automatic speech recognition for whisper speech?
• For use in quiet places (library)
• For people with speech difficulty (throat cancer)
• Can you think of others? (secret agent watch?)
Would it be more difficult than automatic speech recognition for regular speech?
• More ambiguity
• Need more high-level language modeling
• Less variability?
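
The conclusion about pitch can be illustrated with a small periodicity check. This is a sketch under stated assumptions, not the analysis used in the experiment: it takes the peak of the normalized autocorrelation over a range of candidate pitch periods, which is one common way to look for f0. The 75 to 400 Hz search range, the 8 kHz sample rate, and both synthetic frames (a 125 Hz harmonic tone standing in for regular speech, white noise standing in for whisper) are assumptions made for illustration.

```python
import math
import random


def periodicity(frame, sample_rate, f0_min=75.0, f0_max=400.0):
    """Return (peak normalized autocorrelation, f0 candidate in Hz).

    A peak near 1 suggests a clear pitch period; a peak near 0 suggests none,
    in which case the returned f0 candidate is not meaningful.
    """
    mean = sum(frame) / len(frame)
    x = [v - mean for v in frame]
    energy = sum(v * v for v in x)
    if energy == 0:
        return 0.0, 0.0
    shortest_lag = int(sample_rate / f0_max)
    longest_lag = int(sample_rate / f0_min)
    best_r, best_lag = 0.0, 0
    for lag in range(shortest_lag, longest_lag + 1):
        r = sum(x[i] * x[i + lag] for i in range(len(x) - lag)) / energy
        if r > best_r:
            best_r, best_lag = r, lag
    f0 = sample_rate / best_lag if best_lag else 0.0
    return best_r, f0


SAMPLE_RATE = 8000  # assumed sample rate in Hz
FRAME_LEN = 512     # 64 ms analysis frame at 8 kHz

random.seed(1)  # make the noise frame reproducible

# Hypothetical regular-speech frame: harmonics of 125 Hz with 1/h amplitudes.
regular = [
    sum(math.sin(2 * math.pi * 125 * h * i / SAMPLE_RATE) / h for h in range(1, 11))
    for i in range(FRAME_LEN)
]

# Hypothetical whisper frame: white noise, with no periodic source.
whisper = [random.gauss(0.0, 1.0) for _ in range(FRAME_LEN)]

regular_peak, regular_f0 = periodicity(regular, SAMPLE_RATE)
whisper_peak, _ = periodicity(whisper, SAMPLE_RATE)
print(f"regular stand-in: peak={regular_peak:.2f}, f0 candidate={regular_f0:.1f} Hz")
print(f"whisper stand-in: peak={whisper_peak:.2f} (no reliable f0)")
```

With no periodic source, the autocorrelation has no strong peak at any candidate period, which is consistent with a pitch tracker failing on whisper speech. Stress then has to be read from energy and duration instead.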