CSE - 503: Natural Language Processing
HMMs AND SPEECH RECOGNITION


Three question sets on HMMs and Speech Recognition
(numbers at the right indicate marks).

Full Marks: 45
Set-1
   a) Mention the key difference between 'Continuous Speech' and 'Isolated-word
        Speech'.                                                         2
   b) Draw and describe schematic architecture of a Speech Recognizer.        5
   c) Give the formal definition of a Simple Markov Model with example. 4
   d) Define Hidden Markov Model (HMM) with proper example.               4
Set-2
   a) Write down the Viterbi Algorithm for finding the optimal sequence of
        states in Continuous Speech Recognition.                           5
   b) What are the main limitations of Viterbi Decoder?                    3
   c) What do you mean by word lattice? Give proper example.              4
   d) What do you mean by “Tree Structure Lexicon”? Give proper
        example.                                                          3
Set-3
a) What are the important characteristics of a sound wave? What are the
    perceptual properties of a sound wave?                               2+2
b) What is Spectrum? What method is used to compute it?              1+1
c) Briefly describe feature extraction process of a speech signal.        3
   d) Define Vector Quantization. Briefly describe Concatenative
    Synthesis.                                                       1+5
Answer of question set-1
a) Mention the key difference between 'Continuous Speech' and 'Isolated-word
Speech'.
Ans: The term continuous means that the words are run together naturally.
      Example: Rahim plays football.
In isolated-word speech recognition each word must be preceded and followed by a
pause.
      Example: Rahim (pause) plays (pause) football.


b) Draw and describe schematic architecture of a Speech Recognizer.
Ans: Figure (1) shows an outline of the components of a speech recognition system. The
figure shows a speech recognition system broken down into three stages.

Feature extraction: In the signal processing or feature extraction stage, the acoustic
waveform is sliced up into frames (usually of 10, 15, or 20 milliseconds), which are
transformed into spectral features that indicate how much energy the signal carries
at different frequencies.

Subword or phone recognition: In the subword or phone recognition stage, we use
statistical techniques like neural networks or Gaussian models to tentatively recognize
individual speech sounds like [p] or [b]. For a neural network, the output of this stage is a
vector of probabilities over phones for each frame (i.e. 'for this frame the probability of
[p] is .8, the probability of [b] is .1, the probability of [f] is .02, etc.'); for a Gaussian
model the probabilities are slightly different.

Decoding: Finally, in the decoding stage, we take a dictionary of word pronunciations
and a language model (probabilistic grammar) and use a Viterbi or A* decoder to find the
sequence of words which has the highest probability given the acoustic events.




Fig(1): Schematic architecture for a (simplified) Speech Recognizer
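
These three stages can be pictured as a simple pipeline. The skeleton below is only a
sketch of the data flow; every function name and data shape in it is hypothetical rather
than part of any real recognizer:

    # Minimal sketch of the three-stage data flow described above.
    # All function names and data shapes are hypothetical, not a real API.

    def feature_extraction(waveform):
        """Slice the waveform into 10-20 ms frames and return one
        spectral feature vector per frame."""
        ...

    def phone_recognition(feature_vectors):
        """Return, for each frame, a probability distribution over
        phones, e.g. {'p': 0.8, 'b': 0.1, 'f': 0.02, ...}."""
        ...

    def decode(phone_probs, lexicon, language_model):
        """Use a Viterbi or A* decoder to find the word sequence with
        the highest probability given the acoustic evidence."""
        ...

    def recognize(waveform, lexicon, language_model):
        return decode(phone_recognition(feature_extraction(waveform)),
                      lexicon, language_model)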
c) Give the formal definition of a Simple Markov Model with example.
Ans: A Simple Markov Model (SMM), or weighted automaton, consists of:
      1. a sequence of states q = (q1, q2, ..., qn), each corresponding to a phone;
      2. a set of transition probabilities between states, a01, a12, a13, ..., encoding the
probability of one phone following another;
      3. a set of observation likelihoods B = bi(ot), where bi(ot) = 1 if state i matches the
observation ot, and bi(ot) = 0 otherwise.




            Fig(2): A simple weighted automaton or Markov chain pronunciation
            network for the word need, showing the transition probabilities and a
            sample observation sequence. The transition probabilities axy between
            two states x and y are 1.0 unless otherwise specified.
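
For illustration, the Fig(2) network can be written out as a toy Python structure. The
state names follow the pronunciation [n iy d]; since all transition probabilities here are
1.0 and the observation likelihoods are 0/1 matches, the path probability for the matching
observation sequence comes out to 1.0:

    # Toy weighted automaton (Markov chain) for the word "need".
    transitions = {
        "start": {"n": 1.0},
        "n":     {"iy": 1.0},
        "iy":    {"d": 1.0},
        "d":     {"end": 1.0},
    }

    def b(state, obs):
        """Observation likelihood for the SMM case: 1 if the state
        matches the observed phone symbol, 0 otherwise."""
        return 1.0 if state == obs else 0.0

    # Probability of observing [n iy d] along the path n -> iy -> d:
    path = ["n", "iy", "d"]
    obs = ["n", "iy", "d"]
    p = transitions["start"][path[0]]
    for i, (s, o) in enumerate(zip(path, obs)):
        p *= b(s, o)
        if i + 1 < len(path):
            p *= transitions[s][path[i + 1]]
    print(p)  # 1.0 for this deterministic toy network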

d) Define Hidden Markov Model (HMM) with proper example.

Ans: An HMM is formally defined by the following components:
      1. States: A set of states Q = q1, q2, ..., qN.
      2. Transition probabilities: A set of probabilities A = a01, a02, ..., an1, ..., ann. Each aij
represents the probability of transitioning from state i to state j. The set of these forms the
transition probability matrix A.
      3. Observation likelihoods: A set of observation likelihoods B = bi(ot), each
expressing the probability of an observation ot being generated from a state i.




                 Fig(3): An HMM pronunciation network for the word need, showing
                 the transition probabilities, and a sample observation sequence
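
The only change from the weighted automaton above is that the 0/1 observation function
becomes a table of genuine likelihoods. Continuing the toy example (all numbers are
invented for illustration):

    # Replace the 0/1 match function of the simple Markov model with
    # real observation likelihoods bi(ot); all values are invented.
    B = {
        "n":  {"n": 0.8, "m": 0.15, "ng": 0.05},
        "iy": {"iy": 0.9, "ih": 0.1},
        "d":  {"d": 0.7, "t": 0.2, "dx": 0.1},
    }

    def b(state, obs):
        return B[state].get(obs, 0.0)  # P(state generates observation)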
Answer of question set-2
a) Write down the Viterbi Algorithm for finding the optimal sequence of
states in Continuous Speech Recognition.
Ans: The Viterbi algorithm for finding the optimal sequence of states in continuous
speech recognition is given below:




           Fig(4): Viterbi algorithm for finding the optimal sequence of states in
           continuous speech recognition, simplified by using phones as inputs.
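
Since Fig(4) does not reproduce the algorithm itself, here is a minimal sketch of a
Viterbi decoder over phone inputs, in the spirit of the simplified version the figure
describes. The toy HMM and all of its probabilities are invented for illustration:

    def viterbi(observations, states, start_p, trans_p, emit_p):
        """Dynamic-programming search for the most probable state sequence."""
        V = [{}]     # V[t][s]: probability of best path ending in state s at time t
        back = [{}]  # backpointers for recovering that path
        for s in states:
            V[0][s] = start_p.get(s, 0.0) * emit_p[s].get(observations[0], 0.0)
            back[0][s] = None
        for t in range(1, len(observations)):
            V.append({})
            back.append({})
            for s in states:
                best_prev, best_p = None, 0.0
                for prev in states:
                    p = V[t - 1][prev] * trans_p[prev].get(s, 0.0)
                    if p > best_p:
                        best_prev, best_p = prev, p
                V[t][s] = best_p * emit_p[s].get(observations[t], 0.0)
                back[t][s] = best_prev
        # Trace back from the most probable final state.
        last = max(V[-1], key=V[-1].get)
        path = [last]
        for t in range(len(observations) - 1, 0, -1):
            path.append(back[t][path[-1]])
        return V[-1][last], list(reversed(path))

    # Toy HMM for the word "need"; all probabilities are invented.
    states = ["n", "iy", "d"]
    start_p = {"n": 1.0}
    trans_p = {"n": {"iy": 1.0}, "iy": {"d": 1.0}, "d": {}}
    emit_p = {"n": {"n": 0.8, "m": 0.2},
              "iy": {"iy": 0.9, "ih": 0.1},
              "d": {"d": 0.7, "t": 0.3}}
    print(viterbi(["n", "iy", "d"], states, start_p, trans_p, emit_p))
    # -> (approximately 0.504, ['n', 'iy', 'd'])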


b) What are the main limitations of Viterbi Decoder?
Ans: There are two main limitations of the Viterbi decoder:
         1. First, the Viterbi decoder does not actually compute the sequence of words
which is most probable given the input acoustics. Instead, it computes an approximation
to this: the sequence of states (i.e. phones or subphones) which is most probable given
the input.

       2. A second problem with the Viterbi decoder is that it cannot be used with all
possible language models. In fact, the Viterbi algorithm as we have defined it cannot
take complete advantage of any language model more complex than a bigram grammar.
c) What do you mean by word lattice? Give proper example.
Ans: Word lattice: A word lattice is a directed graph of words and links between them
which can compactly encode a large number of possible sentences. Each word in the
lattice is augmented with its observation likelihood, so that any particular path through
the lattice can then be combined with the prior probability derived from a more
sophisticated language model.




                 Fig(5): A visual representation of the implicit lattice of allowable
                 word sequences which defines a language. The set of sentences of
                 a language is far too large to represent explicitly, but the lattice
                 gives a metaphor for exploring substrings of these sentences.
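
As a toy illustration (the graph structure and all numbers below are invented), a lattice
can be stored as an adjacency list whose edges carry words and their observation
likelihoods, and a path can then be scored against a language-model prior:

    # Toy word lattice: node -> list of (word, acoustic likelihood, next node).
    lattice = {
        0: [("I", 0.9, 1)],
        1: [("need", 0.6, 2), ("kneed", 0.3, 2)],
        2: [("a", 0.7, 3), ("uh", 0.2, 3)],
        3: [],
    }

    def path_score(edges, lm_prior):
        """Combine per-word acoustic likelihoods along a lattice path with
        the prior probability of the word sequence from a language model."""
        score = lm_prior
        for _word, acoustic, _next in edges:
            score *= acoustic
        return score

    # e.g. score the path "I need a" with a hypothetical language-model prior:
    print(path_score([("I", 0.9, 1), ("need", 0.6, 2), ("a", 0.7, 3)], 0.001))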


d) What do you mean by “Tree Structure Lexicon”? Give proper example.
Ans: A tree-structured lexicon is a lexicon that stores the pronunciations of all the words
in such a way that the computation of the forward probability can be shared for words
which start with the same sequence of phones.
      Fig(6) shows an example of a tree-structured lexicon from the Sphinx-II recognizer
(Ravishankar, 1996). Each tree root represents the first phone of all words beginning
with that context-dependent phone (phone context may or may not be preserved across
word boundaries), and each leaf is associated with a word.
                            Fig(6): A tree-structured lexicon
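
A tree-structured lexicon can be sketched as a trie keyed on phones, so that words
sharing an initial phone sequence share nodes and the forward probability along the
shared prefix is computed only once. A minimal sketch, with pronunciations written as
phone lists:

    # Minimal phone trie: words sharing an initial phone sequence share
    # nodes, so forward probabilities along the prefix are computed once.
    def build_trie(lexicon):
        root = {}
        for word, phones in lexicon.items():
            node = root
            for ph in phones:
                node = node.setdefault(ph, {})
            node["#word"] = word  # leaf marker: a complete word ends here
        return root

    trie = build_trie({
        "need": ["n", "iy", "d"],
        "knee": ["n", "iy"],
        "neat": ["n", "iy", "t"],
    })
    # "need", "knee" and "neat" all share the n -> iy prefix nodes.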


Answer of question set-3
a) What are the important characteristics of a sound wave? What are the
perceptual properties of a sound wave?
Ans: Two important characteristics of a sound wave are:
         1. Frequency: The frequency is the number of times per second that a wave repeats
itself, or cycles.
         2. Amplitude: Amplitude measures the air pressure variation. A high value on
the vertical axis (a high amplitude) indicates that there is more air pressure at that point in
time, a zero value means there is normal (atmospheric) air pressure, and a negative
value means there is lower than normal air pressure (rarefaction).


     Two important perceptual properties of a sound wave are as follows:
       1. Pitch: The pitch of a sound is the perceptual correlate of frequency; in general,
if a sound has a higher frequency we perceive it as having a higher pitch, although the
relationship is not linear, since human hearing has different acuities for different
frequencies.
       2. Loudness: The loudness of a sound is the perceptual correlate of the power, which
is related to the square of the amplitude. So sounds with higher amplitudes are perceived
as louder, but again the relationship is not linear.
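
For instance, a pure tone is fully described by these two quantities. The snippet below
(assuming NumPy is available) generates one second of a sine wave with a given
frequency and amplitude:

    import numpy as np

    sample_rate = 16000                         # samples per second
    t = np.arange(sample_rate) / sample_rate    # 1 second of time points
    frequency = 440.0                           # cycles per second (Hz)
    amplitude = 0.5                             # peak pressure deviation
    wave = amplitude * np.sin(2 * np.pi * frequency * t)
    # wave > 0: higher than atmospheric pressure; wave < 0: rarefaction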


b) What is Spectrum? What method is used to compute it?
Ans: Spectrum: The description of a signal using the frequency domain and containing
all of its components is called the frequency spectrum of that signal.
        It can be computed using the Fourier transform.
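
Continuing the sine-wave example above, the magnitude spectrum can be computed with
the discrete Fourier transform (via NumPy's FFT routines); the peak falls at the tone's
frequency:

    import numpy as np

    sample_rate = 16000
    t = np.arange(sample_rate) / sample_rate
    signal = 0.5 * np.sin(2 * np.pi * 440.0 * t)    # the 440 Hz tone from above

    spectrum = np.abs(np.fft.rfft(signal))          # magnitude of each component
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    print(freqs[np.argmax(spectrum)])               # ~440.0 Hz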


c) Briefly describe feature extraction process of a speech signal.
Ans: The feature extraction process starts with the sound wave and ends with a feature
vector. It has two steps:
i) Digitization: An input sound wave is first digitized. This process of analog-to-digital
conversion has three steps:
  a. Sampling: A signal is sampled by measuring its amplitude at a particular time.
  b. Quantization: The process of representing a real-valued number as an integer is
called quantization.
  c. Coding: In the coding process each integer value is represented by a binary value.
ii) Feature vector generation: Once a waveform has been digitized, it is converted to
some set of spectral features. Features may be LPC (Linear Predictive Coding) features
or PLP (Perceptual Linear Predictive) features.
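
The three digitization steps can be illustrated in a few lines (assuming NumPy; the
300 Hz test signal and 8-bit depth are arbitrary choices):

    import numpy as np

    # Sampling: measure the amplitude of a (here synthetic) analog signal
    # at discrete points in time.
    sample_rate = 8000                              # samples per second
    t = np.arange(0, 0.01, 1.0 / sample_rate)       # 10 ms of time points
    samples = 0.9 * np.sin(2 * np.pi * 300 * t)     # real-valued amplitudes

    # Quantization: map each real-valued amplitude onto one of 256
    # integer levels (8-bit depth, chosen only as an example).
    levels = 256
    quantized = np.round((samples + 1.0) / 2.0 * (levels - 1)).astype(int)

    # Coding: represent each integer level as a binary value.
    coded = [format(q, "08b") for q in quantized]
    print(coded[:3])    # three 8-bit codes, one per sample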




d) Define Vector Quantization. Briefly describe Concatenative Synthesis.
Ans: Vector Quantization: One way to compute probabilities on feature vectors is to
first cluster them into discrete symbols that we can count; we can then compute the
probability of a given cluster just by counting the number of times it occurs in some
training set. This method is usually called vector quantization.
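
A minimal sketch of the idea, using k-means clustering to build the codebook (assuming
scikit-learn is available; a real recognizer would cluster spectral feature vectors rather
than random data):

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy feature vectors (a real system would use spectral features).
    rng = np.random.default_rng(0)
    features = rng.normal(size=(500, 13))       # 500 frames, 13-dim features

    # Build a codebook of 16 prototype vectors by clustering...
    codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(features)

    # ...then quantize: each frame becomes the index of its nearest prototype.
    symbols = codebook.predict(features)        # discrete symbols we can count
    p_symbol = np.bincount(symbols, minlength=16) / len(symbols)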
Concatenative Synthesis: Concatenative synthesis is based on a database of speech that has been recorded by
a single speaker. This database is then segmented into a number of short units, which can
be phones, diphones, syllables, words or other units. The simplest sort of synthesizer
would have phone units and the database would have a single unit for each phone in the
phone inventory. By selecting units appropriately, we can generate a series of units which
match the phone sequence in the input. By using signal processing to smooth joins at the
unit edges, we can simply concatenate the waveforms for each of these units to form a
single synthetic speech waveform. The following figure shows the Concatenative
Synthesis Technique.




                     Fig(7): Concatenative Synthesis Technique
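
The core of the technique, selecting one unit per target phone and joining the waveforms,
can be sketched as follows. The random "unit database" and the simple linear crossfade
are stand-ins for real recorded units and real signal-processing smoothing:

    import numpy as np

    def concatenate_units(phone_seq, unit_db, overlap=64):
        """Join one recorded unit per phone, crossfading `overlap` samples
        at each boundary to smooth the joins (a stand-in for real DSP)."""
        out = unit_db[phone_seq[0]].copy()
        fade = np.linspace(0.0, 1.0, overlap)
        for ph in phone_seq[1:]:
            unit = unit_db[ph]
            out[-overlap:] = out[-overlap:] * (1 - fade) + unit[:overlap] * fade
            out = np.concatenate([out, unit[overlap:]])
        return out

    # Toy "database": one short synthetic waveform per phone.
    unit_db = {ph: np.random.default_rng(i).normal(size=800)
               for i, ph in enumerate(["n", "iy", "d"])}
    speech = concatenate_units(["n", "iy", "d"], unit_db)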

				