Speech Technology A Practical Introduction Fall 2005

					                  Overview on
            Text to Speech Systems
            (Workshop Talk at IIT Kharagpur, Mar 4-5, 2009)



                           Kishore Prahallad
                         Email: kishore@iiit.ac.in

International Institute of Information Technology (IIIT) Hyderabad, India
                                     &
     Language Technologies Institute, Carnegie Mellon University


                    Kishore Prahallad (kishore@iiit.ac.in), IIIT-Hyderabad
                        Topics
• Overview & Components of a Text to
  Speech System
• Text Normalization
• Linguistic Analysis
• Speech Generation
  – Formant Synthesis
  – Concatenative Synthesis
  – Statistical Parametric Synthesis
A Text to Speech (TTS) system converts text
into spoken language

  Text: "Welcome to the world of text to speech systems…"
     -> Text to Speech System -> Speech




         Types of TTS Systems
• Limited domain
  – Voice built specifically for an application
     •   Limited set of words and sentences
     •   Weather forecasts
     •   Air/Rail Travel information systems
     •   Agriculture information systems, etc.
• Unrestricted
  – A generic voice capable of reading anything!
     • News Reading
     • Story-telling
     • Desktop assistant etc
    How to synthesize speech?
• Record a set of phones (say /a/,
  /aa/, /i/, /ii/, /k/, /kh/)
• Given a text, for each word obtain
  the sequence of phones to be
  concatenated
   – For example: amma -> /a/ /m/ /m/ /a/
• Concatenate the *pre-recorded
  phones* to get the speech?

• No! Simple concatenation of phones recorded in isolation is not enough.
   What needs to be incorporated
              then?
• Coarticulation
  – Coupling effect, when two sounds are produced
    together
  – Production of /k/ and /a/ in isolation is different from
    producing /ka/
• Prosody
  – Energy: suitable energy contour
  – Pitch: pitch and its contour (variation across the phones)
  – Duration: how long each phone should be
        Architecture of a TTS System

Text
  -> Pre-processing
       • Document structure detection
       • Conversion from Unicode and fonts
  -> (tagged text) -> Text Normalization
       • Handling numbers, symbols, abbreviations etc.
  -> (word sequence) -> Linguistic Analysis
       • Part of speech tagging
       • Phrase breaks
       • Letter to Sound Rules
  -> (phone sequence) -> Prosodic Prediction
       • Duration
       • F0 Contour
       • Energy
  -> Waveform Generation
  -> Speech
          Why Preprocessing?
• Is the input to a TTS system a sequence of
  phones?
   – NO! NO! NO!


• The input is *text*
   – Raw text
   – Formatted text (MS Word, PDF/PS, MS PPT)
   – Tagged text (XML-like tags used as markup for
     synthesis)
   – Encoded text
      • Multilingual text in Unicode, font-specific encodings, etc.
       Preprocessing contd...
• Conversion from different formats
  (pdf/ps/doc) to a generic tagged format or
  raw text

• Handle Multilingual Text in Unicode
  – Unicode is similar to the ASCII table, but it
    can represent text in practically any language in the
    world

         Text Normalization
• An advanced TTS system should be able to handle
  non-standard words

• Standard words are those that have an entry
  in a pronunciation dictionary

• A pronunciation dictionary maps a word to a
  sequence of phones.
  Abbreviations and Acronyms
• Title
  – Dr., MD, Mr., Mrs., St. (Saint), etc.
• Measure
  – ft., Hz, mm, cm, in, kg
• Place names
  – CO, LA, PA, USA, IN, St. (street), Dr. (drive)
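
The list above includes abbreviations with more than one expansion: St. can be Saint or Street, and Dr. can be a title or Drive. A minimal sketch of context-based expansion follows. The table and the heuristic (a following capitalized word signals a title) are illustrative assumptions, not a rule given in the talk, and a real system would need much richer context.

```python
# Hypothetical table: abbreviation -> (title reading, place-name reading).
AMBIGUOUS = {
    "St.": ("Saint", "Street"),
    "Dr.": ("Doctor", "Drive"),
}


def expand_abbreviations(text):
    """Expand ambiguous abbreviations using the next token as context."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if tok in AMBIGUOUS:
            title, place = AMBIGUOUS[tok]
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Assumed heuristic: a capitalized next word means a title,
            # anything else (or end of text) means a place-name suffix.
            out.append(title if nxt[:1].isupper() else place)
        else:
            out.append(tok)
    return " ".join(out)


print(expand_abbreviations("Dr. Rao lives on Park Dr."))
```

The heuristic fails on inputs such as "Main St. Hyderabad", which is one reason text normalization is usually treated as a disambiguation problem rather than a table lookup.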



                 Numbers
• Phone numbers
  – +91-40-23001412, (717)-809-8099
• Dates
  – mm/dd/yy, dd/mm/yy, July 4 05, 12-04-05
• Times
  – 13:00, 1:00 PM, 12:15:35
• Money
  – $20, 300 €
                      Numbers
• Account numbers
  – 13 digit, 9 digit numbers
• Ordinal numbers
  – 1st, 2nd, 1000th (and fractions: ½, ¼, 1/100)
• Cardinal numbers - Amounts, statements
• The same digits can be read in several ways, e.g. 2426
  – two four two six
  – twenty four twenty six
  – two thousand four hundred and twenty six
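
The three readings of 2426 can be sketched as three small functions. This is only an illustration for numbers up to 9999, and the function names are hypothetical. Choosing which reading applies (account number, year, amount) is the harder part and is not shown.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Words for 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def as_digits(s):
    """Digit-by-digit reading, e.g. for account or phone numbers."""
    return " ".join(ONES[int(ch)] for ch in s)


def as_pairs(s):
    """Read a 4-digit string as two 2-digit numbers."""
    return two_digits(int(s[:2])) + " " + two_digits(int(s[2:]))


def as_cardinal(n):
    """Cardinal reading for 0..9999, with 'and' before the last group."""
    if n < 100:
        return two_digits(n)
    thousands, rest = divmod(n, 1000)
    hundreds, last = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if last:
        parts.append("and " + two_digits(last))
    return " ".join(parts)


print(as_digits("2426"))
print(as_pairs("2426"))
print(as_cardinal(2426))
```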

               Linguistic Analysis
• Part of Speech (POS) Tagging
   – Proper noun/verb/adjective etc
   – A mapping table: word –> pos_tag
   – Statistically trained or manually prepared

• Prosodic Phrase breaks
   –   POS tags are useful to predict phrase breaks in a sentence
   –   Ex: man’triji ne kahaa ki aaj hamaare desh
            -> man’triji ne kahaa ki [pau] aaj hamaare desh
   –   “ki” is a conjunction (“that”), and we give a short pause after it while speaking
   –   Task is to predict these phrase breaks, so that short pauses can
       be introduced during synthesis
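
A minimal sketch of rule-based phrase break prediction for the example above. It assumes a hand-made word -> pos_tag table and a single hypothetical rule (insert a pause after a conjunction). The tag names are illustrative, and a statistically trained predictor would replace the rule in practice.

```python
# Hypothetical mapping table word -> pos_tag covering only the example sentence.
POS_TABLE = {
    "man'triji": "NOUN",
    "ne": "POSTPOSITION",
    "kahaa": "VERB",
    "ki": "CONJUNCTION",
    "aaj": "ADVERB",
    "hamaare": "PRONOUN",
    "desh": "NOUN",
}


def insert_phrase_breaks(words, pos_table, break_after=("CONJUNCTION",)):
    """Insert a [pau] marker after every word whose tag is in break_after."""
    out = []
    for i, word in enumerate(words):
        out.append(word)
        is_last = i == len(words) - 1
        # No pause marker is added after the final word of the sentence.
        if not is_last and pos_table.get(word) in break_after:
            out.append("[pau]")
    return out


sentence = "man'triji ne kahaa ki aaj hamaare desh".split()
print(" ".join(insert_phrase_breaks(sentence, POS_TABLE)))
```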


        Letter to Sound Rules
• Given a word, output the sequence of phones
• How?
  – Pronunciation dictionary
  – Maps the spelling to a sequence of phones
  – ftp://ftp.cs.cmu.edu/afs/cs.cmu.edu/data/anonftp/project/fgdata/dict/cmudict.0.4
  – An entry looks like
     • SPEECH    S P IY1 CH
     • [word]    [phones]
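
A dictionary in the "[word] [phones]" layout shown above can be loaded into a lookup table with a few lines. This is a sketch only: it assumes whitespace-separated fields and that lines starting with ";;;" are comments. The function names are hypothetical.

```python
def parse_dictionary(lines):
    """Parse 'WORD  PH1 PH2 ...' lines into {word: [phones]}."""
    lexicon = {}
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and ';;;' comment lines carry no entries.
        if not line or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        lexicon[word] = phones
    return lexicon


def lookup(lexicon, word):
    """Return the phone sequence, or None if the word is not in the dictionary."""
    return lexicon.get(word.upper())


lexicon = parse_dictionary([";;; sample entry from the slide", "SPEECH    S P IY1 CH"])
print(lookup(lexicon, "speech"))
```

A None result marks an out-of-dictionary word, which is where letter to sound rules take over.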



   What if there is no pronunciation
              Dictionary?
• If there is no pronunciation dictionary, use a set of simple rules

• Example: Indian languages

    – A direct correspondence between what is written and what is spoken
    – Hindi Word: namaskaara -> /n/ /a/ /m/ /a/ /s/ /k/ /aa/ /r/ $
    – Note: last /a/ -> $ (null)

    – /a/ is a short vowel often referred to as schwa
    – Process of mapping /a/ -> $ is known as schwa deletion
    – Schwa deletion can be captured using a set of simple rules
        • Ex: when /a/ occurs at the end of word map it to $

• Letter to Sound rules can also be learnt using statistical models
    – CART, HMM, Neural Networks
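
The word-final schwa deletion rule can be sketched as follows. The romanized input and the small inventory of multi-letter phone symbols are assumptions for illustration. Only the single rule from the slide is implemented, so word-internal schwa deletion is not handled.

```python
# Assumed multi-letter phone symbols in the romanization used on the slides.
MULTI_LETTER = ("aa", "ii", "uu", "kh")


def to_phones(word):
    """Split a romanized word into phones, preferring two-letter symbols."""
    phones, i = [], 0
    while i < len(word):
        two = word[i:i + 2]
        if two in MULTI_LETTER:
            phones.append(two)
            i += 2
        else:
            phones.append(word[i])
            i += 1
    return phones


def delete_final_schwa(phones):
    """Rule from the slide: a word-final /a/ maps to null."""
    if len(phones) > 1 and phones[-1] == "a":
        return phones[:-1]
    return phones


print(delete_final_schwa(to_phones("namaskaara")))
```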

      Wave-Form Generation
• Formant Synthesis
• Concatenative Synthesis
  – Diphone synthesis
  – Unit selection synthesis
• Statistical Parametric Synthesis




                   Formant Synthesis
  • Each phone is produced by specifying the
    Formants and pitch
  • A set of rules is also specified to modify pitch
    and formants, so that the transition from one phone
    to another is sufficiently smooth

Text -> • Preprocessing          -> Phones -> Formant Synthesizer -> Speech
        • Text Normalization
        • Linguistic analysis

Knowledge-base (manually built), used by the Formant Synthesizer:
  • Formants
  • Pitch
  • Rules to generate the transitions (co-articulation)
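
To make "specifying the formants and pitch" concrete, here is a minimal sketch of a steady vowel produced by passing an impulse train at the pitch frequency through a cascade of second-order resonators, one per formant. The formant and bandwidth values are rough assumed figures for an /a/-like vowel, and the transition rules from the knowledge-base are not modelled at all.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order digital resonator modelling a single formant."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Voiced excitation: one unit impulse per pitch period."""
    n = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, f0_hz=120.0, duration_s=0.3, sample_rate=16000):
    """Filter the excitation through one resonator per (freq, bandwidth) pair."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Assumed, approximate values: (formant frequency in Hz, bandwidth in Hz).
A_LIKE = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = synthesize_vowel(A_LIKE)
print(len(samples))
```

Changing f0_hz or the formant list changes the voice directly, which is the flexibility that formant synthesis offers. Making the values vary smoothly from phone to phone is what the hand-built rules are for.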
     Formant Synthesis contd..
• Formant synthesizers were deployed commercially
  in the late 1970s and early 1980s
   – DECtalk
• Pros
   – Flexible: parameters can be changed
   – Generates intelligible speech with a small number of
     parameters
• Cons
   – Synthesized speech is not natural
   – Knowledge base has to be built manually (not an
     easy task)
   – Each new language needs a brand new effort


 Concatenative Speech Synthesis
• Don’t build a knowledge-base – instead record a
  speech database and *select* the required
  phone
  – Needs a speech database and disk space to store
  – Needs CPU time to select the segment
  – Practical from the engineering perspective

• Then why did people build formant synthesizers?
  – Motivation from speech science
  – Disk space and CPU time were much costlier in the 1970s and
    1980s
               A Typical Architecture of
               Concatenative Synthesis

Text -> • Preprocessing          -> Phones -> Unit Selection Algorithm -> Speech
        • Text Normalization
        • Linguistic analysis

The Unit Selection Algorithm draws its units from a recorded speech database.

A unit can be a phone OR a set of phones. If the set of phones
corresponds to a word, then the unit is a word.

                   Choice of Unit
• Word as a unit
   – A large number of units to store
   – Difficult to ensure coverage of all possible words (proper nouns
     etc).
   – Useful for limited domain

• Phone as a unit
   – No coarticulation present!

• Diphone as a unit
   – Preserves the transition region between two phones and thus
     coarticulation is present
   – Widely used unit for concatenation

                What is a diphone?

[Figure: Phone 1 followed by Phone 2]

• A diphone starts at the middle of the first phone
  and ends at the middle of the second phone

• Preserves the transient region between two phones
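
Given a phone sequence, the diphones needed to synthesize it are the adjacent phone pairs. The sketch below pads the sequence with a silence symbol so the word-initial and word-final transitions are covered too. The "pau" symbol and the function name are illustrative assumptions.

```python
def phones_to_diphones(phones, silence="pau"):
    """List the diphones (adjacent phone pairs) needed for a phone sequence."""
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(phones_to_diphones(["a", "m", "m", "a"]))
```

A sequence of N phones therefore needs N + 1 diphones when silence is counted at both ends.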




  How to build a Diphone Voice
• Record all possible phone-phone combinations
  in a language
  – Example, record ka, ku, ki, kii, …kk, ks, kj..
  – Some combinations may not occur!!

• From each of the phone-phone recordings,
  manually label the diphone boundaries
  – Tools such as Emulabel display the waveform and
    allow you to label the boundaries

• Pool all the diphones to form a diphone
  database
       A Typical Architecture of Diphone
                   Synthesis

Text -> • Preprocessing          -> Phones -> Concatenation &        -> Speech
        • Text Normalization                  Prosodic Manipulation
        • Linguistic analysis

Inputs to Concatenation & Prosodic Manipulation:
  • Prosodic rules
  • Diphone units (speech database)

Pros & Cons of Diphone Synthesis
• Advantages over formant synthesis
  – Easy to adapt to a new language
  – Makes use of recorded speech as opposed to
    modeling the formants and their transitions
• Cons
  – Needs explicit modeling of prosody
  – Output: Intelligible, but not natural


Diphone to Unit selection synthesis
• Formant to diphone
  – Avoids the building of a knowledge-base
  – Makes use of recorded speech

• Diphone to unit selection
  – Avoids prosodic modeling
  – Speech database consists of multiple examples of
    each diphone
     • Record a diphone several times but in different contexts
     • Store diphone units with varying prosody
     • Don’t model the prosody, BUT *select* a diphone with
       suitable prosody


         A Typical Architecture of Unit-
             Selection Synthesis

Text -> • Preprocessing          -> Phones -> Unit Selection -> Speech
        • Text Normalization
        • Linguistic analysis

Unit Selection draws on: n - Diphone units (large speech database)

 Building a Unit Selection Voice
• Take newspaper text, say about 2000
  sentences
     • A more careful approach is to make sure that these 2000
       sentences have good coverage of all possible diphones


• Read the sentences aloud one by one to create 1-2
  hours of speech

• Recording should be done in a quiet
  environment
  – Speech can be recorded using a desktop computer
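
One way to pick sentences with good diphone coverage is a greedy selection: repeatedly take the sentence that adds the most diphones not yet covered. This is a sketch of that idea, not necessarily the procedure used for the voices in this talk. It assumes each sentence has already been converted to a phone sequence, and the names are hypothetical.

```python
def diphones_of(phones):
    """Set of adjacent phone pairs in one sentence."""
    return set(zip(phones, phones[1:]))


def select_sentences(corpus, max_sentences):
    """Greedily choose sentence indices that maximize new diphone coverage.

    corpus is a list of phone sequences; returns (chosen indices, covered set).
    """
    remaining = {i: diphones_of(p) for i, p in enumerate(corpus)}
    covered, chosen = set(), []
    while remaining and len(chosen) < max_sentences:
        # Largest gain wins; ties go to the earlier sentence.
        best = max(remaining, key=lambda i: (len(remaining[i] - covered), -i))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining sentence adds a new diphone
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


corpus = [["k", "a", "m", "a"], ["k", "a"], ["m", "i", "k", "a"]]
chosen, covered = select_sentences(corpus, max_sentences=2000)
print(chosen, len(covered))
```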
    Build Process: A higher level view
•   Goal – Automatically *extract* the diphones from the speech database and
    *index* them

•   Automatic Extraction
     – Label the phone boundaries in each spoken sentence. This task is
       referred to as speech segmentation.
     – Speech segmentation is performed by HMMs (Neural Networks could also be
       used)
     – Given the phone boundaries, approximate the diphone boundaries to
       obtain diphone-like units

•   Indexing
     – For each diphone-like unit, store context information
     – Context information: previous phone, next phone, position in the syllable,
       position in the word, etc.
     – There could be thousands of units for each type
     – For each unit type, build a decision tree to split the thousands of units into
       several sub-clusters
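
The step from phone boundaries to diphone-like units can be sketched by taking the midpoint of each phone as the approximate diphone boundary. The midpoint is an assumed approximation (consistent with a diphone running from the middle of one phone to the middle of the next), and the (phone, start, end) tuple layout is hypothetical.

```python
def diphone_units(segments):
    """Approximate diphone-like units from labelled phone segments.

    segments is a list of (phone, start_s, end_s) tuples from segmentation.
    Each unit runs from the midpoint of one phone to the midpoint of the next.
    """
    units = []
    for (p1, s1, e1), (p2, s2, e2) in zip(segments, segments[1:]):
        units.append((f"{p1}-{p2}", (s1 + e1) / 2.0, (s2 + e2) / 2.0))
    return units


segments = [("k", 0.0, 0.25), ("a", 0.25, 0.75), ("m", 0.75, 1.0)]
for name, start, end in diphone_units(segments):
    print(name, start, end)
```

For indexing, each unit would additionally carry its context information (previous phone, next phone, position in the word, and so on).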


          During Synthesis..
• Given a sequence of phones
  – For each phone, traverse through the
    corresponding decision tree and arrive at a
    set of target units
• Select a unit based on how well it matches
  with the input specification and how well it
  matches with the other units in the
  sequence
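
The selection criterion above is commonly formulated as minimizing a target cost (match with the input specification) plus a join cost (match with the neighbouring unit) over the whole sequence. Those two terms are standard terminology rather than wording from the slide. A minimal dynamic-programming sketch follows, with the cost functions supplied by the caller. Everything in the usage example, including the use of a single pitch value per unit, is a toy assumption.

```python
def select_units(candidates, target_cost, join_cost):
    """Choose one unit per position, minimizing total target + join cost.

    candidates[t] is the non-empty list of candidate units for position t.
    target_cost(t, unit) and join_cost(prev_unit, unit) return numbers.
    """
    # best[t][j] = (cost of the best path ending in candidates[t][j], back-pointer)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for t in range(1, len(candidates)):
        row = []
        for u in candidates[t]:
            cost, back = min(
                (best[t - 1][k][0] + join_cost(prev, u), k)
                for k, prev in enumerate(candidates[t - 1])
            )
            row.append((cost + target_cost(t, u), back))
        best.append(row)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][j])
        j = best[t][j][1]
    return path[::-1]


# Toy example: each unit is represented only by its pitch in Hz.
targets = [120, 110]
candidates = [[100, 130], [105, 160]]
chosen = select_units(
    candidates,
    target_cost=lambda t, u: abs(u - targets[t]),
    join_cost=lambda a, b: abs(a - b),
)
print(chosen)
```

In the toy example the first position's closest match to its target is 130, but 100 is selected because it joins more smoothly with the best unit for the second position.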

Pros and Cons of Unit Selection
• For the best examples, quality is high!
  – Quality varies from high to, quite often, poor because of
    badly selected (or missing) units
• Output strongly resembles the style of the
  recorded speech
  – Hard to modify the characteristics for varying
    style, emotion etc.



   Statistical Parametric Synthesis
                 (SPS)
• Speech synthesized from parameters
• Parametric models are trained from speech data
   – In contrast, older non-statistical techniques such as DECtalk
     had parameters constructed by hand
• In the Blizzard Challenges of 2006-08, voices built with SPS
  techniques were rated higher by native listeners
   – Consistency in the quality of the voice
   – Reached a mature level where the quality is quite
     acceptable
• Hidden Markov Model based (HTS), Decision
  Tree Based (CLUSTERGEN) etc.
          Basic Technique
• Speech Parameter generation from HMM
  with the use of dynamic (delta) features
• Speech synthesis from Mel-cepstrum
  – A vocoding technique based on Mel-cepstrum
  – F0 used for excitation generation
• F0 pattern modeling using HMMs
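
Dynamic (delta) features describe how each parameter changes from frame to frame. The sketch below only computes delta features from a sequence of static parameter vectors. It assumes a simple window, 0.5 * (c[t+1] - c[t-1]) with the edge frames replicated, and it is not the parameter generation algorithm itself, which uses such features as constraints when producing a smooth trajectory.

```python
def delta(frames):
    """Delta features for a list of per-frame parameter vectors.

    Assumed window: 0.5 * (c[t+1] - c[t-1]), with edge frames replicated.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([0.5 * (b - a) for a, b in zip(prev, nxt)])
    return out


static = [[0.0], [1.0], [4.0]]
print(delta(static))
```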



        HMM Based Speech Synthesis




Ref: http://hts.ics.nitech.ac.jp/




Comparison of diphone, unit selection and
         HTS voices

[Samples: diphone, unit selection, HTS]




         Open Source Tools
• Festival – multi-lingual speech synthesis
  engine
  – http://festvox.org/festival/index.html
• Festvox – A set of tools to create a new
  voice in a new language
  – www.festvox.org




              Further Directions
• Reading Style
  – Commonly used mode for many applications
• Emphasis
  – Story-Telling
• Emotional
  – Neutral, Sad, Happy and Anger Moods
• Stylistic
  – Specific to a speaker
                              References
•   http://festvox.org
•   11-752 CMU course slides
     – http://festvox.org/festtut/
•   11-752 CMU Course Lecture Notes
     – http://festvox.org/festtut/notes/festtut_toc.html
•   Building Synthetic Voices
     – http://www.festvox.org/bsv/
•   The Festival Speech Synthesis System
     – http://www.festvox.org/docs/manual-1.4.3/festival_toc.html
•   Black, A. (2006), CLUSTERGEN: A Statistical Parametric Synthesizer using
    Trajectory Modeling, Interspeech 2006 - ICSLP, Pittsburgh, PA.
•   Black, A., Zen, H., and Tokuda, K. (2007), Statistical Parametric Speech Synthesis,
    ICASSP 2007, Hawaii.
•   Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., and Kitamura, T. (2000),
    Speech parameter generation algorithms for HMM-based speech synthesis,
    ICASSP 2000, Istanbul, Turkey.

