

             Unit Selection Synthesis in Indian Languages
            (Workshop Talk at IIT Kharagpur, Mar 4-5, 2009)

                           Kishore Prahallad
                         Email: kishore@iiit.ac.in

International Institute of Information Technology (IIIT) Hyderabad, India
     Language Technologies Institute, Carnegie Mellon University

                    Kishore Prahallad (kishore@iiit.ac.in), IIIT-Hyderabad
    Building an Unrestricted Voice
• Build Language Specific Knowledge
    – Define phone set
    – Define stress and syllabification rules
    – Define letter to sound rules
• Optimal text collection
• Recording of speech
• Speech labeling
• Unit clustering
• This session is a live demo of running
    Festvox scripts to build a Hindi voice
      Creation of Unit Speech
• Text selection:
  – A large corpus is costly to record and
    hand label
• Optimal text selection approaches
  – Start from a large text corpus
  – Extract a set of sentences that gives the best unit
    (phone/diphone/triphone/syllable) coverage
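One common way to approximate "best unit coverage" is a greedy selection: repeatedly pick the sentence that adds the most units not yet covered. The sketch below assumes diphones as the unit and assumes sentences are already converted to phone lists; the function names and the toy corpus are illustrative, not part of Festvox.

```python
from typing import List, Set, Tuple


def diphones(phones: List[str]) -> Set[Tuple[str, str]]:
    """Return the set of adjacent phone pairs (diphones) in one sentence."""
    return set(zip(phones, phones[1:]))


def select_sentences(corpus: List[List[str]], max_sentences: int) -> List[int]:
    """Greedily pick sentence indices that add the most unseen diphones."""
    covered: Set[Tuple[str, str]] = set()
    chosen: List[int] = []
    remaining = set(range(len(corpus)))
    while remaining and len(chosen) < max_sentences:
        # Score each remaining sentence by how many new units it contributes.
        best = max(sorted(remaining),
                   key=lambda i: len(diphones(corpus[i]) - covered))
        gain = diphones(corpus[best]) - covered
        if not gain:
            break  # no remaining sentence adds coverage; stop early
        covered |= gain
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy corpus (hypothetical phone strings).
corpus = [
    ["k", "a", "m", "a", "l"],
    ["k", "a", "m"],
    ["l", "a", "k"],
]
print(select_sentences(corpus, max_sentences=2))
```

Greedy selection is not guaranteed to find the smallest covering set, but it is simple and keeps the recording script short.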

    Recording of speech data
• Ideal conditions
  – Anechoic chamber
  – Studio recording
  – Professional speaker
• Practical conditions
  – Lab environments
  – Good voices
  – Steps may need to be repeated to create a good unit
    selection voice

     Labeling of Speech Data
• Automatic Labeling
  – Use Dynamic Time Warping (DTW) techniques, if duration models
    are available
  – Use HMMs / Neural Nets for automatic segmentation
    of the data
• Semi-Automatic Labeling
  – Machine Labeling + Hand Correction
  – Tools: Emulabel (www.festvox.org/emu), Wavesurfer
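A minimal sketch of DTW-based labeling, assuming a reference utterance whose phone boundaries are already known (for example a synthesized prompt) and a recording of the same text. Both are assumed to be lists of feature frames; the helper names are illustrative and the toy frames are made up.

```python
from typing import List, Sequence, Tuple


def frame_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dtw_path(ref: List[Sequence[float]],
             rec: List[Sequence[float]]) -> List[Tuple[int, int]]:
    """Return the minimum-cost warping path as (ref_frame, rec_frame) pairs."""
    n, m = len(ref), len(rec)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(ref[i - 1], rec[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    # Backtrack from the last frame pair to the first.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    return path[::-1]


def transfer_boundaries(path: List[Tuple[int, int]],
                        ref_boundaries: List[int]) -> List[int]:
    """Map each reference boundary frame to the first recorded frame aligned to it."""
    first_match = {}
    for r, c in path:
        first_match.setdefault(r, c)
    return [first_match[b] for b in ref_boundaries]


# Toy example: the second phone starts at reference frame 2.
ref = [[0.0], [0.0], [5.0], [5.0]]
rec = [[0.0], [0.0], [0.0], [5.0], [5.0], [5.0]]
print(transfer_boundaries(dtw_path(ref, rec), [2]))
```

The transferred boundaries are machine labels; hand correction in a labeling tool would follow in the semi-automatic workflow.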

  Building Databases (Training Phase)
• Get the phonemic features for each unit along
  with previous & next unit information
  –   Previous, next unit
  –   Consonant / vowel
  –   Vowel length
  –   Vowel height
  –   Vowel frontness
  –   Consonant voicing
  –   Consonant place of articulation (POA)
  –   Consonant manner of articulation (MOA)
  –   Position in the syllable and word
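The feature extraction above can be sketched as a lookup plus context. The phone table below is a hypothetical miniature with made-up feature labels, not the phone set of any real voice, and the whole phone list is treated as a single word for the position feature.

```python
from typing import Dict, List

# Hypothetical miniature phone table, for illustration only.
PHONE_TABLE: Dict[str, Dict[str, str]] = {
    "k": {"type": "consonant", "voicing": "unvoiced", "poa": "velar", "moa": "stop"},
    "m": {"type": "consonant", "voicing": "voiced", "poa": "labial", "moa": "nasal"},
    "a": {"type": "vowel", "length": "short", "height": "mid", "frontness": "central"},
    "aa": {"type": "vowel", "length": "long", "height": "low", "frontness": "central"},
}


def position(index: int, length: int) -> str:
    """Coarse position of a phone within its word."""
    if index == 0:
        return "initial"
    if index == length - 1:
        return "final"
    return "medial"


def unit_features(phones: List[str], index: int) -> Dict[str, str]:
    """Phonemic features of one unit, plus its previous and next unit."""
    feats = {
        "name": phones[index],
        # "pau" (pause) stands in for a missing neighbour at the word edge.
        "prev": phones[index - 1] if index > 0 else "pau",
        "next": phones[index + 1] if index + 1 < len(phones) else "pau",
        "pos_in_word": position(index, len(phones)),
    }
    feats.update(PHONE_TABLE[phones[index]])
    return feats


print(unit_features(["k", "aa", "m"], 1))
```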
  Clustering the Units (Training Phase)
• For each unit type, create a decision tree
• Select a feature as the root of the tree, such that it
  minimizes the acoustic distances within its child nodes
   – Acoustic distance is measured between two sound units of varying
     duration
   – Use simple linear alignment, or Dynamic
     Programming, for the acoustic distance measure (ADM)
• Repeat the process with each child node until
  you have 10-30 units left in that cluster
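The clustering loop above can be sketched as a recursive split. This is a simplified illustration, not the tree builder Festival itself uses: each unit is assumed to be already reduced to a fixed-length acoustic vector (so no alignment is needed), the impurity is the sum of pairwise distances, and all names and data are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

Question = Tuple[str, str]  # (feature name, value), asked as "is feature == value?"


def distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two fixed-length acoustic vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def impurity(units: List[Dict]) -> float:
    """Sum of pairwise acoustic distances within one cluster."""
    return sum(distance(u["acoustic"], v["acoustic"])
               for u, v in combinations(units, 2))


def split(units: List[Dict], question: Question):
    feat, value = question
    yes = [u for u in units if u["feats"].get(feat) == value]
    no = [u for u in units if u["feats"].get(feat) != value]
    return yes, no


def best_split(units: List[Dict], questions: List[Question]) -> Optional[Question]:
    """Return the question whose yes/no split gives the lowest total impurity."""
    best, best_score = None, impurity(units)
    for q in questions:
        yes, no = split(units, q)
        if not yes or not no:
            continue  # question does not separate anything
        score = impurity(yes) + impurity(no)
        if score < best_score:
            best, best_score = q, score
    return best


def build_tree(units: List[Dict], questions: List[Question], max_leaf: int = 30) -> Dict:
    """Split recursively until a cluster holds at most max_leaf units."""
    q = best_split(units, questions) if len(units) > max_leaf else None
    if q is None:
        return {"leaf": units}
    yes, no = split(units, q)
    return {"question": q,
            "yes": build_tree(yes, questions, max_leaf),
            "no": build_tree(no, questions, max_leaf)}


# Four instances of one phone with different contexts (toy values).
units = [
    {"feats": {"prev": "k", "next": "m"}, "acoustic": [1.0]},
    {"feats": {"prev": "k", "next": "l"}, "acoustic": [1.2]},
    {"feats": {"prev": "m", "next": "m"}, "acoustic": [5.0]},
    {"feats": {"prev": "m", "next": "l"}, "acoustic": [5.1]},
]
questions = [("prev", "k"), ("next", "m")]
tree = build_tree(units, questions, max_leaf=2)
print(tree["question"])
```

In this toy data the previous phone explains the acoustic variation, so the sketch picks the question on `prev` as the root.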
Indexing / Clustering using
      Decision Trees

[Figure: decision tree; surviving label: "Linguistic / Contextual"]

    Synthesis (Testing Phase)
• Given the sequence of phones
• For each phone, create a set of phonemic
  features (Feature set is same as that of
  training Phase)
• Traverse the tree and arrive at a
  leaf node
• The leaf node contains a set of target units

    Synthesis (Testing Phase)
• Given dh, ax and k, ae, t, … as the sequence of
  phones to be synthesized
• Using decision trees: For the given
  sequence arrive at T_1, T_2 and T_3,
  where T_i is the set of target units for
  phone i.
• Use Viterbi search to choose the
  sequence of units that minimizes the
  concatenation cost
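The search over T_1, T_2, T_3 can be sketched as a standard Viterbi pass. The unit representation (name, start F0, end F0) and the F0-only join cost below are assumptions made for illustration; a real system would combine several features.

```python
from typing import Callable, List, Sequence, Tuple

Unit = Tuple[str, float, float]  # (name, F0 at start, F0 at end), hypothetical


def viterbi_select(candidates: List[Sequence[Unit]],
                   join_cost: Callable[[Unit, Unit], float]) -> List[Unit]:
    """Pick one unit per position so that the summed join cost is minimal."""
    # best[i][k]: lowest cumulative cost of any path ending in candidates[i][k]
    best = [[0.0] * len(candidates[0])]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for u in candidates[i]:
            costs = [best[i - 1][k] + join_cost(p, u)
                     for k, p in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min])
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final unit.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][k])
        k = back[i][k]
    return path[::-1]


def f0_join_cost(prev: Unit, nxt: Unit) -> float:
    """Assumed join cost: F0 mismatch at the concatenation point."""
    return abs(prev[2] - nxt[1])


T1 = [("dh_1", 100.0, 110.0), ("dh_2", 100.0, 140.0)]
T2 = [("ax_1", 138.0, 120.0), ("ax_2", 112.0, 150.0)]
T3 = [("k_1", 148.0, 100.0), ("k_2", 121.0, 100.0)]
chosen = viterbi_select([T1, T2, T3], f0_join_cost)
print([u[0] for u in chosen])
```

A target cost per candidate could be added to each `row` entry in the same loop if the leaf clusters are not treated as equally good.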
Target + Join Cost

[Figure: target cost and join cost; credit: CSTR, UK]

         Smoothing or Joining
• Where to join the two units
  – Optimal coupling: flexible joining point
  – Select the joining point that has minimal distance
  – Take the last N frames of unit U(i-1) and the first K
    frames of unit U(i), and compute the N*K frame distances
  – Pick the pair of frames with the least distance
• What is the measure of joining?
  – F0, Power
  – Cepstral Features
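A minimal sketch of the N*K search for the joining point. Frames are assumed to be feature vectors such as [F0, power, cepstral coefficients...]; here they are unweighted toy values, whereas in practice the features would usually be normalized or weighted before taking distances.

```python
from typing import List, Sequence, Tuple


def frame_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def optimal_coupling(prev_frames: List[Sequence[float]],
                     next_frames: List[Sequence[float]],
                     n: int, k: int) -> Tuple[int, int, float]:
    """Compare the last n frames of the previous unit with the first k frames
    of the next unit; return (i, j, distance) for the closest pair, meaning
    the previous unit is cut at frame i and the next unit starts at frame j."""
    tail_start = max(0, len(prev_frames) - n)
    best = None
    for i in range(tail_start, len(prev_frames)):
        for j in range(min(k, len(next_frames))):
            d = frame_dist(prev_frames[i], next_frames[j])
            if best is None or d < best[2]:
                best = (i, j, d)
    return best


# Toy frames: [F0, power]
prev_unit = [[100.0, 1.0], [110.0, 1.0], [120.0, 1.0], [130.0, 1.0]]
next_unit = [[125.0, 1.0], [119.0, 1.0], [90.0, 1.0]]
print(optimal_coupling(prev_unit, next_unit, n=2, k=2))
```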
  Building an Indian language
   _clunits iiit hin pra
Incorporate the language knowledge
1. festvox/*.phoneset.scm
2. festvox/*.durdata.scm
3. festvox/*.lexicon.scm

 Scripts of Indian Languages
• Basic units of the writing system are characters
• Characters are close to syllables: CV, CVC, CCV, VC,
  C, V units (C is consonant, V is vowel)

    क      ख         ग            घ            ङ
   /ka/   /kha/    /ga/       /gha/         /ng-a/

• Universal phone set: about 35 consonants, 18 vowels
• Almost one-to-one correspondence between what you
  write and what you speak
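To illustrate how written characters map to syllable-like C/V units, here is a rough sketch that splits a Devanagari word into orthographic characters (aksharas) using Unicode code point ranges. It is an assumption-laden simplification: it ignores anusvara, visarga and nukta, and it treats every bare consonant as carrying its inherent vowel, so it does not model schwa deletion.

```python
from typing import List

VIRAMA = "\u094D"  # suppresses the inherent vowel and forms clusters


def is_consonant(ch: str) -> bool:
    return "\u0915" <= ch <= "\u0939"


def is_independent_vowel(ch: str) -> bool:
    return "\u0905" <= ch <= "\u0914"


def is_vowel_sign(ch: str) -> bool:
    return "\u093E" <= ch <= "\u094C"


def split_aksharas(word: str) -> List[str]:
    """Split a Devanagari word into written syllable-like characters."""
    aksharas: List[str] = []
    current = ""
    for ch in word:
        # A consonant starts a new akshara unless it continues a cluster.
        starts_new = is_independent_vowel(ch) or (
            is_consonant(ch) and not current.endswith(VIRAMA))
        if starts_new and current:
            aksharas.append(current)
            current = ""
        current += ch
    if current:
        aksharas.append(current)
    return aksharas


def cv_pattern(akshara: str) -> str:
    """Describe an akshara as a C/V string, e.g. 'CV' or 'CCV'."""
    pattern = "".join("C" for ch in akshara if is_consonant(ch))
    has_vowel = any(is_independent_vowel(ch) or is_vowel_sign(ch) for ch in akshara)
    inherent = bool(pattern) and not has_vowel and not akshara.endswith(VIRAMA)
    return pattern + ("V" if has_vowel or inherent else "")


for word in ["कमल", "हिन्दी"]:
    parts = split_aksharas(word)
    print(word, parts, [cv_pattern(p) for p in parts])
```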
    Issues: Relevant to Indic
• Input text: ISCII, Unicode, and other font encodings
• Occurrence of English words in Indic scripts
   - phonetic coverage, LTS rules etc.
• Text normalization: non-standard words
• Phonetic nature?
   - schwa deletion in Hindi and Bengali
• Syllabification rules
• Stress information
      Syllable as unit size for Indian
              language TTS
• Various suggestions: phones, diphones, half phones,
  syllable-like units

What we have done:
• Built different synthesizers for different unit sizes and
  compared the alternatives
• Found the syllable to be a better unit for synthesis in
  Indian languages
• Coverage of syllables for unrestricted TTS is a major
  concern
• Visit the demo at http://speech.iiit.ac.in
    References
•   http://festvox.org
•   11-752 CMU course slides
     –   http://festvox.org/festtut/
•   11-752 CMU Course Lecture Notes
     –   http://festvox.org/festtut/notes/festtut_toc.html
•   Building Synthetic Voices
     –   http://www.festvox.org/bsv/
•   The Festival Speech Synthesis System
     –   http://www.festvox.org/docs/manual-1.4.3/festival_toc.html
•   http://www.cstr.ed.ac.uk/emasters/summer_school_2005/tutorial3/session2.pdf
•   S. P. Kishore, Alan W Black, Rohit Kumar and Rajeev Sangal, "Experiments with Unit Selection
    Speech Databases for Indian Languages", in Proceedings of National Seminar on Language
    Technology Tools: Implementations of Telugu, Hyderabad, India, 2003.
•   S. P. Kishore and Alan W Black,"Unit Size in Unit Selection Speech Synthesis", in Proceedings of
    Eurospeech, Geneva, Switzerland, 2003.
•   E. Veera Raghavendra, Srinivas Desai, B Yegnanarayana, Alan W Black, Kishore Prahallad
    "Global Syllable Set for Building Speech Synthesis in Indian Languages", in Proceedings of IEEE
    workshop on Spoken Language Technologies, Goa, India, December 2008.
•   E. Veera Raghavendra, B Yegnanarayana, Kishore Prahallad, "Speech Synthesis Using
    Approximate Matching of Syllables", in Proceedings of IEEE workshop on Spoken Language
    Technologies, Goa, India, December 2008.
