Context in Multilingual Tone and Pitch Accent Recognition

    Unsupervised and Semi-Supervised Learning
          of Tone and Pitch Accent

             Gina-Anne Levow
            University of Chicago
               June 6, 2006
                    Roadmap
• Challenges for Tone and Pitch Accent
  – Variation and Learning
• Data collections & processing
• Learning with less
  – Semi-supervised learning
  – Unsupervised clustering
     • Approaches, structure, and context
• Conclusion
Challenges: Tone and Variation
• Tone and Pitch Accent Recognition
  – Key component of language understanding
     • Lexical tone carries word meaning
     • Pitch accent carries semantic, pragmatic, discourse meaning

  – Non-canonical form (Shen 90, Shih 00, Xu 01)
     • Tonal coarticulation modifies surface realization
         – In extreme cases, fall becomes rise

  – Tone is relative
     • To speaker range
         – High for male may be low for female
     • To phrase range, other tones
         – E.g. downstep
  Challenges: Training Demands
• Tone and pitch accent recognition
  – Exploit data intensive machine learning
     • SVMs (Thubthong 01,Levow 05, SLX05)
     • Boosted and Bagged Decision trees (X. Sun, 02)
     • HMMs: (Wang & Seneff 00, Zhou et al 04, Hasegawa-Johnson
       et al, 04,…)
  – Can achieve good results with large sample sets
     • ~10K lab speech syllable samples -> >90% accuracy
  – Training data expensive to acquire
     • Time – pitch accent annotation takes tens of times real time
     • Money – requires skilled labelers
     • Limits investigation across domains, styles, etc.
  – Human language acquisition doesn’t use labels
            Strategy: Training
• Challenge:
  – Can we use the underlying acoustic structure of the
    language – through unlabeled examples – to reduce
    the need for expensive labeled training data?

• Exploit semi-supervised and unsupervised
  learning
  – Semi-supervised Laplacian SVM
  – K-means and asymmetric k-lines clustering
  – Substantially outperform baselines
     • Can approach supervised levels
    Data Collections I: English
• English: (Ostendorf et al, 95)
  – Boston University Radio News Corpus, f2b
  – Manually ToBI annotated, aligned, syllabified
  – Pitch accent aligned to syllables
     • 4-way: Unaccented, High, Downstepped High, Low
        – (Sun 02, Ross & Ostendorf 95)
     • Binary: Unaccented vs Accented
  Data Collections II: Mandarin
• Mandarin:
  – Lexical tones:
     • High, Mid-rising, Low, High falling, Neutral
  Data Collections III: Mandarin
• Mandarin Chinese:
  – Lab speech data: (Xu, 1999)
     • 5 syllable utterances: vary tone, focus position
         – In-focus, pre-focus, post-focus
  – TDT2 Voice of America Mandarin Broadcast News
     • Automatically force aligned to anchor scripts
         – Automatically segmented, pinyin pronunciation lexicon
         – Manually constructed pinyin-ARPABET mapping
         – CU Sonic – language porting


  – 4-way: High, Mid-rising, Low, High falling
        Local Feature Extraction
• Motivated by Pitch Target Approximation Model
       • Tone/pitch accent target exponentially approached
           – Linear target: height, slope (Xu et al, 99)

• Scalar features:
  –   Pitch, Intensity max, mean (Praat, speaker normalized)
  –   Pitch at 5 points across voiced region
  –   Duration
  –   Initial, final in phrase
• Slope:
  – Linear fit to last half of pitch contour
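The two scalar computations above – speaker normalization and the slope of a linear fit to the last half of the pitch contour – can be sketched as follows. This is a minimal illustration, not the deck's actual Praat-based pipeline; the function names and the z-score form of speaker normalization are assumptions.

```python
import numpy as np

def pitch_slope(contour):
    """Slope of a linear fit to the last half of a pitch contour."""
    half = np.asarray(contour, dtype=float)[len(contour) // 2:]
    slope, _intercept = np.polyfit(np.arange(len(half)), half, 1)
    return slope

def speaker_normalize(values, speaker_values):
    """Z-score pitch/intensity against all values from the same speaker
    (assumed normalization scheme, for illustration)."""
    mu, sd = np.mean(speaker_values), np.std(speaker_values)
    return (np.asarray(values, dtype=float) - mu) / sd
```

For a steadily rising contour the fitted slope recovers the per-sample rise; normalization places a speaker's mean pitch at zero, so "high for male may be low for female" comparisons become meaningful.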
             Context Features
• Local context:
  – Extended features
     • Pitch max, mean, adjacent points of adjacent syllable
  – Difference features wrt adjacent syllable
     • Difference between
         – Pitch max, mean, mid, slope
         – Intensity max, mean

• Phrasal context:
  – Compute collection average phrase slope
  – Compute scalar pitch values, adjusted for slope
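The phrasal-context step – estimate a phrase slope, then adjust scalar pitch values for it – might be sketched like this. A simplification: the deck computes a collection-average slope, while this illustration fits one phrase; names are hypothetical.

```python
import numpy as np

def phrase_slope(syllable_pitches):
    """Least-squares slope of the phrase contour over syllable positions."""
    t = np.arange(len(syllable_pitches))
    slope, _intercept = np.polyfit(t, syllable_pitches, 1)
    return slope

def compensate(syllable_pitches, slope):
    """Subtract the expected phrase-level declination from each syllable."""
    t = np.arange(len(syllable_pitches))
    return np.asarray(syllable_pitches, dtype=float) - slope * t
```

After compensation, a phrase that declines steadily is flattened, so a syllable's pitch reflects its tone/accent target rather than its position in the phrase.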
    Experimental Configuration
• English Pitch Accent:
  – Proportionally sampled: 1000 examples
     • 4-way and binary classification
         – Contextualization representation, preceding syllables


• Mandarin Tone:
  – Balanced tone sets: 400 examples
     • Vary data set difficulty: clean lab -> broadcast
     • 4 tone classification
         – Simple local pitch only features
             » Prior lab speech experiments effective with local features
      Semi-supervised Learning
• Approach:
   – Employ small amount of labeled data
   – Exploit information from additional – presumably more
     available – unlabeled data
      • Few prior examples: EM, co- & self-training (Ostendorf ’05)
• Classifier:
   – Laplacian SVM (Sindhwani,Belkin&Niyogi ’05)
   – Semi-supervised variant of SVM
      • Exploits unlabeled examples
          – RBF kernel, typically 6 nearest neighbors
                      Experiments
• Pitch accent recognition:
   – Binary classification: Unaccented/Accented
   – 1000 instances, proportionally sampled
       • Labeled training: 200 unacc, 100 acc
   – >80% accuracy (cf. 84% w/15x labeled SVM)

• Mandarin tone recognition:
   – 4-way classification: n(n-1)/2 binary classifiers
   – 400 instances: balanced; 160 labeled
       • Clean lab speech, in-focus: 94%
           – cf. 99% w/SVM, 1000s of training samples; 85% w/SVM, 160 training samples
       • Broadcast news: 70%
           – cf. <50% w/supervised SVM, 160 training samples; 74% w/4x training
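The 4-way tone task above is decomposed into n(n-1)/2 binary classifiers. A sketch of the standard one-vs-one voting scheme; `binary_predict` stands in for a trained pairwise classifier and is purely hypothetical.

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(x, classes, binary_predict):
    """Majority vote over n(n-1)/2 pairwise classifiers.
    binary_predict(x, a, b) is a hypothetical trained classifier that
    returns whichever of classes a, b it prefers for sample x."""
    votes = Counter(binary_predict(x, a, b)
                    for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]
```

With 4 tones this yields 6 pairwise decisions; the tone collecting the most votes wins.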
       Unsupervised Learning
• Question:
  – Can we identify the tone structure of a language from
    the acoustic space without training?
     • Analogous to language acquisition
• Significant recent research in unsupervised
  clustering
     • Established approaches: k-means
     • Spectral clustering: Eigenvector decomposition of affinity matrix
          – (Shi & Malik 2000, Fischer & Poland 2004, BNS 2004)
  – Little research for tone
     • Self-organizing maps (Gauthier et al,2005)
         – Tones identified in lab speech using f0 velocities
     Unsupervised Pitch Accent
• Pitch accent clustering:
  – 4-way distinction: 1000 samples, proportional
     • 2-16 clusters constructed
        – Assign most frequent class label to each cluster
     • Learner:
        – Asymmetric k-lines clustering (Fischer & Poland ’05):
           » Context-dependent kernel radii, non-spherical clusters
  – > 78% accuracy
  – Context effects:
     • Vector w/context vs vector with no context comparable
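The evaluation step above – assign the most frequent class label to each cluster, then score – can be sketched directly. A minimal illustration; the function name is an assumption.

```python
from collections import Counter

def cluster_accuracy(cluster_ids, labels):
    """Assign each cluster its most frequent class label, then score
    the fraction of samples whose cluster's label matches their own."""
    majority = {}
    for c in set(cluster_ids):
        members = [lab for ci, lab in zip(cluster_ids, labels) if ci == c]
        majority[c] = Counter(members).most_common(1)[0][0]
    hits = sum(majority[c] == lab for c, lab in zip(cluster_ids, labels))
    return hits / len(labels)
```

Note this scoring is generous as the number of clusters grows (each extra cluster can only help), which is why the 2-16 cluster comparison matters.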
        Contrasting Clustering
• Approaches
  – 3 Spectral approaches:
     • Asymmetric k-lines (Fischer & Poland 2004)
     • Symmetric k-lines (Fischer & Poland 2004)
     • Laplacian Eigenmaps (Belkin, Niyogi, & Sindhwani 2004)
         – Binary weights, k-lines clustering
  – K-means: Standard Euclidean distance
  – # of clusters: 2-16
• Best results: > 78%
  – 2 clusters: asymmetric k-lines; > 2 clusters: k-means
     • Larger # of clusters more similar
Contrasting Learners
               Tone Clustering
• Mandarin four tones:
     • 400 samples: balanced
     • 2-phase clustering: 2-3 clusters each
     • Asymmetric k-lines
  – Clean read speech:
     • In-focus syllables: 87% (cf. 99% supervised)
     • In-focus and pre-focus: 77% (cf. 93% supervised)
  – Broadcast news: 57% (cf. 74% supervised)
• Contrast:
  – K-means: In-focus syllables: 74.75%
     • Requires more clusters to reach asymm. k-lines level
                  Tone Structure

[Figure omitted]
• First phase of clustering splits high/rising from low/falling by slope
• Second phase splits by pitch height, or slope
                  Conclusions
• Exploiting unlabeled examples for tone
  and pitch accent
  – Semi- and Un-supervised approaches
     • Best cases approach supervised levels with less
       training
        – Leveraging both labeled & unlabeled examples best
        – Both spectral approaches and k-means effective
            » Contextual information less well-exploited than in
              supervised case
     • Exploit acoustic structure of tone and accent space
              Future Work
• Additional languages, tone inventories
  – Cantonese – 6 tones
  – Bantu family languages – truly rare data


• Language acquisition
  – Use of child directed speech as input
  – Determination of number of clusters
                       Thanks
• V. Sindhwani, M. Belkin, & P. Niyogi; I. Fischer & J.
  Poland; T. Joachims; C-C. Chang & C-J. Lin
• Dinoj Surendran, Siwei Wang, Yi Xu

• This work supported by NSF Grant #0414919

• http://people.cs.uchicago.edu/~levow/tai
Spectral Clustering in a Nutshell
• Basic spectral clustering
  – Build affinity matrix
  – Determine dominant eigenvectors and
    eigenvalues of the affinity matrix
  – Compute clustering based on them
• Approaches differ in:
  – Affinity matrix construction
     • Binary weights, conductivity, heat weights
  – Clustering: cut, k-means, k-lines
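The basic recipe above (affinity matrix, dominant eigenvectors, cluster on them) can be sketched with heat weights and k-means on the embedding. This is a generic illustration of spectral clustering, not the specific variants compared in the experiments; function names and the farthest-point k-means initialization are assumptions.

```python
import numpy as np

def spectral_embed(X, k, sigma=1.0):
    """Affinity matrix with heat (Gaussian) weights; the dominant k
    eigenvectors of its symmetrically normalized form serve as a
    low-dimensional embedding for clustering."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))           # heat-weight affinities
    d = W.sum(1)
    A = W / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
    _, vecs = np.linalg.eigh(A)                  # ascending eigenvalues
    return vecs[:, -k:]                          # dominant k eigenvectors

def kmeans(Y, k, iters=50):
    """Standard Lloyd iterations with deterministic farthest-point init."""
    centers = [Y[0]]
    for _ in range(k - 1):
        d = np.min([((Y - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(Y[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        assign = ((Y[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.stack([Y[assign == j].mean(0) if (assign == j).any()
                            else centers[j] for j in range(k)])
    return assign
```

On two well-separated blobs, the embedding makes within-group rows nearly identical, so even plain k-means recovers the groups.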
   K-Lines Clustering Algorithm
• Due to Fischer & Poland 2005
• 1. Initialize vectors m1...mK (e.g. randomly, or
  as the first K eigenvectors of the spectral data
  yi)
• 2. For j = 1...K:
  – Define Pj as the set of indices of all points yi that are
    closest to the line defined by mj, and create the
    matrix Mj = [yi], i in Pj, whose columns are the
    corresponding vectors yi
• 3. Compute the new value of every mj as the
  first eigenvector of Mj Mj^T
• 4. Repeat from 2 until the mj's do not change
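The steps above can be sketched as follows. A simplified illustration with a deterministic farthest-line initialization standing in for step 1's random/spectral choices; not Fischer & Poland's reference implementation.

```python
import numpy as np

def k_lines(Y, K, iters=20):
    """Sketch of k-lines clustering: each cluster is a line through the
    origin; rows of Y are (e.g. spectral-embedded) data points."""
    # Initialization (simplified): start from the first point's direction,
    # then repeatedly add the direction of the point farthest from all lines
    M = [Y[0] / np.linalg.norm(Y[0])]
    for _ in range(K - 1):
        dist = np.min([(Y ** 2).sum(1) - (Y @ m) ** 2 for m in M], axis=0)
        far = Y[dist.argmax()]
        M.append(far / np.linalg.norm(far))
    M = np.stack(M)
    for _ in range(iters):
        # Step 2: assign each point to its closest line
        # (squared distance to unit line m is |y|^2 - (y . m)^2)
        dist = (Y ** 2).sum(1, keepdims=True) - (Y @ M.T) ** 2
        assign = dist.argmin(1)
        # Step 3: new m_j = first eigenvector of Mj Mj^T
        for j in range(K):
            Pj = Y[assign == j]
            if len(Pj):
                _, vecs = np.linalg.eigh(Pj.T @ Pj)
                M[j] = vecs[:, -1]
    return assign
```

Because assignment uses the squared projection, the sign ambiguity of each eigenvector direction is harmless.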
       Asymmetric Clustering
• Replace the Gaussian kernel of fixed width with
  context-dependent kernel radii
  – (Fischer & Poland TR-IDSIA-12-04, p. 12)
  – where tau = 2d + 1 or 10; results largely insensitive to tau
              Laplacian SVM
• Manifold regularization framework
  – Hypothesize that the intrinsic (true) data lies on a low
    dimensional manifold
     • Ambient (observed) data lies in a possibly high
       dimensional space
     • Preserves locality:
        – Points close in ambient space should be close in intrinsic space
  – Use labeled and unlabeled data to warp
    function space
  – Run SVM on warped space
Laplacian SVM (Sindhwani)
• Input: l labeled and u unlabeled examples
• Output: estimated classifier function
• Algorithm:
  –   Construct adjacency graph. Compute Laplacian.
  –   Choose kernel K(x,y). Compute Gram matrix K.
  –   Compute
  –   And
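The first algorithm step – build an adjacency graph over labeled plus unlabeled points and form its Laplacian – can be sketched as below (the deck elsewhere notes binary weights and ~6 nearest neighbors). A minimal dense-matrix illustration; the function name is an assumption.

```python
import numpy as np

def knn_laplacian(X, k=6):
    """k-nearest-neighbor adjacency graph with binary weights over the
    data, and its unnormalized graph Laplacian L = D - W."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # no self-edges
    W = np.zeros_like(d2)
    for i, nbrs in enumerate(np.argsort(d2, axis=1)[:, :k]):
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)                # symmetrize the kNN graph
    return np.diag(W.sum(1)) - W
```

L is symmetric positive semi-definite with zero row sums; it is this matrix that warps the function space toward the data manifold before the SVM is run.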
      Current and Future Work
• Interactions of tone and intonation
  – Recognition of topic and turn boundaries
  – Effects of topic and turn cues on tone realization
• Child-directed speech & tone learning
• Support for Computer-assisted tone learning
• Structured sequence models for tone
  – Sub-syllable segmentation & modeling
• Feature assessment
  – Band energy and intensity in tone recognition
               Related Work
• Tonal coarticulation:
  – Xu & Sun,02; Xu 97;Shih & Kochanski 00
• English pitch accent
  – X. Sun, 02; Hasegawa-Johnson et al, 04;
    Ross & Ostendorf 95
• Lexical tone recognition
  – SVM recognition of Thai tone: Thubthong 01
  – Context-dependent tone models
     • Wang & Seneff 00, Zhou et al 04
 Pitch Target Approximation Model
• Pitch target:
   – Linear target:
                       T(t) = at + b
   – Exponentially approached:
                       y(t) = β·exp(−λt) + at + b
   – In practice, assume target well-approximated by
     mid-point (Sun, 02)
    Classification Experiments
• Classifier: Support Vector Machine
  – Linear kernel
  – Multiclass formulation
      • SVMlight (Joachims), LibSVM (Chang & Lin 01)
  – 4:1 training / test splits
• Experiments: Effects of
  – Context position: preceding, following, none, both
  – Context encoding: Extended/Difference
  – Context type: local, phrasal
              Results: Local Context
Context          Mandarin Tone   English Pitch Accent
Full             74.5%           81.3%
Extend PrePost   74.0%           80.7%
Extend Pre       74.0%           79.9%
Extend Post      70.5%           76.7%
Diffs PrePost    75.5%           80.7%
Diffs Pre        76.5%           79.5%
Diffs Post       69.0%           77.3%
Both Pre         76.5%           79.7%
Both Post        71.5%           77.6%
No context       68.5%           75.9%
     Discussion: Local Context
• Any context information improves over none

  – Preceding context information consistently improves
    over none or following context information
     • English: Generally more context features are better
     • Mandarin: Following context can degrade
  – Little difference in encoding (Extend vs Diffs)

• Consistent with phonological analysis (Xu) that
  carryover coarticulation is greater than
  anticipatory
          Results & Discussion:
            Phrasal Context

 Phrase Context   Mandarin Tone   English Pitch Accent
 Phrase           75.5%           81.3%
 No Phrase        72.0%           79.9%


• Phrase contour compensation enhances recognition
  – Simple strategy
  – Non-linear slope compensation may improve further
            Context: Summary
• Employ common acoustic representation
  – Tone (Mandarin), pitch accent (English)
• SVM classifiers - linear kernel: 76%, 81%
• Local context effects:
  – Up to > 20% relative reduction in error
  – Preceding context greatest contribution
     • Carryover vs anticipatory
• Phrasal context effects:
  – Compensation for phrasal contour improves recognition
         Aside: More Tones
• Cantonese:
  – CUSENT corpus of read broadcast news text
  – Same feature extraction & representation
  – 6 tones:
       – High level, high rise, mid level, low fall, low rise, low level
  – SVM classification:
    • Linear kernel: 64%, Gaussian kernel: 68%
        – Tones 3 & 6: mutually indistinguishable (50% pairwise)
            » Human levels: no context: 50%; with context: 68%
     • Augment with syllable phone sequence
        – 86% accuracy: for 90% of syllables w/tone 3 or 6, one
          tone dominates
 Aside: Voice Quality & Energy
     • By Dinoj Surendran
• Assess local voice quality and energy features
  for tone
  – Not typically associated with Mandarin
• Considered:
  – VQ: NAQ, AQ, etc; Spectral balance; Spectral Tilt;
    Band energy
• Useful: Band energy significantly improves
  – Esp. neutral tone
     • Supports identification of unstressed syllables
         – Spectral balance predicts stress in Dutch
                          Roadmap
• Challenges for Tone and Pitch Accent
   – Contextual effects
   – Training demands
• Modeling Context for Tone and Pitch Accent
   – Data collections & processing
   – Integrating context
   – Context in Recognition
• Reducing Training demands
   – Data collections & structure
   – Semi-supervised learning
   – Unsupervised clustering
• Conclusion
            Strategy: Context
• Exploit contextual information
  – Features from adjacent syllables
     • Height, shape: direct, relative


  – Compensate for phrase contour

  – Analyze impact of
     • Context position, context encoding, context type
     • > 20% relative improvement over no context

								