

Multimodal Analysis of Expressive Human Communication:
Speech and gesture interplay

                           Ph.D. Dissertation Proposal
                           Carlos Busso



                           Adviser: Dr. Shrikanth S. Narayanan


Nov 22nd, 2006
                       Outline

       Introduction
       Analysis
       Recognition
       Synthesis
       Conclusions




Nov 22nd, 2006
                                          Introduction
                                          Motivation
      • Gestures and speech are intricately coordinated to express messages
      • Affective and articulatory goals jointly modulate these channels in a non-trivial manner
      • A joint analysis of these modalities is needed to better understand
        expressive human communication
      • Goals:
            • Understand how to model the spatial-temporal modulation of these
              communicative goals in gestures and speech
            • Use these models to improve human-machine interfaces
                 • Computers could give specific and appropriate help to users
                 • Realistic facial animation could be improved by learning human-like gestures


                 This proposal focuses on the analysis, recognition and synthesis of
                 expressive human communication under a multimodal framework

Nov 22nd, 2006                                   01/40                                            Introduction
                                     Introduction
                                    Open challenges
      • How to model the spatio-temporal emotional modulation
            • If audio-visual models do not consider how the coupling between gestures and speech changes in the presence of emotion, they will not accurately reflect the manner in which humans communicate
      • Which interdependencies between the various communicative channels appear in conveying verbal and non-verbal messages?
            • Interplay between communicative, affective and social goals
      • How to infer meta-information from speakers (emotion, engagement)?
      • How are gestures used to respond to the feedback given by the listener?
      • How are the verbal and non-verbal messages conveyed by one speaker perceived by others?
      • How to use models to design and enhance applications that will help and engage users?


Nov 22nd, 2006                             01/40                                  Introduction
                         Introduction

     Proposed Approach




Nov 22nd, 2006              01/40       Introduction
                                              Analysis

       Introduction
       Analysis
            • Facial Gesture/speech Interrelation
            • Affective/Linguistic Interplay
       Recognition
       Synthesis
       Conclusions



         C. Busso and S.S. Narayanan. Interrelation between Speech and Facial Gestures in Emotional
         Utterances. Under submission to IEEE Transactions on Audio, Speech and Language Processing.

Nov 22nd, 2006                                      01/40                                              Analysis
                 Facial gestures/speech interrelation
                                  Motivation
      • Gestures and speech interact and cooperate to convey a desired
        message [McNeill,1992], [Vatikiotis,1996], [Cassell,1994]
      • Notable among communicative components are the linguistic,
        emotional and idiosyncratic aspects of human communication




      • Both gestures and speech are affected by these modulations
      • It is important to understand the interrelation between facial gestures
        and speech in terms of all these aspects of human communication


Nov 22nd, 2006                          01/40                                     Analysis
                  Facial gestures/speech interrelation
                                                    Goals
      • To focus on the linguistic and emotional aspects of human communication
            • To investigate the relation between certain gestures and acoustic features
      • To propose recommendations for synthesis and recognition applications

                                            Related work
      • Relationship between gestures and speech as conversational functions
           [Ekman,1979], [Cassell,1999], [Valbonesi,2002], [Graf,2002], [Granstrom,2005]

      • Relationship between gestures and speech as results of articulation
           [Vatikiotis,1996], [Yehia,1998], [Jiang,2002], [Barker,1999]

      • Relationship between gestures and speech influenced by emotions
           [Nordstrand,2003], [Caldognetto,2003], [Bevacqua,2004] [Lee, 2005]




Nov 22nd, 2006                                                01/40                        Analysis
                  Facial gestures/speech interrelation
                 Proposed Framework: Data-driven approach
      • Pearson's correlation is used to quantify the relationship between speech and facial features
      • Affine minimum mean-square error (MMSE) estimation is used to estimate the facial gestures from speech (both are sketched below)

      • Sentence-level mapping
      • Global-level mapping
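
The sketch below illustrates one possible Python realization of this framework (not taken from the original slides): an affine least-squares (MMSE) mapping from acoustic to facial features, followed by Pearson's correlation between the original and estimated facial trajectories. The array shapes, function names and synthetic data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

def affine_mmse_fit(X, Y):
    """Affine MMSE (least-squares) mapping from acoustic features X (frames x d_a)
    to facial features Y (frames x d_f): Y ~= [X 1] T."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])      # append a bias column
    T, *_ = np.linalg.lstsq(Xa, Y, rcond=None)          # T is (d_a + 1) x d_f
    return T

def correlation_of_estimate(X, Y, T):
    """Estimate facial features from speech and return Pearson's r per facial dimension."""
    Y_hat = np.hstack([X, np.ones((X.shape[0], 1))]) @ T
    return np.array([pearsonr(Y[:, j], Y_hat[:, j])[0] for j in range(Y.shape[1])])

# Illustrative usage: sentence-level mapping fits one T per sentence; global-level
# mapping fits a single T on all sentences pooled together.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 16))                       # e.g. MFCC frames (assumed)
Y = X @ rng.standard_normal((16, 3)) + 0.1 * rng.standard_normal((500, 3))
T = affine_mmse_fit(X, Y)
print(correlation_of_estimate(X, Y, T))
```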

Nov 22nd, 2006                                01/40                                        Analysis
                 Facial gestures/speech interrelation
                             Audio-Visual Database
      • Four emotions are targeted
            •    Sadness
            •    Anger
            •    Happiness
            •    Neutral state
      • 102 Markers to track facial
        expressions
      • Single subject
      • Phoneme balanced corpus
        (258 sentences)
      • Facial motion and speech are
        simultaneously captured

Nov 22nd, 2006                         01/40            Analysis
                 Facial gestures/speech interrelation
                           Facial and acoustic features
      • Speech
            • Prosodic features (source of the speech): pitch, energy and their first and second derivatives
            • MFCCs (vocal tract)
      • Facial features
            • Head motion
            • Eyebrows
            • Lips
            • Markers grouped into upper, middle and lower face regions


Nov 22nd, 2006                                01/40                         Analysis
                 Facial gestures/speech interrelation
                  Correlation results : Sentence-level
• High levels of correlation
• Correlation levels are higher when MFCC features are used
• Clear emotional effects
      • Correlation levels are equal to or greater than in the neutral case
      • Happiness and anger are similar

[Bar plots: sentence-level correlation per emotion (Neutral, Sad, Happy, Angry) for prosodic and MFCC features]
Nov 22nd, 2006                    01/40                                 Analysis
                       Facial gestures/speech interrelation
                         Correlation results : Global-level
• Correlation levels decrease compared to the sentence-level mapping
      • The link between facial gestures and speech varies from sentence to sentence
• Correlation levels are higher when MFCC features are used
• The lower face region presents the highest correlation
• Clear emotional effects
      • Correlation levels for neutral speech are higher than for the emotional categories

[Bar plots: global-level correlation per emotion (Neutral, Sad, Happy, Angry) for prosodic and MFCC features]

      Nov 22nd, 2006                          01/40                                   Analysis
                 Facial gestures/speech interrelation
                             Mapping parameters
      • Goal: study the structure of the mapping parameters



      • Approach: Principal Component analysis (PCA)



            • For each facial feature, find the number of components P such that they cover 90% of the variance (sketched below)
            • Emotion-dependent vs. emotion-independent analysis
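
A minimal PCA sketch in Python (an assumed implementation, not the slides' own code) for finding the number of components P that covers 90% of the variance of the per-sentence mapping parameters; stacking the flattened T matrices is an illustrative assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(T_list, threshold=0.90):
    """T_list: per-sentence mapping parameters, each flattened to one vector.
    Returns P, the number of principal components covering `threshold` of the
    variance, and the fraction of eigenvectors that P represents."""
    M = np.vstack([T.ravel() for T in T_list])        # sentences x parameter dimension
    pca = PCA().fit(M)
    cum = np.cumsum(pca.explained_variance_ratio_)
    P = int(np.searchsorted(cum, threshold) + 1)
    return P, P / len(pca.explained_variance_ratio_)
```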


Nov 22nd, 2006                         01/40                                  Analysis
                 Facial gestures/speech interrelation
                             Mapping parameter results




      • Fraction of eigenvectors needed to span 90% or more of the variance of the mapping parameter T
      • Parameters cluster in a small subspace
      • Prosodic-based parameters cluster in a smaller subspace than MFCC-based parameters
      • Further evidence of an emotion-dependent influence on the relationship between facial gestures and speech




Nov 22nd, 2006                              01/40                                     Analysis
                     Facial gestures/speech interrelation
                            Mapping parameter results
• Correlation levels as a function of P
• The slope for prosodic-based features is lower than for MFCCs
      • Smaller dimension of the cluster
• The slope depends on the facial region
      • Different levels of coupling

[Plots: correlation level vs. P for prosodic and MFCC features, for the upper, middle and lower face regions]
    Nov 22nd, 2006                  01/40                             Analysis
                    Affective/Linguistic Interplay

       Introduction
       Analysis
            • Facial Gesture/speech Interrelation
            • Affective/Linguistic Interplay
       Recognition
       Synthesis
       Conclusions



        C. Busso and S.S. Narayanan. Interplay between linguistic and affective goals in facial expression
        during emotional utterances. To appear in International Seminar on Speech Production (ISSP 2006)

Nov 22nd, 2006                                        01/40                                                  Analysis
                     Linguistic/affective interplay
                                 Motivation
      • Linguistic and emotional goals jointly modulate speech and gestures to
        convey the desired messages
      • Articulatory and affective goals co-occur during normal human
        interaction, sharing the same channels
      • Some control needs to buffer, prioritize and execute these communicative goals in a coherent manner

                                 Hypotheses
      • Linguistic and affective goals interplay interchangeably as primary and
        secondary controls
      • During speech, affective goals are displayed under articulatory constraints
            • Some facial areas have more degrees of freedom to display non-verbal cues


Nov 22nd, 2006                             01/40                                    Analysis
                           Linguistic/affective interplay
                                              Previous results
      • Low vowels (/a/), with a less restrictive tongue position, show greater emotional coloring than high vowels (/i/)
           [Yildirim, 2004] [Lee,2005] [Lee, 2004]

      • Focus of this analysis is on the interplay in facial expressions

                                                     Approach
      • Compare facial expressions of neutral and emotional
        utterances with same semantic content
            • Correlation
            • Euclidean Distance
      • The database is a subset of the MOCAP data

Nov 22nd, 2006                                          01/40       Analysis
                    Linguistic/affective interplay
                              Facial activation analysis
    • Measure of facial motion

[Plots: facial activeness per region for Neutral, Sad, Happy and Angry]

    • The lower face area has the highest activeness levels
          • Articulatory processes play a crucial role
    • Emotional modulation
          • Happy and angry are more active
          • Sadness is less active than neutral
          • Activeness in the upper face region increases more than in other regions
Nov 22nd, 2006                          01/40                        Analysis
                 Linguistic/affective interplay
                    Neutral vs. emotional analysis
      • Goal: Compare in detail the facial expressions displayed during neutral and emotional utterances with similar semantic content
      • Dynamic Time Warping (DTW) is used to align the utterances (a DTW sketch follows below)
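
A plain dynamic-programming DTW sketch in Python (an assumption about how the alignment step could be implemented, not the slides' own code): it aligns two feature sequences under a Euclidean local cost and returns the warping path, which can then be used to align the corresponding facial features frame by frame.

```python
import numpy as np

def dtw_path(A, B):
    """Align sequences A (n x d) and B (m x d) with dynamic time warping.
    Returns the optimal warping path as a list of (i, j) frame pairs."""
    n, m = len(A), len(B)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])     # local Euclidean cost
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from (n, m) to (1, 1)
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```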




Nov 22nd, 2006                      01/40                             Analysis
                         Linguistic/affective interplay
                 Correlation analysis : neutral vs. emotional
• Higher correlation implies stronger articulatory constraints
• The lower facial region has the highest correlation levels
      • More constrained
• The upper facial region has the lowest correlation levels
      • It can communicate non-verbal information regardless of the linguistic content

[Bar plots (median results): correlation for the Neutral-Sad, Neutral-Happy and Neutral-Angry pairs]

Nov 22nd, 2006                                    01/40                               Analysis
                 Linguistic/affective interplay
           Euclidean distance analysis : neutral vs. emotional
 • After scaling the facial features, the Euclidean distance was estimated
 • High values indicate that facial features are more independent of the articulation
 • Similar results to the correlation analysis
      • The upper face region is less constrained by articulatory processes

[Bar plots (median results): Euclidean distance for the Neutral-Sad, Neutral-Happy and Neutral-Angry pairs]
Nov 22nd, 2006                      01/40                                    Analysis
                                  Analysis
                     Remarks from analysis section
      • Facial gestures and speech are strongly interrelated

      • The correlation levels present inter-emotion differences

      • There is an emotion-dependent structure in the mapping parameter
        that may be learned
         • The prosodic-based mapping parameter set is grouped in a small
            cluster

      • Facial areas and speech are coupled at different resolutions



Nov 22nd, 2006                       01/40                             Analysis
                                   Analysis
                     Remarks from analysis section
      • During speech, facial activeness is mainly driven by articulation

      • However, linguistic and affective goals co-occur during active
        speech.

      • There is an interplay between linguistic and affective goals in
        facial expression

      • Forehead and cheeks have more degrees of freedom to convey non-verbal messages

      • The lower face region is more constrained by the articulatory
        process
Nov 22nd, 2006                       01/40                                Analysis
                                           Recognition

       Introduction
       Analysis
       Recognition
            • Emotion recognition
            • Engagement recognition
       Synthesis
       Conclusions



 C. Busso, Z. Deng, S. Yildirim, M. Bulut, C.M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan,
 “Analysis of emotion recognition using facial expressions, speech and multimodal information,” in Sixth International
 Conference on Multimodal Interfaces ICMI 2004, State College, PA, 2004, pp. 205–211, ACM Press.

Nov 22nd, 2006                                         01/40                                              Recognition
                 Multimodal Emotion Recognition
                                      Motivation
            • Emotions are an important element of human-human interaction
            • Design improved human-machine interfaces
            • Give specific and appropriate help to users

                                     Hypotheses
            • Modalities give complementary information
            • Some emotions are better recognized in a particular domain
            • A multimodal approach provides better performance and robustness

                                     Related work
            • Decision-level fusion systems (rule-based system) [Chen,1998] [DeSilva,2000]
                 [Yoshitomi,2000]

            • Feature-level fusion systems [Chen,1998_2] [Huang,1998]
Nov 22nd, 2006                              01/40                              Recognition
                 Multimodal Emotion Recognition
                            Proposed work
      • Analyze the strengths and limitations of unimodal systems to recognize emotional states
      • Study the performance of a multimodal system
      • The MOCAP database is used
      • Sentence-level features (e.g. mean, variance, range, etc.)
            • Speech: prosodic features
            • Facial expression: upper and middle face areas
            • Sequential backward feature selection
      • Support vector machine classifier (SVC)
      • Decision- and feature-level integration (a fusion sketch follows below)
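
The sketch below shows the general shape of such a setup with scikit-learn, using synthetic sentence-level features; the feature dimensions and data are illustrative assumptions, and it is only a rough stand-in for the system described on the slide.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Hypothetical sentence-level statistics for each modality (dimensions are assumptions).
rng = np.random.default_rng(0)
n = 200
X_speech = rng.standard_normal((n, 10))    # e.g. pitch/energy statistics per sentence
X_face = rng.standard_normal((n, 20))      # e.g. upper/middle face marker statistics
y = rng.integers(0, 4, n)                  # anger, sadness, happiness, neutral

# Feature-level fusion: concatenate the modalities before classification.
X_fused = np.hstack([X_speech, X_face])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
print("fused accuracy:", cross_val_score(clf, X_fused, y, cv=5).mean())

# Decision-level fusion would instead train one SVC(probability=True) per modality
# and combine the per-class posteriors (e.g. with a product rule).
```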


Nov 22nd, 2006                      01/40                            Recognition
                 Multimodal Emotion Recognition
                              Emotion recognition results
      • From speech
            • Average ~70%
            • Confusion between sadness and neutral
            • Confusion between happiness and anger

                        Anger   Sadness   Happiness   Neutral
            Anger        0.68     0.05      0.21       0.05
            Sadness      0.07     0.64      0.06       0.22
            Happiness    0.19     0.04      0.70       0.08
            Neutral      0.04     0.14      0.01       0.81

      • From facial expressions
            • Average ~85%
            • Confusion between anger and sadness
            • Confusion between neutral and happiness
            • Confusion between sadness and neutral

                        Anger   Sadness   Happiness   Neutral
            Anger        0.79     0.18      0.00       0.03
            Sadness      0.06     0.81      0.00       0.13
            Happiness    0.00     0.00      1.00       0.00
            Neutral      0.00     0.04      0.15       0.81

      • Multimodal system (feature-level fusion)
            • Average ~90%
            • Confusion between neutral and sadness
            • Other pairs are correctly separated

                        Anger   Sadness   Happiness   Neutral
            Anger        0.95     0.00      0.03       0.03
            Sadness      0.00     0.79      0.03       0.18
            Happiness    0.02     0.00      0.91       0.08
            Neutral      0.01     0.05      0.02       0.92


Nov 22nd, 2006                                01/40                                   Recognition
                 Inferring participants' engagement

        Introduction
        Analysis
        Recognition
             • Emotion recognition
             • Engagement recognition
        Synthesis
        Conclusions


C. Busso, S. Hernanz, C.W. Chu, S. Kwon, S. Lee, P. G. Georgiou, I. Cohen, S. Narayanan. Smart Room: Participant and
Speaker Localization and Identification. In Proc. ICASSP, Philadelphia, PA, March 2005.
C. Busso, P.G. Georgiou and S.S. Narayanan. Real-time monitoring of participants’ interaction in a meeting using audio-
visual sensors. Under submission to International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007)

Nov 22nd, 2006                                        01/40                                             Recognition
                 Inferring participants' engagement
                                   Motivation
      • At small group level, the strategies of one participant are affected by
        the strategies of other participants
      • Automatic annotations of human interaction will provide better tools
        for analyzing teamwork and collaboration strategies
      • Examples of applications in which monitoring human interaction is very useful include summarization, retrieval and classification of meetings

                                     Goals
      • Infer meta-information from participants in a multiperson meeting
      • To monitor and track the behaviors, strategies and engagements of the
        participants
      • Infer interaction flow of the discussion


Nov 22nd, 2006                         01/40                              Recognition
                 Inferring participants' engagement
                                    Approach
   • Extract high-level features from automatic annotations of
     speaker activity (e.g. number and average duration of each turn)
   • Use an intelligent environment equipped with audio-visual
     sensors to get the annotations

                                    Related work
   • Intelligent environment [Checka,2004] [Gatica-Perez,2003] [Pingali,1999]
   • Monitoring human interaction [McCowan,2005] [Banerjee,2004] [Zhang,2006] [Basu,2001]




Nov 22nd, 2006                             01/40                                   Recognition
                 Inferring participants' engagement
                       Smart Room
• Visual
     • 4 firewire CCD cameras
      • 360° omnidirectional camera


• Audio
     • 16-channel microphone
       array
     • Directional microphone
       (SID)



Nov 22nd, 2006                  01/40                 Recognition
                 Inferring participants' engagement
   Localization and identification

  • After fusing the audio-visual streams of data, the system gives
        • Participants' locations
        • Sitting arrangement
        • Speaker identity
        • Speaker activity

  • Testing (~85% accuracy)
        • Three 20-minute meetings (4 participants)
        • Casual conversation with
          interruptions and overlap


Nov 22nd, 2006                        01/40           Recognition
                 Inferring participants' engagement
                              Participant interaction
      • High-level features per participant (see the sketch below)
            • Number of turns
            • Average duration of turns
            • Amount of time as active speaker
            • Transition matrix depicting turn-taking between participants
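
One possible Python implementation of these high-level features from a per-frame speaker-activity annotation (function name, frame rate and the silence label -1 are assumptions, not part of the original system):

```python
import numpy as np

def turn_statistics(active, frame_sec=0.1, n_participants=4):
    """active: per-frame ID of the active speaker (0..n_participants-1, -1 = silence).
    Returns per-participant turn counts, average turn duration (s), speaking time (s),
    and a turn-taking transition matrix."""
    turns = []                                     # [speaker, n_frames] per turn
    for s in active:
        if s >= 0 and turns and turns[-1][0] == s:
            turns[-1][1] += 1
        elif s >= 0:
            turns.append([s, 1])
    counts = np.zeros(n_participants)
    frames = np.zeros(n_participants)
    trans = np.zeros((n_participants, n_participants))
    for k, (s, f) in enumerate(turns):
        counts[s] += 1
        frames[s] += f
        if k > 0:
            trans[turns[k - 1][0], s] += 1         # who took the floor after whom
    avg_dur = np.divide(frames * frame_sec, counts,
                        out=np.zeros_like(counts), where=counts > 0)
    return counts, avg_dur, frames * frame_sec, trans
```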


      • Evaluation
            • Hand-based annotation of speaker activity
            • Results described here correspond to one of the meetings




Nov 22nd, 2006                              01/40                               Recognition
                 Inferring participants' engagement
                      Results: Participant interaction
• Automatic annotations are a good approximation to the manual ones
• The distribution of time used as active speaker correlates with dominance [Rienks,2006]
      • Subject 1 spoke more than 65% of the time
• Discussions are characterized by many short turns to show agreement (e.g. "uh-huh") and longer turns taken by mediators [Burger,2002]
      • Subject 1 was leading the discussion
      • Subject 3 was only an active listener

[Plots: ground-truth vs. estimated turn durations, speaking-time distributions and number of turns]

Nov 22nd, 2006                                01/40                                 Recognition
                 Inferring participants' engagement
                      Results: Participant interaction
• The transition matrix gives the interaction flow and turn-taking patterns
• Claim: transitions between speakers approximate who was being addressed
      • To evaluate this hypothesis, the addressee was manually annotated and compared with the transition matrix
      • The transition matrix provides a good first approximation to identifying the interlocutor dynamics
• The discussion was mainly between subjects 1 and 3

[Matrices: ground-truth vs. estimated speaker transitions]


Nov 22nd, 2006                         01/40                                 Recognition
                 Inferring participants' engagement
                      Results: Participant interaction
• These high-level features can be estimated in small windows over time to infer participants' engagement
      • Subject 4 was not engaged
      • Subjects 1, 2 and 3 were engaged

[Plot: dynamic behavior of speakers' activeness over time]




Nov 22nd, 2006
                                Recognition
                       Remarks from recognition section

      • Multimodal approaches to infer meta-information from speakers give better performance than unimodal systems

      • When acoustic and facial features are fused, the performance and the
        robustness of the emotion recognition system improve measurably

      • In small group meetings, it is possible to accurately estimate in real-
        time not only the flow of the interaction, but also how dominant and
        engaged each participant was during the discussion




Nov 22nd, 2006
                                               Synthesis

         Introduction
         Analysis
         Recognition
         Synthesis
              • Head motion synthesis
         Conclusions
         Future Work


C. Busso, Z. Deng, U. Neumann, and S.S. Narayanan, “Natural head motion synthesis driven by acoustic prosodic
features,” Computer Animation and Virtual Worlds, vol. 16, no. 3-4, pp. 283–290, July 2005.
C. Busso, Z. Deng, M. Grimm, U. Neumann and S. Narayanan. Rigid Head Motion in Expressive Speech Animation:
Analysis and Synthesis. IEEE Transactions on Audio, Speech and Language Processing, March 2007
 Nov 22nd, 2006                                       01/40                                               Synthesis
                 Natural Head Motion Synthesis
                         Motivation
      • The mapping between facial gestures and speech can be learned using a more sophisticated framework
      • A useful and practical application is avatars driven by speech
      • Engaging human-computer interfaces and applications such as animated feature films have motivated realistic avatars
      • Focus of this section: head motion



Nov 22nd, 2006                  01/40                         Synthesis
                   Natural Head Motion Synthesis
                            Why head motion?
      • It has received little attention compared to other
        gestures
      • Important for acknowledging active listening
      • Improves acoustic perception [Munhall,2004]
      • Distinguish interrogative and declarative statements
           [Munhall,2004]

      • Recognize speaker identity [Hill,2001]
      • Segment spoken content [Graf,2002]



Nov 22nd, 2006                     01/40                       Synthesis
                 Natural Head Motion Synthesis
                                Hypotheses
      • Head motion is important for human-like facial
        animation
      • Head motion changes the perception of the emotion
      • Head motion can be synthesized from acoustic features

                                Related Work
      •    Rule-based systems [Pelachaud,1994]
      •    Gaussian Mixtures Model [Costa,2001]
      •    Specific head motion (e.g. 'nod') [Cassell, 1994] [Graf, 2002]
      •    Example-based system [Deng, 2004], [Chuang, 2004]
Nov 22nd, 2006                            01/40                             Synthesis
                 Natural Head Motion Synthesis

                      Proposed Framework
      • Hidden Markov Models are trained to capture the temporal
        relation between the prosodic features and the head motion
        sequence
      • Vector quantization is used to produce a discrete representation
        of head poses
      • Two-step smoothing techniques are used, based on a first-order Markov model and spherical cubic interpolation
      • Emotion perception is studied by rendering deliberate
        mismatches between the emotional speech and the emotional
        head motion sequences



Nov 22nd, 2006                       01/40                             Synthesis
                 Natural Head Motion Synthesis

                        Database and features

      • Same audio-visual database
      • Acoustic Features ~ Prosody (6D)
            • Pitch
            • RMS energy
            • First and second derivative
      • Head motion ~ head rotation (3DOF)
            • Reduce the number of HMMs
            • For close-up views of the face, translation effects are less important



Nov 22nd, 2006                         01/40                               Synthesis
                  Natural Head Motion Synthesis

                 Head motion analysis in expressive speech

  • Prosodic features are coupled with head
    motion (emotion dependent)
  • Emotional patterns in activeness, range
    and velocity
   • Discriminant analysis ~ 65.5%
  • Emotion-dependent models are needed




Nov 22nd, 2006                     01/40                     Synthesis
                    Natural Head Motion Synthesis

                  Head motion analysis in expressive speech
      • Head motions are modeled with HMMs
            • HMMs provide a suitable and natural framework to model the
              temporal relation between prosodic features and head motions
            • HMMs will be used as sequence generator (head motion sequence)

      • Discrete head pose representation
            • The 3D head motion data is quantized using K-dimensional vector quantization:
              HeadPose(α, β, γ) ∈ {V_i}, i = 1, ..., K
            • Each cluster is characterized by its mean U_i and covariance Σ_i
            (a vector-quantization sketch follows below)
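
A k-means version of this quantization step is sketched below in Python (the slides do not spell out the clustering algorithm, so KMeans is an assumption); it returns the per-frame cluster labels together with each cluster's mean U_i and covariance Σ_i.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_head_poses(euler_angles, K=16):
    """euler_angles: frames x 3 array of (alpha, beta, gamma).
    Returns per-frame cluster labels plus each cluster's mean and covariance."""
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(euler_angles)
    labels = km.labels_
    means = km.cluster_centers_                                  # U_i
    covs = [np.cov(euler_angles[labels == i].T) for i in range(K)]   # Sigma_i
    return labels, means, covs
```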

Nov 22nd, 2006                                  01/40                     Synthesis
                 Natural Head Motion Synthesis

                   Learning natural head motion
                         P(V_i | O) = c · P(O | V_i) · P(V_i)

      • The observations, O, are the acoustic prosodic features
      • One HMM is trained for each head-pose cluster V_i

      • Likelihood distribution P(O | V_i)
            • It is modeled as a Markov process
            • A mixture of M Gaussian densities is used to model the pdf of the observations
            • Standard algorithms are used to train the parameters (forward-backward, Baum-Welch re-estimation); a training sketch follows below
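
A rough training sketch using the third-party hmmlearn package (an assumed toolkit, not necessarily the one used in this work): it fits one GMM-HMM per head-pose cluster on the prosodic segments assigned to that cluster and scores a new segment against each model. The left-to-right topology from the configuration slide would additionally require constraining the transition matrix, which is omitted here.

```python
import numpy as np
from hmmlearn import hmm   # assumed third-party package

def train_cluster_hmms(prosody_segments, labels, K=16, S=2, M=2):
    """Fit one GMM-HMM per head-pose cluster V_i on the prosodic segments
    (each a frames x 6 array) assigned to that cluster; `labels` holds the
    per-segment cluster IDs. Topology constraints are omitted for brevity."""
    models = {}
    for i in range(K):
        segs = [seg for seg, lab in zip(prosody_segments, labels) if lab == i]
        if not segs:
            continue
        m = hmm.GMMHMM(n_components=S, n_mix=M, covariance_type="diag", n_iter=50)
        m.fit(np.vstack(segs), lengths=[len(s) for s in segs])   # Baum-Welch re-estimation
        models[i] = m
    return models

def most_likely_cluster(models, prosody_frames, log_prior):
    """argmax_i [ log P(O | V_i) + log P(V_i) ] for a new observation segment O."""
    scores = {i: m.score(prosody_frames) + log_prior[i] for i, m in models.items()}
    return max(scores, key=scores.get)
```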
Nov 22nd, 2006                           01/40                                Synthesis
                 Natural Head Motion Synthesis

                   Learning natural head motion

                         P(V_i | O) = c · P(O | V_i) · P(V_i)

      • Prior distribution P(V_i)
            • It is built as a bigram model learned from the data (1st smoothing step)
            • Transitions between clusters that do not appear in the training data are penalized
            • This smoothing constraint is imposed in the decoding step (see the sketch below)
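
A small sketch of how such a penalized bigram prior could be computed (the penalty value and function name are assumptions):

```python
import numpy as np

def bigram_prior(cluster_sequences, K=16, penalty=1e-4):
    """Bigram transition model over head-pose clusters, learned from training
    sequences. Unseen transitions get a small penalized probability instead of zero."""
    counts = np.full((K, K), penalty)              # penalize unseen transitions
    for seq in cluster_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)
```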




Nov 22nd, 2006                            01/40                                  Synthesis
                 Natural Head Motion Synthesis

                    Synthesis of natural head motion
   •    For a novel sentence, the HMMs generate the most likely head motion sequence
   •    Interpolation is used to smooth the cluster transition region (2nd smoothing step)




Nov 22nd, 2006
                 Natural Head Motion Synthesis
   • 2nd smoothing constraint
         • Spherical cubic interpolation
         • Removes the breaks at the cluster transitions of the new sequences
         • The interpolation takes place on the quaternion unit sphere [Shoemake, 1985]




Nov 22nd, 2006                          01/40                                 Synthesis
                   Natural Head Motion Synthesis

      • Configuration
            • Left-to-right topology
            • K=16 (number of clusters)
            • S=2 (number of states)
            • M=2 (number of mixtures)
            • 80% training set, 20% test set
      • A set of HMMs was built for each emotion
      • From Euler angles to talking avatars
            • The Euler angles are directly applied to the control parameters of the face model
            • Face is synthesized with techniques given in [Deng,2004], [Deng,2005], [Deng,2005_2],
                 [Deng,2006]




Nov 22nd, 2006                                    01/40                                     Synthesis
                 Natural Head Motion Synthesis

                               Results
      • Canonical correlation between the original and synthesized sequences



                  [Plots: canonical correlation for Neutral and Happiness]




Nov 22nd, 2006                     01/40                             Synthesis
                   Natural Head Motion Synthesis

                                     Results
                  [Plots: canonical correlation for Sadness and Anger]




                  [Plot: subjective naturalness assessment]




Nov 22nd, 2006                          01/40                   Synthesis
                 Natural Head Motion Synthesis

                         Emotional Perception
      • Approach: Render animations with deliberate
        mismatches between the emotional content of the
        speech and the emotional pattern of head motion
      • Dynamic Time Warping for alignment
      • 17 human subjects assessed the videos
      • Evaluation is performed in the primitive-attribute domain (valence, activation and dominance)




Nov 22nd, 2006                 01/40                       Synthesis
                 Natural Head Motion Synthesis

                 Results: Valence (Positive-Negative)
      • Happy head motion makes the attitude of the animation more positive (statistically significant, s.s.)
      • Angry head motion makes the attitude of the animation more negative (not statistically significant, n.s.s.)

      [Bar plot: valence ratings on a 1-5 scale]




Nov 22nd, 2006                   01/40                              Synthesis
                     Natural Head Motion Synthesis

                     Results: Activation (Excited-Calm)
    • Angry head motion makes the attitude of the animation more excited than happy head motion (s.s.)
    • Happy speech with sad head motion is perceived as more excited (s.s.)
          • Either an artifact of the approach
          • Or a true effect generated by the combination of modalities (McGurk effect)

    [Bar plot: activation ratings on a 1-5 scale]

Nov 22nd, 2006                            01/40           Synthesis
                 Natural Head Motion Synthesis

                 Results: Dominance (Weak-strong)

      • Head motion does not modify this attribute
      • Neutral speech with happy head motion is perceived as stronger (n.s.s.)
      • Happy speech synthesized with angry head motion is perceived as stronger (n.s.s.)

      [Bar plot: dominance ratings on a 1-5 scale]




Nov 22nd, 2006                  01/40                               Synthesis
                                      Synthesis
                         Remarks from synthesis section
      • Re-visiting the hypotheses
            ✓ Head motion is important for human-like facial animation
                  • Animation is perceived as more natural with head motion
            ✓ Head motion changes the perception of the emotion
                  • Especially in the valence and activation domains
                  • Head motion needs to be designed to convey the desired emotion
            ✓ Head motion can be synthesized from acoustic features
                  • The synthesized sequences were perceived as being as natural as the original sequences
                  • HMMs capture the relation between prosodic and head motion features




Nov 22nd, 2006
                              Conclusions

       Introduction
       Analysis
       Recognition
       Synthesis
       Conclusions
             Proposed work
             Timeline




Nov 22nd, 2006                    01/40     Conclusions
                                Proposed work

                                 Research goals
      • To jointly model different modalities within an integrated framework
            • Gestures and speech are not synchronous, and they are coupled at different resolutions
      • To explore human emotional perception
            • Different combinations of modalities may create different emotion percepts
      • To study the idiosyncratic influence in expressive human communication
            • How speaker-dependent are the results presented here?
      • At the dyad level, to study how the gestures and speech of the speakers are affected by the feedback provided by other interlocutors
      • At the small-group level, to infer meta-information from the participants' gestures



Nov 22nd, 2006                              01/40                                  Conclusions
                          Proposed work

         Interactive and emotional motion capture database
• Features
      • Dyadic interaction
      • 5 sessions, 2 actors each
      • Emotions were elicited in context
      • ~14 hours of data
      • Markers on the face and on the hands
      • Happiness, sadness, anger, frustration and neutral state
      • Still under preparation

Nov 22nd, 2006                      01/40             Conclusions
                                Proposed work

                                Individual level
       Gesture and speech framework
       • To model different modalities with a single framework that considers
         asynchrony and interrelation between modalities
             • Coupled HMMs and graphical models
       • Multiresolution analysis of gestures and speech during expressive
         utterances
              • Facial gestures and speech are systematically synchronized at different scales (phonemes, words, phrases, sentences) [Cassell,1994]
             • The lower face area is strongly constrained by articulation
             • The upper face area has more degrees of freedom to communicate non-
               linguistic messages
             • A multiresolution decomposition approach may provide a better framework
               to analyze the interrelation between facial and acoustic features
             • We will study the correlation levels of coarse-to-fine representations of
               acoustic and facial features

Nov 22nd, 2006                             01/40                                  Conclusions
                                  Proposed work

                                  Individual level
        Emotion perception
         • Evaluate the hypothesis that different combinations of modalities create different emotion percepts (McGurk effect)
                 • Approach: design controlled experiments using facial animations
                 • Create deliberate mismatches between the emotional speech and
                   specific facial gestures (e.g. eyebrow)
                 • Human raters will assess the emotions conveyed in these animations
                 • For facial animation the open source software Xface will be used
        • Emotion perception in different modalities
                 • We will compare acoustic versus visual emotion perception
                 • We will evaluate the importance of content in emotion perception
                 • Approach: assess the IEMOCAP database


Nov 22nd, 2006                              01/40                              Conclusions
                                    Proposed work

                                    Individual level
        Gestures and speech driven by discourses functions
        • Study influence of high-level linguistic functions in the relationship between
           gestures and speech
        • We propose to analyze gestures that are generated as discourse functions (e.g.
           head nod for “yes”)
        • Application: Improve facial animations
            • e.g. Head motion sequences driven by prosody and discourse functions
        Analysis of personal styles
        • We propose to study the idiosyncrasy aspect of expressive human
          communication
         • Since the IEMOCAP database has 10 subjects, the results can be generalized
                  • To learn inter-personal similarities (speaker-independent emotion recognition systems)
                  • To learn inter-personal differences (better human-like facial animation)


Nov 22nd, 2006                                  01/40                                   Conclusions
                                     Proposed work

                                     Dyad level
        Extension of interplay theory
        • Analyze facial expressions during acoustic silence
        • The lower face area may be modulated as much as the upper/middle face
           areas
        Gestures of active listeners
        • Active listeners respond with non-verbal gestures
                  • These gestures appear at specific structural points in the speaker's words [Ekman,1979]
                 • Application: design active listener virtual agents
        • We propose to analyze the gestures and speech of the subjects when
          they are trying to positively affect the mood of the other interlocutor
                  • Hypothesis: particular gestures are used, which can be learned and synthesized


Nov 22nd, 2006                                    01/40                                     Conclusions
                                    Proposed work

                                  Small group level
        Gestures of the participants
         • Rough estimates of the participants' gestures will be extracted
         • We propose to include this information as an additional cue to measure speaker engagement
         • Use gestures to improve the fusion algorithm
                  • A measure of hand activeness can be used for speaker localization
                  • Head poses of the participants can improve turn-change detection
         Smart room as a training tool
         • Evaluate whether the report provided by the smart room can be used as a training tool for improving participants' skills during discussions




Nov 22nd, 2006                                  01/40                                  Conclusions
                                  Timeline

      January-March
      • Multiresolution analysis of gestures and speech during expressive
         utterances
      • Analysis of McGurk effects in emotional perception of expressive
         facial animations
      • Relation between visual and acoustic boundaries

      April-June
      • Gesture and Speech framework (e.g. CHMM, graphical models)
      • Emotion perception in different modalities (context vs. isolated emotional assessments)
      • Extension of the interplay theory during acoustic silence



Nov 22nd, 2006
                                  Timeline

      July-September
      • Facial animation driven by discourse functions and acoustic features
      • Study of the idiosyncratic aspects of human communication
      • Engagement analysis in multiparty discussions
      October-November
      • Active listener analysis
      • Hand gesture analysis and its relationship with speech




Nov 22nd, 2006
                                          Publications

      Journal Articles
      [1] C. Busso and S.S. Narayanan. “Interrelation between Speech and Facial Gestures in Emotional
          Utterances”. Submitted to IEEE Transactions on Audio, Speech and Language Processing.
      [2] C. Busso, Z. Deng, M. Grimm, U. Neumann and S. Narayanan. “Rigid Head Motion in Expressive
          Speech Animation: Analysis and Synthesis”. IEEE Transactions on Audio, Speech and Language
          Processing, March 2007
      [3] C. Busso, Z. Deng, U. Neumann, and S.S. Narayanan, “Natural head motion synthesis driven by
          acoustic prosodic features,” Computer Animation and Virtual Worlds, vol. 16, no. 3-4, pp. 283-290,
          July 2005.
      Conferences Proceedings
      [1] C. Busso, P.G. Georgiou and S.S. Narayanan. “Real-time monitoring of participants’ interaction in a
          meeting using audio-visual sensors”. Submitted to International Conference on Acoustics, Speech,
          and Signal Processing (ICASSP 2007)
      [2] C. Busso and S.S. Narayanan. “Interplay between linguistic and affective goals in facial expression
           during emotional utterances”. To appear in International Seminar on Speech Production (ISSP 2006)
      [3] M. Bulut, C. Busso, S. Yildirim, A. Kazemzadeh, C.M. Lee, S. Lee, and S. Narayanan,
          “Investigating the role of phoneme-level modifications in emotional speech resynthesis,” in 9th
          European Conference on Speech Communication and Technology (Interspeech 2005 - Eurospeech),
           Lisbon, Portugal, September 2005, pp. 801-804.




Nov 22nd, 2006
                                    Publications

      Conferences Proceedings (cont)
      [4] C. Busso, S. Hernanz, C.W. Chu, S. Kwon, S. Lee, P. G. Georgiou, I. Cohen, S.
         Narayanan. “Smart Room: Participant and Speaker Localization and Identification”.
         In Proc. ICASSP, Philadelphia, PA, March 2005.
      [5] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C.M. Lee, A. Kazemzadeh, S. Lee, U.
         Neumann, and S. Narayanan, “Analysis of emotion recognition using facial
         expressions, speech and multimodal information,” in Sixth International Conference
          on Multimodal Interfaces ICMI 2004, State College, PA, 2004, pp. 205-211, ACM
         Press.
      [6] Z. Deng, C. Busso, S. Narayanan, and U. Neumann, “Audio-based head motion
          synthesis for avatar-based telepresence systems,” in ACM SIGMM 2004 Workshop
          on Effective Telepresence (ETP 2004), New York, NY, 2004, pp. 24-30, ACM Press.
      [7] C.M. Lee, S. Yildirim, M. Bulut, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S.S.
         Narayanan, “Emotion recognition based on phoneme classes,” in 8th International
         Conference on Spoken Language Processing (ICSLP 04), Jeju Island, Korea, 2004.
      [8] S. Yildirim, M. Bulut, C.M. Lee, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S.S.
         Narayanan, “An acoustic study of emotions expressed in speech,” in 8th
         International Conference on Spoken Language Processing (ICSLP 04), Jeju Island,
         Korea, 2004.



Nov 22nd, 2006
                                          Publications

      Abstracts
      [1] S. Yildirim, M. Bulut, C. Busso, C.M. Lee, A. Kazamzadeh, S. Lee, and S. Narayanan. “Study of
           acoustic correlates associate with emotional speech”. J. Acoust. Soc. Am., 116:2481, 2004.
      [2] C. M. Lee, S. Yildirim, M. Bulut, C. Busso, A. Kazamzadeh, S. Lee, and S. Narayanan. “Effects of
           emotion on different phoneme classes”. J. Acoust. Soc. Am., 116:2481, 2004.
      [3] M. Bulut, S. Yildirim, S. Lee, C.M. Lee, C. Busso, A. Kazamzadeh, and S. Narayanan. “Emotion to
           emotion speech conversion in phoneme level”. J. Acoust. Soc. Am., 116:2481, 2004.




                                                 Thanks!




Nov 22nd, 2006
                        Spherical Cubic Interpolation
      • Interpolation procedure
            • Euler angles are transformed to quaternion
            • Key-points are selected by down-sampling the quaternion
            • Spherical cubic interpolation (squad) is used to interpolate key-
              points
            • The interpolated results are transformed to Euler angles

            squad(q1, q2, q3, q4, u) = slerp( slerp(q1, q4, u), slerp(q2, q3, u), 2u(1 - u) )

            slerp(q1, q2, u) = [ sin((1 - u)θ) / sin θ ] · q1 + [ sin(uθ) / sin θ ] · q2,   where θ is the angle between q1 and q2

            • Motivation for spherical cubic interpolation
                  • Interpolation in Euler space introduces jerky movements
                  • It introduces undesired effects such as gimbal lock
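
A direct Python transcription of these two formulas (quaternions as 4-D NumPy unit vectors; the shorter-arc check and the near-parallel fallback are added assumptions):

```python
import numpy as np

def slerp(q1, q2, u):
    """Spherical linear interpolation between unit quaternions q1 and q2."""
    dot = np.clip(np.dot(q1, q2), -1.0, 1.0)
    if dot < 0.0:                    # take the shorter arc
        q2, dot = -q2, -dot
    theta = np.arccos(dot)
    if theta < 1e-8:                 # nearly identical: fall back to linear interpolation
        return (1 - u) * q1 + u * q2
    return (np.sin((1 - u) * theta) * q1 + np.sin(u * theta) * q2) / np.sin(theta)

def squad(q1, q2, q3, q4, u):
    """Spherical cubic interpolation as defined on the slide."""
    return slerp(slerp(q1, q4, u), slerp(q2, q3, u), 2 * u * (1 - u))
```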

Nov 22nd, 2006                                                                                                   Synthesis
                             Techniques
                 Canonical correlation analysis (CCA)
      • Scale-invariant optimum linear framework to measure
        the correlation between two streams of data with
        different dimensions [Dehon, 2000]
      • Basic idea: project the features into a common space in which Pearson's correlation can be computed (see the sketch below)

      Motion coefficient
      • Def: the standard deviation of the sentence-level mean-removed signal
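
A compact sketch with scikit-learn's CCA (one way to realize this projection, not necessarily the implementation used here):

```python
from scipy.stats import pearsonr
from sklearn.cross_decomposition import CCA

def canonical_correlation(X, Y):
    """First canonical correlation between two feature streams of different
    dimensionality (e.g. original vs. synthesized head motion features)."""
    Xc, Yc = CCA(n_components=1).fit_transform(X, Y)   # project into the common space
    return pearsonr(Xc[:, 0], Yc[:, 0])[0]
```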


Nov 22nd, 2006                                            Synthesis
                      HMM Configuration

      • Using generic emotion-independent models, 8 configurations were tested

      • For emotion-dependent models, less training data is available
      • The HMMs were set with a left-to-right topology, S=2, M=2 and K=16

Nov 22nd, 2006                                                        Synthesis
           From Euler Angles to Talking Avatars

      •    Improve
      •    Avatar is synthesized using Maya
      •    A model with 46 blend shapes is used
      •    Lip and eye motions are also included [Deng,2004], [Deng,2005],
           [Deng,2005_2], [Deng,2006]




Nov 22nd, 2006                                                               Synthesis
                    Dynamic Time Warping

      • Two acoustic signals are aligned by finding the optimum path
        (dynamic programming)
      • Optimum path is used to modify the facial gestures




Nov 22nd, 2006
                       MFCC features

      • The first coefficient of the MFCCs was removed
        (Energy)
      • The velocity and acceleration coefficients were also
        included
      • Feature vector was reduced from 36 to 12 using
        Principal Component Analysis (95% of the variance)
      • This post-processed feature vector is what will be referred to from here on as MFCCs (see the sketch below).
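
The sketch below reproduces this pipeline with librosa and scikit-learn as assumed tools; in practice the PCA would be fit on the whole corpus rather than per file, and the sample rate is an assumption.

```python
import numpy as np
import librosa                     # assumed for MFCC extraction
from sklearn.decomposition import PCA

def mfcc_features(wav_path):
    """13 MFCCs minus the energy-related first coefficient, plus velocity and
    acceleration, reduced with PCA to the components covering 95% of the variance."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)[1:]        # drop the first coefficient
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),               # velocity
                       librosa.feature.delta(mfcc, order=2)])     # acceleration
    return PCA(n_components=0.95).fit_transform(feats.T)          # frames x ~12
```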




Nov 22nd, 2006                                                 Analysis
                         Markers post processing

      • Translate the markers (the nose marker is the reference)
      • Frames are multiplied by a rotational matrix (see the sketch below)
            • Choose a neutral pose as reference (a 102x3 matrix M_ref)
            • For frame t, construct a similar matrix M_t
            • Compute the singular value decomposition, UDV^T, of the matrix M_ref^T · M_t
            • The product VU^T gives the rotational matrix R_t [Stegmann,2002]
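
A NumPy sketch of this per-frame rotation estimate, following the SVD recipe above (the function name and the final alignment comment are illustrative):

```python
import numpy as np

def head_rotation(M_ref, M_t):
    """Rotation aligning frame M_t (102 x 3 nose-centered markers) to the neutral
    reference M_ref, via the SVD of M_ref^T M_t as described above [Stegmann,2002]."""
    U, _, Vt = np.linalg.svd(M_ref.T @ M_t)   # M_ref^T M_t = U D V^T
    return Vt.T @ U.T                         # R_t = V U^T

# Head-motion compensation (markers as row vectors): M_aligned = M_t @ R_t
```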




Nov 22nd, 2006                                                                         Analysis
        Statistically significant correlation analysis



      • Sentence level
        Mapping



      • Global Level
        Mapping



Nov 22nd, 2006
                 Phoneme level analysis of correlation




Nov 22nd, 2006
                 Features from Facial Expression

      • 4-D feature vector per utterance
      • Data is normalized to remove head motion
      • Five facial areas are defined
      • The 3-D marker coordinates are concatenated
      • PCA is used to reduce them to a 10-D vector
      • Frame-level classification (K-nearest neighbor)
      • The frame-level statistics are aggregated at the utterance level, forming a 4-D feature vector (see the sketch below)
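
One reading of this pipeline as a short scikit-learn sketch (the frame features, integer emotion labels 0-3 and the value of k are assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def utterance_emotion_histogram(train_frames, train_labels, utterance_frames, k=3):
    """Frame-level K-nearest-neighbor classification over 4 emotion classes; the
    fraction of frames assigned to each class forms the utterance-level 4-D vector."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(train_frames, train_labels)
    pred = knn.predict(utterance_frames)              # labels assumed to be 0..3
    return np.bincount(pred, minlength=4) / len(pred)
```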




Nov 22nd, 2006
              More details on emotion recognition
                                   Multimodal Systems
• Feature-level integration (89%)
      • High performance for anger, happiness and neutral state
      • Lower performance for sadness (79%)
      • The performance of happiness decreased

                  Anger   Sadness   Happiness   Neutral
      Anger        0.95     0.00      0.03       0.03
      Sadness      0.00     0.79      0.03       0.18
      Happiness    0.02     0.00      0.91       0.08
      Neutral      0.01     0.05      0.02       0.92

• Decision-level integration (89%)
      • The product of the posterior probabilities was the best criterion
      • Product criterion: similar overall results, but large per-emotion differences

                           Overall   Anger   Sadness   Happiness   Neutral
      Maximum combining      0.84     0.82     0.81       0.92       0.81
      Averaging combining    0.88     0.84     0.84       1.00       0.84
      Product combining      0.89     0.84     0.90       0.98       0.84
      Weight combining       0.86     0.89     0.75       1.00       0.81

      In detail (product combining):
                  Anger   Sadness   Happiness   Neutral
      Anger        0.84     0.08      0.00       0.08
      Sadness      0.00     0.90      0.00       0.10
      Happiness    0.00     0.00      0.98       0.02
      Neutral      0.00     0.02      0.14       0.84


Nov 22nd, 2006
                 Smart room fusion




Nov 22nd, 2006

								