Determination of Prosodic Feature Set for Emotion Recognition in Call Center Speech

Eliza Concepcion E. Ebarvia (1), eeebarvia@up.edu.ph
Michael Gringo Angelo R. Bayona (1), mrbayona@up.edu.ph
Franz A. de Leon, MS EE (1), franz@eee.upd.edu.ph
Mary Shalom B. Lopez (1), risha.lopez@gmail.com
Rowena Cristina L. Guevara, Ph.D. (1), gev@eee.upd.edu.ph
Belen D. Calingacion, Ph.D. (2), belcal@yahoo.com
Prospero C. Naval Jr., Ph.D. (3), p.naval@ieee.org

(1) Digital Signal Processing Laboratory, Department of Electrical and Electronics Engineering
(2) Department of Speech Communications and Theater Arts
(3) Department of Computer Science
University of the Philippines Diliman, Quezon City

ABSTRACT
In this paper, we determined the feature parameters needed to formulate an emotion recognition algorithm for technical call centers. Four emotions were considered for the system output, for both the client and the call center agent: anger, boredom, happiness or satisfaction, and neutral. Our proposed emotion recognition system aims to be independent of the context of the words uttered; therefore, only prosodic features were considered as system parameters. Each feature was tested for its base performance using the ratio between the between-class variance and the within-class variance. Multivariate discriminant analysis was also performed on each feature to determine its accuracy on actual call center data. Among the features, the median derivative of pitch has the highest accuracy at 69.62%, with neutral as the most successfully recognized emotion.

Categories and Subject Descriptors
H.1.2 [Information Systems]: User/Machine Systems – human factors, human information processing

General Terms
Human Factors

Keywords
Emotion Recognition System, Human Machine Interface, Call Centers, Prosody

1. INTRODUCTION
The ability of a call center to provide satisfactory service to its clients is a measure of its operational output. Customers’ emotions should therefore be gauged correctly so that their needs can be attended to efficiently. Call center owners also need to keep track of how their agents react to clients.

To help call center owners and agents, an automated emotion recognition system should be developed. An emotion classification system would help call center agents decipher or validate the emotions conveyed by their customers. Better performance on every call translates to improved company operations and business security. With the improved human-machine interaction brought by such a system, both call center clients and owners are assured of quality service and increased customer satisfaction.

A human-machine interface (HMI) can be developed to help call center personnel respond to clients in the most suitable manner. The HMI should recognize the present situation in a way similar to how humans do and respond accordingly. To achieve this goal, the HMI should monitor and understand the emotions of both the agent and the client in a call.

An emotion recognition system consists of seven modules: speech input, preprocessing, spectral analysis, feature extraction, feature selection, emotion classification, and recognized emotion output. Figure 1 shows a structure for an emotion recognition system suggested by Bhatti et al.

Figure 1. Emotion Recognition System.

This project focuses on the feature extraction block of an emotion recognition system. Different speech factors that determine the emotion communicated by a person were studied. Specifically, this study looked for the prosodic features that best discriminate one emotion from another in call center speech. This set of features was tested for its performance through standard pattern recognition techniques. The final feature set used for the emotion classification system consisted of the features that ranked well in the different tests.

This study is part of a larger project under the University of the Philippines Diliman Office of the Vice-Chancellor for Research and Development (OVCRD) entitled “Interdisciplinary Signal Processing for Pinoys, Project Three: Emotion Recognition Software for Call Centers.” The undertaking is a joint effort of the Department of Electrical and Electronics Engineering (DEEE) and the Department of Computer Science (DCS) of the College of Engineering, and the Department of Speech Communication and Theater Arts (DSCTA) of the College of Arts and Letters. The DSCTA is in charge of the data acquisition and initial analysis needed for the system’s database and testing; the DEEE covers the in-depth speech data analysis and the formation of the feature set parameters; and the DCS is responsible for the development of the actual emotion recognition software using neural networks.

The project consisted of several stages. First, the acquired recordings were processed and converted to an audio file format fit for feature extraction. To complete the training and testing data, the audio data were then transcribed and labelled. In this stage, we indicated the times at which the agent or the client is talking and the emotion that is conveyed.

In previous years, different studies on the recognition of emotion have been conducted. Table 1 shows the percentage accuracy of the emotion recognition systems constructed in these studies. The studies used different kinds of data. The studies of Dellaert et al., Bhatti et al., and Pittermann used acted spoken emotions, while Lee and Narayanan used spoken dialogs from a call center. These studies did not cite the specific environment in which the emotion recognition system was to be used.
It is important to know the environment in which the classifier will be used in order to direct data processing. Anticipating the scenario based on the environment can improve recognition accuracy, as opposed to classifying speech blindly. Differences in percentage accuracy result from the varying classification techniques of each recognition system and from the number of emotion outputs, not from the number of features alone. It is important to determine the optimal number of prosodic features to minimize classification errors.

Table 1. Different Emotion Recognition System Studies

| Emotion Recognition Study | Emotions Detected | Number of Utterances in Speech Corpus | Number of Prosodic Features | Percentage Accuracy |
|---|---|---|---|---|
| Lee and Narayanan, 2005 | Negative and non-negative | 7200 | 21 | 76.87% |
| Dellaert et al., 1996 | Happiness, sadness, anger, fear and neutral | 1250 | 17 | 71.50% |
| Bhatti et al., 2004 | Happiness, sadness, anger, fear, surprise and disgust | 500 | 17 | 77.24% |
| Pittermann, 2006 | Anger, boredom, disgust, fear, happiness, sadness and neutral | 650 | 24 | 72.00% |
| Chuang and Wu, 2004 | Happiness, sadness, anger, fear, surprise, disgust and neutral | 110 | 33 | 76.44% |

After transcription and labeling, prosodic features were extracted from the speech samples for the database construction and feature set formation. Finally, the feature set was tested for its base performance and discriminant analysis was performed.

2. METHODOLOGY

2.1 Data
A call center company provided 39.6 hours of recordings as data for the project. All recordings were in *.wma format, stereo, sampled at 44,100 Hz with a bit rate of 128 kbps. The data were converted to *.wav files sampled at 8000 Hz with 16-bit resolution. Not all of the recordings were valid data because of duplicates and clipped recordings. Considering only unique calls and undistorted data, the valid recordings amounted to 28.91 hours.

Aside from the file format, another undesirable characteristic of the recordings was the presence of a floating DC offset. The offset did not appear throughout the duration of a call. If the recordings were used for human analysis only, there would be no undesirable effect, since the human ear cannot perceive the offset. However, once automated feature extraction commenced, errors would arise, especially in the intensity-based features, leading to faulty emotion classifier training. As such, the DC offset must be eliminated right after data conversion. Removing the time average of every 10 ms frame (80 samples) solved this problem.

Since the data for both agent and client came in one-channel recordings, the calls had to be transcribed to keep track of when the agent or the client is speaking. As the initial phase of transcription, the raw input speech signal was processed in order to distinguish the voice of the agent from that of the client. Speaker-level transcription involved indicating the parts of the recording that contain the agent’s speech, the client’s speech, and non-speech.

The call center company gave no record of the emotions of the client and agent. After the speaker-level transcription, all data were passed to the Department of Speech Communications and Theater Arts (DSCTA) for emotion labeling. A DSCTA researcher classified the speech signals according to the emotion portrayed by either the agent or the client. During emotion labeling, the researcher only had to type in the emotion portrayed at that particular point in the call. For the purpose of the project, all labels were considered as the ground truth and would not be changed unless the DSCTA researcher specified otherwise. The first experiment assumes that one person is enough to represent the general perception of the emotion conveyed by a speaker.

For the final experiment, a jury was formed to perform the same emotion labeling on the transcribed data, in order to check and validate the ground truth established in the first experiment. The jury was composed of three DSCTA professors and five student affiliates of the UP DSP laboratory. Each member was given a CD containing all the call center recordings and a GUI through which to label the calls. Figure 2 shows the interface used. The GUI’s output for a single call is a text file with the emotion label for each utterance. All the text files were gathered, and a database of the labels with the most votes was generated. This database became the ground truth for emotion classification.

Figure 2. Emotion Labeling Graphical User Interface.

2.2 Feature Extraction
Figure 3 shows the flowchart of the feature extraction algorithm used. Feature extraction starts by extracting the intensity contour of the speech signal, from which vowel detection is implemented and the intensity-based features are computed. The parts of the speech that contain the vowels are then passed to the pitch detection block, which produces the pitch contour of the voiced parts; the pitch of each speech frame is computed using the short-time average magnitude difference function (AMDF) (Yu-min, 2003). The pitch-based features are then derived from the pitch contour.

For the intensity contour, the speech signal is segmented into 25 ms frames with 10 ms overlap. Equation (1) shows how the intensity of each frame is computed:

$$E = 10 \log_{10} \sum_{n=1}^{N} s^{2}(n) \qquad (1)$$

where s(n) is the n-th sample of the frame and N is the number of samples in the frame. The vowels are the voiced parts of the speech signal, which are indicated by the peaks in the intensity contour. These peaks are detected, and the number of peaks approximates the number of vowels present in the signal.
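As an illustration of the DC-offset removal of Section 2.1 and the intensity steps of Section 2.2, the following Python sketch removes the block-wise time average, computes the intensity of Equation (1) per frame, and counts the peaks of the resulting contour. It is not the project's implementation. The function names, the hop of 120 samples (a 200-sample frame minus an 80-sample overlap at 8000 Hz), the small energy floor, and the peak threshold are all assumptions made for the example.

```python
import math

FRAME_LEN = 200  # 25 ms at 8000 Hz
OVERLAP = 80     # 10 ms at 8000 Hz (assumed to mean shared samples)
DC_BLOCK = 80    # 10 ms blocks used for DC-offset removal


def remove_dc_offset(signal, block=DC_BLOCK):
    """Subtract the time average of every block of `block` samples."""
    cleaned = []
    for start in range(0, len(signal), block):
        chunk = signal[start:start + block]
        mean = sum(chunk) / len(chunk)
        cleaned.extend(x - mean for x in chunk)
    return cleaned


def intensity_contour(signal, frame_len=FRAME_LEN, overlap=OVERLAP):
    """Per-frame intensity E = 10*log10(sum of s(n)^2), as in Equation (1)."""
    hop = frame_len - overlap
    contour = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        energy = sum(x * x for x in signal[start:start + frame_len])
        # The floor is an assumption that avoids log10(0) on silent frames.
        contour.append(10.0 * math.log10(max(energy, 1e-12)))
    return contour


def count_peaks(contour, threshold_db):
    """Count local maxima above threshold_db (hypothetical vowel estimate)."""
    return sum(
        1
        for i in range(1, len(contour) - 1)
        if contour[i] > threshold_db
        and contour[i] > contour[i - 1]
        and contour[i] >= contour[i + 1]
    )
```

A speaking-rate estimate could then be obtained by dividing the peak count by the duration of the utterance, although the paper does not state how its Speaking Rate feature is normalized.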
The average magnitude difference function of a frame of signal is computed using the following equation:

$$x_w(l) = \sum_{n=1}^{N-l+1} \left| s_w(n+l-1) - s_w(n) \right|, \qquad l = 1, 2, 3, \ldots, N \qquad (2)$$

where s_w(n) is the speech signal and N is the length of a frame of the speech signal.

The preliminary feature set used for this project is as follows:

Intensity-based features: (1) Speaking Rate, (2) Mean Value of Individual Voiced Parts, (3) Mean Minimum of Individual Voiced Parts, (4) Mean Maximum of Individual Voiced Parts.

Pitch-related features: (1) Minimum Pitch, (2) Maximum Pitch, (3) Median Pitch, (4) Standard Deviation of Pitch, (5) Range of Pitch, (6) Minimum Derivative of Pitch, (7) Maximum Derivative of Pitch, (8) Median Derivative of Pitch, (9) Standard Deviation of Derivative of Pitch, (10) Upslope Ratio, (11) Mean Positive Derivative of Individual Slopes, (12) Mean Negative Derivative of Individual Slopes.

2.3 Testing

2.3.1 Base Performance Test
The base performance test checks whether a feature is effective in determining the emotion in speech, that is, how well it distinguishes one emotion from another. Some features may be redundant or may affect the emotion recognition system negatively. Each feature was rated by its base performance ratio (BPR), the ratio between its between-class variance ($\sigma_b^2$) and its within-class variance ($\sigma_w^2$) (Ververidis and Kotropolos, 2004). The between-class variance represents the distance between the class means, while the within-class variance measures the spread of the samples around each class mean. The best features are those with a high BPR. The features were ranked according to their BPR, and the features with a high ranking would form the feature set.
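The ratio described above can be sketched in a few lines of Python. The paper does not give the exact variance estimators, so this example assumes one common choice: the between-class variance is the size-weighted mean squared distance of each class mean from the grand mean, and the within-class variance is the mean squared distance of each sample from its own class mean. The function name and the toy values are hypothetical.

```python
from statistics import fmean


def base_performance_ratio(samples_by_class):
    """Return between-class variance divided by within-class variance.

    samples_by_class maps an emotion label to the list of values that a
    single feature takes for that emotion.
    """
    all_values = [x for values in samples_by_class.values() for x in values]
    total = len(all_values)
    grand_mean = fmean(all_values)

    between = 0.0
    within = 0.0
    for values in samples_by_class.values():
        class_mean = fmean(values)
        between += len(values) * (class_mean - grand_mean) ** 2
        within += sum((x - class_mean) ** 2 for x in values)

    # Assumes at least one class has some spread; otherwise within is zero.
    return (between / total) / (within / total)


# Made-up feature values for two emotions, for illustration only.
toy = {"neutral": [1.0, 3.0], "angry": [5.0, 7.0]}
print(base_performance_ratio(toy))  # 4.0
```

Ranking a set of features then amounts to sorting them by this value in descending order.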
Note that the BPR only compares the performance of one feature against another and does not give any accuracy value for a particular emotion classifier. Consequently, the top features are simply those that achieved the highest BPR.

2.3.2 Multivariate Discriminant Analysis
This performance test fits a multivariate normal density to each emotion group, using the pooled estimate of covariance from the training set. With this test, we could predict the performance of each feature. The test can also give an error estimate based on the same training set. For the project, the training set contained eighty percent (80%) of the samples of each emotion in the feature database produced by the feature extraction phase, while the remaining twenty percent (20%) of each emotion comprised the testing set. The testing set was completely isolated from the training set. The features were ranked again according to their percentage accuracy, and the features with a high ranking would also form the feature set. Performance accuracy was computed as the ratio between the number of correctly identified emotions and the total number of samples tested.

The preliminary set of features, which includes the intensity-based and pitch-based features, was tested using multivariate discriminant analysis. Another set of features, which included the Mel-frequency cepstral coefficients (MFCCs), the intensity contour and the pitch contour, was also tested using multivariate discriminant analysis.

Figure 3. Flowchart of the Feature Extraction Algorithm.

As the second set of prosodic features, the pitch and intensity histories together with the MFCCs of all recordings were studied. The pitch and intensity values of every frame of the speech signal were studied, instead of just the characteristics of the pitch and intensity contours.
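The train/test procedure of this section can be illustrated with a simplified Python sketch. It handles only a single scalar feature, so the pooled covariance reduces to one pooled variance; the multivariate case used in the paper would need matrix operations that are omitted here. The function names, the random seed, and the use of class priors taken from the training proportions are assumptions for the example, not details given in the paper.

```python
import math
import random
from statistics import fmean


def stratified_split(samples_by_class, train_fraction=0.8, seed=0):
    """Split each emotion's samples 80/20 so the test set stays isolated."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, values in samples_by_class.items():
        shuffled = list(values)
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        train[label], test[label] = shuffled[:cut], shuffled[cut:]
    return train, test


def fit_pooled_gaussian(train):
    """Estimate per-class means, priors, and one pooled variance."""
    total = sum(len(v) for v in train.values())
    means = {label: fmean(v) for label, v in train.items()}
    squared_error = sum(
        (x - means[label]) ** 2 for label, v in train.items() for x in v
    )
    pooled_var = squared_error / (total - len(train))
    priors = {label: len(v) / total for label, v in train.items()}
    return means, priors, pooled_var


def classify(x, means, priors, pooled_var):
    """Pick the class with the largest log prior plus Gaussian log-likelihood."""
    def score(label):
        return math.log(priors[label]) - (x - means[label]) ** 2 / (2 * pooled_var)
    return max(means, key=score)


def accuracy(test, means, priors, pooled_var):
    """Correctly identified samples divided by all samples tested."""
    correct = sum(
        classify(x, means, priors, pooled_var) == label
        for label, v in test.items() for x in v
    )
    return correct / sum(len(v) for v in test.values())
```

With priors estimated from a training set that is dominated by one class, a classifier of this kind tends to favor that class, which is one way the skewed emotion distribution of the data could influence the reported accuracies.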
Working with frame-level values allows the changes in the pitch and intensity contours to be observed closely, as opposed to studying only the summary of each contour as described by the preliminary feature set. The MFCC is another representation of the cepstral coefficients in which the analysis is done on a non-linear frequency scale known as the Mel scale. This scale approximates the human auditory response more closely than the linearly spaced frequency bands of the Fast Fourier Transform (FFT) or the Discrete Cosine Transform (DCT). All features belonging to the second feature set were extracted for every 25 ms frame (200 samples) with 10 ms overlap.

3. RESULTS AND ANALYSIS

3.1 Initial Experiment
The data used for this experiment have an unequal distribution of emotions. The neutral emotion had the most samples while bored had the fewest. Table 2 shows the distribution of emotions used for the analysis.

Table 2. Distribution of Emotions for the Initial Experiment.

| Emotion | Distribution (%) |
|---|---|
| Neutral | 91.28 |
| Angry | 5.7 |
| Happy | 2.4 |
| Bored | 0.62 |

As an initial test, the base performance ratio (BPR) of each feature was computed. Table 3 shows the top 10 features ranked according to their base performance. Based on the table, only the top four features are considered valid for recognizing an emotion. The remaining six features were not considered because of the large difference between their BPR and that of the top four features. The top two features are intensity-based, while the features that ranked third and fourth are both pitch-based.

Table 3. Base Performance Results for the Initial Experiment

| Rank | Feature | BPR |
|---|---|---|
| 1 | Mean Maximum of Individual Voiced Parts | 0.065 |
| 2 | Mean of Individual Voiced Parts | 0.057 |
| 3 | Median Pitch | 0.048 |
| 4 | Maximum Pitch | 0.027 |
| 5 | Minimum Pitch | 0.013 |
| 6 | Mean Minimum of Individual Voiced Parts | 0.013 |
| 7 | Upslope Ratio | 0.012 |
| 8 | Mean Positive Derivative of Individual Slopes | 0.007 |
| 9 | Mean Derivative of Pitch | 0.006 |
| 10 | Range of Pitch | 0.004 |

Since the results of the base performance test indicate only the relative performance of each feature and not the actual percentage accuracy when the feature is fed to an emotion classifier, another test, one that can predict the actual performance of each emotion feature, had to be performed. This test was the multivariate discriminant analysis. The ranking of the features based on the multivariate discriminant analysis is shown in Table 4. Five features obtained a success rate greater than 50%, with the Mel-Frequency Cepstral Coefficients obtaining the highest rate of 65.32%.

Table 4. Ranking of Features Based on a Multivariate Discriminant Analysis on a Per Utterance Level

| Rank | Feature | Success Rate (%) |
|---|---|---|
| 1 | Mel-Frequency Cepstral Coefficients | 65.32 |
| 2 | Mean Maximum Value of Individual Voiced Parts | 58.49 |
| 3 | Mean Value of Individual Voiced Parts | 57.26 |
| 4 | Median Pitch | 50.23 |
| 5 | Pitch Contour | 50.11 |
| 6 | Intensity Contour | 47.42 |
| 7 | Standard Deviation of Derivative of Pitch | 46.86 |
| 8 | Speaking Rate | 46.37 |
| 9 | Median Derivative of Pitch | 46.31 |
| 10 | Maximum Pitch | 46.15 |

3.2 Final Experiment
The emotion labels from the DSP and the DSCTA were combined, and the new emotion distribution is shown in Table 5.

Table 5. Distribution of Emotions for the Final Experiment.

| Emotion | Distribution (%) |
|---|---|
| Neutral | 94.98 |
| Angry | 3.56 |
| Happy | 0.35 |
| Bored | 1.11 |

The performance of each feature was again obtained. Table 6 shows the ranking of the features according to their accuracy rates. The values are now much lower than in the previous experiment, with only the median derivative of pitch attaining a performance rate higher than 50%. Pitch-based features now have relatively higher success rates, occupying four of the top five slots. The previous experiment, on the other hand, had an equal number of intensity- and pitch-based features. Table 7 shows the confusion matrix for the median derivative of pitch.
The neutral emotion remained the most recognized emotion, while angry was not properly recognized at all.

Table 6. Ranking of Features Tested Using Multivariate Discriminant Analysis

| Rank | Feature | Success Rate (%) |
|---|---|---|
| 1 | Median Derivative of Pitch | 69.62 |
| 2 | Minimum Pitch | 35.34 |
| 3 | Mel-Frequency Cepstral Coefficients | 25.02 |
| 4 | Mean Positive Derivative of Individual Slopes | 24.02 |
| 5 | Pitch | 22.09 |
| 6 | Median Pitch | 14.78 |
| 7 | Intensity | 11.1 |
| 8 | Mean Negative Derivative of Individual Slopes | 10.93 |
| 9 | Range of Pitch | 9.35 |
| 10 | Mean Value of Individual Voiced Parts | 8.79 |
| 11 | Maximum Derivative of Pitch | 8.12 |
| 12 | Standard Deviation of Derivative of Pitch | 7.61 |
| 13 | Standard Deviation of Pitch | 7.57 |
| 14 | Speaking Rate | 7.21 |
| 15 | Maximum Pitch | 7.14 |
| 16 | Minimum Derivative of Pitch | 6.32 |
| 17 | Mean Minimum Value of Individual Voiced Parts | 6.06 |
| 18 | Mean Maximum Value of Individual Voiced Parts | 4.36 |
| 19 | Upslope Ratio | 2 |

Table 7. Confusion Matrix for Median Derivative of Pitch (rows: actual emotion; columns: recognized emotion, in %)

| Actual emotion | Neutral | Angry | Bored | Happy |
|---|---|---|---|---|
| Neutral | 75.33 | 0.13 | 22.81 | 1.73 |
| Angry | 76.46 | 0 | 20.54 | 3.00 |
| Bored | 60.87 | 0 | 39.13 | 0 |
| Happy | 66.67 | 0 | 22.22 | 11.11 |

Based on the confusion matrices, it is evident that neutral is the most recognized of the four emotions being tested. The other three emotions had much lower accuracy rates than neutral. When the emotion distribution shown in Table 5 is closely examined, neutral is the only well-represented emotion in the data used. The percentages of samples for the angry, bored, and happy emotions are much lower than that of neutral. Consequently, the rate of correctly identifying the emotion present in a speech signal is correlated with the distribution of emotions in the data. The well-represented emotion had the highest accuracy rate, while the other three emotions, each with less than 10% of the distribution, were harder to identify.

4. CONCLUSION
According to our base performance tests, features relating to the individual voiced parts and the fundamental frequency achieved a high BPR compared to the other features. The results of the base performance test, however, are yet to be validated by the neural network tests. It was for this reason that the researchers applied multivariate discriminant analysis to all features. Based on the multivariate discriminant analysis, the Median Derivative of Pitch is the most suitable feature for identifying the emotion present in a speech signal. This feature achieved an accuracy of 69.62%.

As for the emotions, neutral was the most successfully recognized emotion, while the other three emotions had much lower accuracy rates. This pattern can be related to the distribution of emotions in the data used: neutral had the most samples, while the other emotions had far fewer. We expect that once the samples for the angry, happy and bored emotions are increased to almost the same number as the samples of the neutral emotion, a high classification rate for these features will be achieved.

5. ACKNOWLEDGEMENTS
This project was granted financial support by the Office of the Vice-Chancellor for Research and Development (OVCRD) under Project/Grant No. 070702 OG entitled, “Development of Interdisciplinary Signal Processing for Pinoys (ISIP) Program - Project Three: Development of Emotion Classification Algorithms for Call Center Speech Analysis.”

6. REFERENCES
[1] Bhatti, M.W., Y. Wang, and L. Guan. 2004. A Neural Network Approach for Human Emotion Recognition in Speech. Proceedings of IEEE International Symposium on Circuits and Systems.
[2] Chuang, Z.J., and C.H. Wu. 2004. Emotion Recognition using Acoustic Features and Textual Content. Proceedings of IEEE International Conference on Multimedia and Expo.
[3] Dellaert, F., T. Polzin, and A. Waibel. 1996. Recognizing Emotion in Speech. Proceedings of Fourth International Conference on Spoken Language.
[4] Lee, C.M., and S. Narayanan. 2005. Toward Detecting Emotion in Spoken Dialogs. IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, pp. 293-303.
[5] Pittermann, A., and J. Pittermann. 2006. Getting Bored with HTK? Using HMMs for Emotion Recognition from Speech Signals. Proceedings of IEEE International Conference on Signal Processing.
[6] Ververidis, D., and C. Kotropolos. 2004. Automatic Emotion Speech Classification. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing.
[7] Yu-min, Z., W. Zhen-yang, L. Hai-Bin, and L. Zhou. 2003. Modified AMDF Pitch Detection Algorithm. Proceedings of the Second International Conference on Machine Learning and Cybernetics.
