Integration of Negative Emotion Detection into a VoIP Call Center System

Tsang-Long Pao, Chia-Feng Chang, and Ren-Chi Tsao
Department of Computer Science and Engineering
Tatung University, Taipei, Taiwan

Abstract - The speech signal contains not only the semantics of the spoken words but also the emotion state of the speaker. By analyzing the voice signal to recognize the emotion hidden in the speech, it is possible to identify the emotion state of the speaker. With the integration of a speech emotion recognition system into a VoIP call center system, we can continuously monitor the emotion states of the service representatives and the customers. In this paper, we propose a framework that integrates a speech emotion recognition system into a VoIP call center system. Using this setup, we can detect in real time the speech emotion in the conversation between service representatives and customers. The system can display the emotion states of the conversation in a monitoring console and, in the event of a negative emotion being detected, issue an alert signal to the service manager, who can then promptly react to the situation.

Keywords: Speech Emotion Recognition, Call Center, Negative Emotion Detection, WD-KNN Classifier

1  Introduction

The voice signal in a conversation carries the semantics of the spoken words and also the emotion state of the speaker. If a dispute happens between the service representative and the customer, there is no way in current call center systems to notify the service manager to take action immediately. Since customer service plays an important role for an enterprise, customer satisfaction is very important. Therefore, managing the customer service department to improve customer satisfaction is an important issue for the enterprise.

Traditional customer service lacks the ability to issue alerts for conversations with negative emotion in real time. For example, when the customer disagrees with the service representative, a dispute may arise. A traditional call center handles a large number of calls every day and usually records all the calls for later analysis, to see whether there was any improper conversation. However, such a setup cannot handle a dispute situation in a timely manner.

A considerable number of studies have been made on speech emotion recognition over the past decades [1-16]. By integrating a speech emotion recognition system into the VoIP call center system, we can continuously monitor the emotional states of the service representatives and the customers. In this paper, we propose a mechanism to integrate a negative emotion detection engine into a VoIP call center system. A parallel processing architecture is implemented to meet the performance requirements of the system. We also record the emotion states of all calls into a database. Alerts are issued to the service manager whenever a negative emotion, such as anger, is detected. The service manager then has a chance to intervene between the two quarreling parties, pacify the customer, and resolve the problem immediately. With this mechanism, the level of customer satisfaction can be enhanced.

The organization of this paper is as follows. In Section 2, the background and related research on speech emotion recognition and Voice over Internet Protocol (VoIP) are reviewed. In Section 3, the system architecture of multi-line negative emotion detection in a VoIP call center is described. In Section 4, the experimental setup is presented and the results are discussed. Conclusions are presented in Section 5.

2  Backgrounds

2.1  Speech Emotion Recognition

In the past, quite a lot of researchers have studied human emotion and tried to define what an emotion is. It is hard to define emotion categories because there is no single universally agreed definition. The emotion categories defined by Ortony and Turner are a commonly accepted definition [10]. In recent years, psychological studies have tended to divide emotions into basic emotions and complex emotions, where a complex emotion is a derived version of the basic emotions.

In addition to the semantics of the spoken words, the speech signal also carries information about the emotion state of the speaker. That is, inside the speech signal there are features that are related to the emotion state at the time of making that speech. By analyzing these features, it is possible to classify the emotion categories with a suitable classifier.
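As a simple illustration of such feature analysis, two of the low-level cues mentioned here, signal amplitude and frequency content, can be approximated per frame by short-time energy and zero-crossing rate. The following pure-Python sketch is illustrative only (the frame length, hop size, and function name are assumptions, not the feature set used later in this paper):

```python
import math

def frame_features(samples, frame_len=160, hop=80):
    """Split a mono signal into overlapping frames and compute,
    per frame, short-time energy (related to amplitude) and
    zero-crossing rate (related to frequency content)."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        zcr = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        ) / (frame_len - 1)
        feats.append((energy, zcr))
    return feats

# A 100 Hz tone sampled at 8 kHz: high energy, low zero-crossing rate.
tone = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(800)]
feats = frame_features(tone)
```

Real systems use richer features (MFCC, LFPC, etc., as discussed below), but the frame-by-frame structure is the same.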
Features related to speaking rate, signal amplitude, frequency, jitter, and formants have been studied in speech emotion recognition research [1-4].

In previous studies of emotion recognition, several aspects have been addressed. Some studies tried to find the acoustic features most relevant to the emotion inside the speech signal [13-16]. Searching for the most suitable machine learning algorithm for the classifier is also a topic that has attracted quite a lot of attention [7][9-12]. In most previous studies, short speech corpora are used in the experiments. However, in this research, we need to deal with continuous speech, which is long in nature. Therefore, we need to find a way to properly segment the speech signal and categorize the emotion such that burst misclassification will not affect the accuracy of the recognition.

A corpus is a large collection of utterance segments and plays an important role in emotion recognition research. In this paper, the D80 corpus built in previous studies is used to train our emotion recognition engine. The speech corpus was collected from 18 males and 16 females who were given 20 scripts and asked to speak each of them in five emotions: anger, boredom, happiness, neutral, and sadness. A subjective test was performed, and only the utterances with over 80% agreement were kept. After this process, 570 utterances remained. The numbers of utterances in each emotion category of the D80 corpus are 151 for anger, 83 for boredom, 96 for happiness, 116 for neutral, and 124 for sadness.

A considerable number of previous studies have been made on feature selection in order to improve the accuracy of emotion recognition. The core of feature selection is to reduce the dimension of the feature set to the smallest feature combination that yields the highest recognition accuracy. The speech features commonly used in previous emotion recognition research include formants (F1, F2, and F3), shimmer, jitter, Linear Predictive Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), Mel-Frequency Cepstral Coefficients (MFCC), the first derivative of MFCC (dMFCC), the second derivative of MFCC (ddMFCC), Log Frequency Power Coefficients (LFPC), Perceptual Linear Prediction (PLP), and Zero-Crossing Rate (ZCR). According to previous studies, the MFCC is the most commonly used feature for emotion recognition [13]. Therefore, we choose the MFCC as the acoustic feature input to the classifier in this research.

In machine learning, the purpose of a classifier is to classify objects with similar characteristics into the same class. Classification can be divided into two types: supervised (e.g., KNN) and unsupervised (e.g., k-means). In this research, we used the Weighted D-KNN (WD-KNN) classifier, a variant of KNN proposed in [11]. KNN is a classification algorithm that assigns the test sample to a class based on the distances between the test sample and its k nearest training samples. WD-KNN extends KNN by comparing weighted distance sums to maximize the classification accuracy. The weight calculation assigns a higher weight to neighbors that provide more reliable information.

For an M-class classification using KNN, let the k neighbors nearest to an unknown test sample y be N_k(y), and let c(z) be the class label of training sample z. The subset of the nearest neighbors in class j ∈ {1, …, M} is

    N_k^j(y) = { z ∈ N_k(y) : c(z) = j }                (1)

Denote the cardinality (the number of elements) of the set N_k^j(y) as |N_k^j(y)|. Then the classification of y into class j* is the majority class vote, that is:

    j* = arg max_j { |N_k^j(y)| }                        (2)

For WD-KNN classification, we need to select the k nearest neighbors from each class. Let d_i^j denote the Euclidean distance from the ith nearest neighbor in class j to the unknown sample y. The distances are in ascending order, that is, d_i^j ≤ d_{i+1}^j. The weighted distance sum from sample y to all the k nearest neighbors in class j is then

    D_j = Σ_{i=1}^{k} w_i d_i^j                          (3)

where w_i ≥ w_{i+1} for all i. As discussed in [11], the best recognition rate can be obtained by using weights in the reverse-ordered Fibonacci sequence, which is:

    w_i = w_{i+1} + w_{i+2},  w_{k-1} = w_k = 1          (4)

The classification of y into class j* is then the class with the shortest weighted distance:

    j* = arg min_j { D_j }                               (5)

In conclusion, a speech emotion recognition system takes the voice signal as input and recognizes the emotion of the speaker at the instant of making that speech. From the extracted features, the most likely emotion is judged by a classification algorithm. The block diagram of a speech emotion recognition system is shown in Figure 1. In this system, the selected features are extracted and sent to the classifier to determine the most probable emotion of each segment.
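The WD-KNN decision rule of Eqs. (3)-(5) can be sketched in a few lines of Python. This is a minimal illustration, not the actual MATLAB engine used in this work; the function names and toy feature vectors are assumptions, but the weighting follows the reverse-ordered Fibonacci scheme of Eq. (4):

```python
from collections import defaultdict
import math

def fib_weights(k):
    """Reverse-ordered Fibonacci weights (Eq. 4):
    w_k = w_{k-1} = 1 and w_i = w_{i+1} + w_{i+2},
    so nearer neighbors receive larger weights."""
    w = [1.0] * k
    for i in range(k - 3, -1, -1):
        w[i] = w[i + 1] + w[i + 2]
    return w

def wdknn_classify(train, y, k=3):
    """train: list of (feature_vector, class_label) pairs; y: test vector.
    For each class, sort that class's distances to y in ascending order,
    take the k nearest (Eq. 3's d_i^j), form the weighted sum D_j, and
    return arg min_j D_j (Eq. 5). Each class should have >= k samples;
    zip() silently truncates if it has fewer."""
    dists = defaultdict(list)
    for x, label in train:
        dists[label].append(math.dist(x, y))  # Euclidean distance
    w = fib_weights(k)
    scores = {
        label: sum(wi * di for wi, di in zip(w, sorted(ds)[:k]))
        for label, ds in dists.items()
    }
    return min(scores, key=scores.get)
```

For example, with two well-separated toy classes, a test point near one cluster is assigned to that cluster's label, since its weighted distance sum D_j is the smallest.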
[Figure 1 diagram: blocks for Emotional Speech, Preprocessing, Feature Extraction, Feature Database, and Classification]

Figure 1. Block diagram of the emotion recognition system

2.2  VoIP Telephony and Packet Capture Techniques

VoIP, also known as IP telephony, is a technology that uses the internet to accomplish telephone communication. In the past, IP phones were used internally within enterprises. However, owing to the rapid growth of the internet, the IP phone is now widely adopted and is gradually replacing the traditional telephone communication system. Currently, the most commonly used VoIP communication protocol is the Session Initiation Protocol (SIP). SIP is a protocol developed by the IETF MMUSIC working group and is used in the establishment, modification, and termination of an interactive call session. Possible applications include voice and video communication, instant messaging, online games, virtual reality, and other multimedia applications. The purpose of SIP is to define the format and the control of the packets transmitted over the internet.

SIP is a point-to-point protocol. Using a distributed architecture, SIP transmits text-based information and names addresses by URL. SIP uses a syntax similar to other internet protocols such as HTTP and SMTP, consisting of headers and a message body.

There are two types of SIP operations for a client connection: one connects the two clients in a point-to-point manner, and the other connects the parties through a proxy server. In this research, we adopt the proxy server configuration to simplify the packet capture process. The IP PBX (IP Private Branch Exchange) works as the proxy server and plays the role of packet relaying. We capture the packets going in and out of the IP PBX by port mirroring on the switch the server is connected to.

The packet capture tool used in this study is WinPcap, a tool for link-layer network access in Windows environments. It includes a kernel-level packet filter, a dynamic link library (packet.dll), and a high-level, system-independent library, and can be used on the Win32 platform to capture network packets. WinPcap consists of a driver that enables the operating system to access the low-level network, together with a library that application programs can use to access the low-level network layers easily. In addition, the WinPcap kernel provides packet filtering: through the filter settings, the driver directly discards unwanted packets in the driver layer. The performance of the packet filter is good; hence WinPcap is now widely used in a lot of software, such as Wireshark.

3  System Architecture

In this paper, we propose a framework that integrates a speech emotion recognition system into a VoIP call center system. The components of the system are shown in Figure 2. The customer and the service representative communicate with each other through VoIP phones. We define a session as the conversation between a pair consisting of a customer and a service representative. We use a layer 2 switch to mirror the packets going in and out of the IP PBX into our packet capture agent. Then, we identify the session and, if it is new, assign it a Session ID (SID). After the session is established, we extract the voice signal from the captured RTP packets and regroup it into proper segments. The speech emotion recognition system is then activated to classify the emotion of each segment. Finally, the recognition result is stored into a database. The system will issue an alert whenever a negative emotion, mainly anger, is detected.

Figure 2. Components of the proposed system

The operation steps of the system are shown in Figure 3. First, the system captures the packets and filters out the SIP and RTP packets. The Speech Emotion Recognition System (SERS) assigns a Session ID (SID) according to the phone numbers of the service representative and the customer.
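This SID assignment can be sketched as a small lookup keyed by the pair of endpoints, so that packets in either direction map to the same session. The class name, the counter scheme, and the example SIP IDs below are illustrative assumptions, not the actual implementation:

```python
import itertools

class SessionTable:
    """Assign a stable Session ID (SID) to each conversation,
    identified by the unordered pair of endpoints (here, phone
    numbers), so both packet directions share one session."""

    def __init__(self):
        self._sids = {}
        self._counter = itertools.count(1)

    def sid_for(self, src, dst):
        key = frozenset((src, dst))   # direction-independent key
        if key not in self._sids:     # new session: allocate a SID
            self._sids[key] = next(self._counter)
        return self._sids[key]

table = SessionTable()
sid = table.sid_for("101", "601")     # packet customer -> representative
same = table.sid_for("601", "101")    # reply packet, same session
```

A real session object would also carry the setup time, codec, and buffers described in Section 3.1; only the identification step is shown here.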
Then, the speech segments are sent to the emotion recognition engine to classify the emotion state. Finally, the classification results are stored into the database. The service manager can access all the results through a web page interface. The system will issue an alert to the service manager when a negative emotion is detected.

[Figure 3 diagram: Packet capture → Session identification → Voice data reconstruction → Emotion recognition → Negative emotion detection]

Figure 3. Operations of the speech emotion recognition system

3.1  Packet Capture and Session Identification

The packet is the smallest unit in network communication. Packets are transmitted across the internet through a series of switches and routers. A packet consists of a header and a payload. The header carries the control information for the transmission of the packet, including the source and destination IP addresses, the source and destination ports, etc. In our experimental environment, all of the VoIP packets pass through the SIP proxy server. In this framework, we can use a mechanism called port mirroring in the layer 2 switch to mirror the packets going in and out of the SIP server. Then we can capture the packets and analyze their content easily.

In the call center, each phone line is independent, but once a call is established, the phone number will not change during the call. Therefore, identifying the session is necessary for storing the captured RTP packets with the correct pair of the conversation. We build a session object to manage each session. The session object includes the source IP, setup time, speech coding, and the buffers used. We create a session object when a session starts and remove it when the phone call ends. In order to implement multi-line packet capture, we first need to identify the source and destination IP addresses of a packet. According to this information, we assign a session ID to the session and then allocate the storage for the session ID and related information.

3.2  Emotion Recognition Engine

Integrating an emotion recognition function into a VoIP call center system requires further effort beyond the speech emotion recognition itself. Some attempts have been made by scholars to use combinations of multiple speech features to increase the accuracy of speech emotion recognition. However, in order to process multi-line voice segments simultaneously in real time, the recognition engine needs to reduce its computational complexity. From previous studies, the MFCC feature has been proven to be one of the most robust features in emotion recognition [12]. So we choose to use only the MFCC in the emotion recognition, and we use the WD-KNN as our classifier.

In order to process multi-line voice segments, we may need more than one computer to perform the speech recognition. We design a mechanism to share the load among several speech emotion recognition engines. A virtual machine architecture is used to fully utilize the computing power of a high-performance server. With this framework, we can distribute the voice segment recognition to each virtual machine in turn. This parallel processing mechanism resolves the bottleneck problem caused by the high computational resources required by the recognition engine.

3.3  Negative Emotion Detection

In the D80 corpus, there are five basic emotion categories. We further divide them into a positive and a negative class. The positive emotions include happiness and neutral, and the negative emotions include anger, sadness, and boredom. In this paper, we focus on detecting the anger emotion occurring in the conversation between the service representative and the customer. The system will issue an alert whenever the anger emotion is detected during the conversation.

To decrease the false alarm rate, a score board judgment mechanism is implemented. The score board is zero at the beginning of a session. When the emotion recognition result is anger, the score increases by 4 points. The score decreases by 1 point when no anger emotion is detected for a voice segment, as long as the score is positive. The system confirms the anger emotion when the score exceeds a threshold. An alert flag is written to the database whenever the score exceeds the threshold, and an alert message is sent to the service manager through a web page interface whenever the alert flag is set.

4  Experiments and Result Discussion

4.1  Experimental Setup

The proposed system consists of three subsystems. A detailed description of each subsystem is presented in this section. The framework of the proposed system is shown in Figure 4.
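The score-board judgment of Section 3.3 can be sketched as follows. This is a minimal illustration under the values stated above (+4 for an anger segment, -1 otherwise while the score is positive); the function name and the latched-flag detail are assumptions:

```python
def run_scoreboard(labels, threshold=16):
    """Apply the score-board judgment to a sequence of per-segment
    emotion labels: +4 when a segment is classified as anger, -1 for
    a non-anger segment while the score is positive. The alert flag
    latches once the score exceeds the threshold."""
    score, alerted = 0, False
    for label in labels:
        if label == "anger":
            score += 4
        elif score > 0:
            score -= 1
        if score > threshold:
            alerted = True   # alert flag written to the database
    return score, alerted
```

With the default threshold of 16, a single anger segment never raises an alert; only a sustained run of anger segments does, which is what suppresses false alarms from burst misclassification.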
[Figure 4 diagram: Packet capture & Session identification → Emotion recognition engines 1…N (with a file system) → Negative emotion detection → Database and Alert]

Figure 4. Framework of the proposed system

The Asterisk IP PBX server is installed on a Linux platform. In this research, we assign SIP IDs 1xx as the phone numbers for customers, SIP ID 4xx for the service manager, and SIP IDs 6xx for the service representatives. The SIP phones are configured in the proxy mode, such that all the SIP and RTP communication packets between the customer and the service representative pass through the IP PBX. With this configuration, we can capture all the SIP and RTP packets by mirroring the traffic going into the server from the layer 2 switch where the server is attached.

The session information is stored in a table structure. When a new SIP or RTP packet is received, the packet capture module captures the packet, analyzes the content of the header, and compares it with the contents of the session table. If it is a new session, the system creates a new session identifier and assigns a new session ID to it; otherwise, the existing session ID is retrieved.

4.2  Emotion Recognition

We capture the voice signal from the conversation between the customer and the service representative. The captured voice is regrouped into sound files in WAV format, each 1 second in duration. We send the segmented voice files into the speech emotion recognition engine, and the engine outputs the recognition results. In order to increase the performance of the emotion recognition engine, we use a parallel architecture to build our recognition system. We set up several machines and install the recognition engine, written in MATLAB, on each machine. By using this structure, the bottleneck problem of the speech emotion recognition system can be avoided. The output of each recognition engine is stored into a database. The results are analyzed using the score board algorithm stated above to determine whether the anger emotion exists in the conversation.

4.3  Testing Samples

A corpus of six scripts was recorded by inviting volunteers to speak out the scripts with whatever emotion they liked. Each recording was tagged subjectively by human judges, and only the results with more than 80% agreement were kept. Human emotion seems to change gradually, so we use the score board concept to determine whether or not an anger emotion occurred in the conversation between the customer and the service representative.

In our environment, we set the score to 0 at the beginning. If a negative emotion is detected, the system adds 4 points to the corresponding session; otherwise, the system subtracts 1 point from that session if the score is positive. In order to reduce the false negative emotion recognition rate, a threshold should be set. We tested several threshold values, including 4, 8, 12, 16, and 20, to check the recognition accuracy, and compared the emotion recognition results with the human judges for the chosen threshold. The results of the negative emotion detection are listed in Table 1.

Table 1. Number of detected negative emotions with different thresholds T.

             Judge   T=4   T=8   T=12   T=16   T=20
  Script 1     2      59    36     18      1      0
  Script 2     0       2     0      0      0      0
  Script 3     0      48    33     16      5      0
  Script 4     1      21     1      1      1      0
  Script 5     7      97    85     61     29     10
  Script 6    10      54    39     23     12      7

In Table 1, the negative emotion detection results are close to the human judges when the threshold is 16, except for scripts 3 and 5. For script 3, the script contains many happy sentences, and it is hard to differentiate between happiness and anger in a speech emotion recognition system, since both are in the high-activation category. For script 5, the script consists of continuous negative emotion; due to the score board architecture, it is hard to count the disputes correctly. The advantage of the score board framework is that it can reduce judgment errors, but it cannot count the disputes exactly. For the negative emotion detection described above, the most important thing is to issue an alert when a dispute happens; the frequency of the disputes is not an important issue in this research.

5  Conclusions

The call center plays an important role for an enterprise. Obviously, collecting the opinions of the customers
and pacifying the customer if he or she is too agitated are important for the enterprise. To improve the quality of service of the call center, we can integrate a negative emotion detection mechanism into the call center system. When a service representative faces too many angry customers, the call distribution system can reduce the number of calls to that representative to avoid the representative running into an angry state. In this paper, we propose and implement a framework that can handle multi-line phone calls with negative emotion detection capability. We modularize the system components to make the system more flexible. We developed the subsystems individually, including packet capture and analysis, emotion recognition, result recording, and negative emotion detection. For the emotion recognition part, we adopt a parallel architecture to avoid the possible bottleneck problem. By combining these subsystems, we build a system that can detect negative emotion in a conversation and issue an alert to the service manager accordingly. Consequently, this system can improve the service quality of a call center, because the service manager has a chance to intervene and resolve the problem immediately.

Acknowledgement

The authors would like to thank the National Science

References

[5] … of Computational Linguistics and Chinese Language Processing, Vol. 9, No. 2, pp. 1-18, 2004.

[6] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor, "Emotion Recognition in Human-Computer Interaction," IEEE Signal Processing Magazine, Vol. 18, No. 1, pp. 32-80, 2001.

[7] X. Jin and Z. Wang, "An Emotion Space Model for Recognition of Emotions in Spoken Chinese," First International Conference on Affective Computing and Intelligent Interaction, pp. 397-402, 2005.

[8] M. Lugger and B. Yang, "Extracting voice quality contours using discrete hidden Markov models," Proceedings of Speech Prosody, 2008.

[9] T. Nwe, S. Foo, and L. De Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, Vol. 41, No. 4, pp. 603-623, 2003.

[10] A. Ortony and T. J. Turner, "What's Basic about Basic Emotions?" Psychological Review, pp. 315-331, 1990.

[11] T. L. Pao, Y. M. Cheng, Y. T. Chen, and J. H. Yeh, "Performance Evaluation of Different Weighting Schemes on KNN-Based Emotion Recognition in Mandarin Speech,"
Council (NSC) for financial support of this research under    International Journal of Information Acquisition, Vol. 4,
NSC project No: NSC 100-2221-E-036 -043.                      No. 4, pp. 339-346, Dec. 2007.

6    References                                               [12] T. L. Pao, Y. T. Chen, Chen and J. H Yeh,
                                                              “Comparison of classification methods for detecting
[1] H. Altun and G. Polat, “Boosting selection of speech      emotion from Mandarin speech,” IEICE Transactions on
related features to improve performance of multi-class        Information and Systems, Vol. E91-D, no. 4, pp. 1074-1081,
SVMs in emotion detection,” Expert Systems with               April 2008.
Applications, 36, 8197-8203, 2009
                                                              [13] T. L. Pao, Y. T. Chen, and J. H. Yeh, “Emotion
[2] C. Busso and S. S. Narayanan, “Between Speech and         Recognition and Evaluation from Mandarin Speech
Facial Gestures in Emotional Utterances: A Single Subject     Signals,” International Journal of Innovative Computing,
Study,” IEEE Transactions on Audio, Speech, and               Information and Control (IJICIC), Vol.4, no. 7, pp. 1695-
Language Processing, Vol. 15, No. 8, pp. 2331-2347, 2007.     1709, July 2008.

[3] C. Busso, S. Lee, and S. Narayanan, “Analysis of          [14] J. Rong, G. Li, and Y. P. Chen, (2009). “Acoustic
Emotionally Salient Aspects of Fundamental Frequency for      feature selection for automatic emotion recognition from
Emotion     Detection,”    IEEE     Transacations    On       speech,” Information Processing and Management, 45,
Audio ,Speech, And language Processing, Vol. 17, No. 4,       315-328, 2009
pp.582-596, May. 2009
                                                              [15] D. Ververidis and C. Kotropoulos, C, “Fast and
[4] M. Cernak and C. Wellekens, “Emotional aspects of         accurate sequential floating forward feature selection with
intrinsic speech variabilities in automatic speech            the Bayes classifier applied to speech emotion recognition,”
recognition,” International Conference on Speech and          Signal Processing, 88, 2956-2970, 2008
Computer, 405-408, 2006
                                                              [16] B. Yang and M. Lugger, “Emotion recognition from
[5] Z. J. Chuang and C. H. Wu, “Multi-Modal Emotion           speech signals using new Harmony features,” Signal
Recognition from Speech and Text,” International Journal      Processing, 90, 1415-1423, 2010
