					                          A 2+1-LEVEL STOCHASTIC UNDERSTANDING MODEL

                    Hélène Bonneau-Maynard∗                                               Fabrice Lefèvre

                       LIMSI/CNRS                                                          LIA/CNRS
              Spoken Language Processing Group                                        University of Avignon

   ∗ This work was partially supported by the French Ministry of Research under the EVALDA-MEDIA project.

                          ABSTRACT

In this paper, an extension of the 2-level stochastic understanding system is presented. An additional stochastic level is introduced in the system as the attribute value normalization module. In order to improve the model trainability, the conceptual decoding and value normalization steps are decoupled, leading to a 2+1-level system. The proposed approach is evaluated on the French MEDIA task (tourist information and hotel booking). This new 10k-utterance corpus is segmentally annotated, allowing for a direct training of the 2-level conceptual models. Further developments of the system (modality propagation and hierarchical recomposition) are also investigated. On the whole, the proposed improvements achieve a 24% relative reduction of the understanding error rate, from 37.6% to 28.8%.

                    1. INTRODUCTION

The recourse to stochastic techniques for spoken understanding modeling offers an efficient alternative to rule-based techniques by reducing the human expertise and development cost [1, 2, 3, 4, 5]. In a previous paper [6], the development of a baseline 2-level understanding model was presented on the ARISE task (railway timetables and ticket booking). The present paper presents two main improvements of our understanding system: the recourse to a segmentally annotated training corpus and the application of stochastic models in the value normalization step.
    Firstly, based on the idea that a keyword annotation of the training corpus does not allow a powerful 2-level modeling (which is then trained on artificially derived, keyword-centered, fixed-length word sequences), this paper studies the use of a new annotation scheme. The semantic annotation is aligned on a thorough segmentation of the queries. Secondly, the whole understanding process is converted to an embedded stochastic model, including the normalization phase, which was previously rule-based. The participation in the French Technolangue EVALDA-MEDIA understanding evaluation project allows us to validate these improvements on a new task.
    The aim of the French EVALDA-MEDIA project [7] is to define and to test an evaluation methodology to compare and to diagnose the understanding capability of spoken language dialog systems. The evaluation environment relies on the premise that, for database query systems, it is possible to define a common semantic representation to which each system is capable of converting its own internal representation. Systems from both academic organizations and industrial sites are engaged.
    The MEDIA project is held in two evaluation phases: context-independent and context-dependent understanding evaluation. Context-independent understanding corresponds to the building of the literal representation of the meaning of an isolated spoken utterance, whereas context-dependent understanding corresponds to the semantic representation of the query, taking into account the dialog context. In each case, the project participants benefit from the same corresponding annotated training corpus to allow the adaptation of their models to the task and the domain. The MEDIA task concerns the reservation of hotel rooms with tourist information in France, using information obtained from a web-based database.
    The paper is organized as follows. The next section describes the semantic annotation scheme. The stochastic understanding modeling is described in Section 3. Section 4 describes the MEDIA corpus. Then, after the experimental setup description, the last section reports the results on the test data.

          2. SEMANTIC REPRESENTATION AND ANNOTATION SCHEME

The speech understanding module is the front end of the dialog manager. Its role is to analyze the user query and to produce a representation of its semantic content that allows the dialog manager to take a decision about the dialog follow-up, taking into account the context.
    The MEDIA evaluation paradigm relies on a common generic semantic representation described in detail in [8]. The representation is based on an attribute-value structure
                 word seq.          mode     attribute name                             normalized value
                 euh                 +       null
                 oui                 +       response                                   yes
                 l’                  +       refLink-coRef                              singular
                 hôtel               +       BDObject                                   hotel
                 dont                +       null
                 le prix             +       object                                     payment-amount
                 ne dépasse pas      +       comparative-payment                        less than
                 cent dix            +       payment-amount-integer-room                110
                 euros               +       payment-unit                               euro
Fig. 1. Example of the semantic attribute/value representation for the sentence “hum yes the hotel whose price doesn’t exceed one hundred and ten euros”. The relations between attributes are given by their order in the representation and the composed attribute names. The segments are aligned on the sentences.
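
As an illustration only, the rows of Fig. 1 can be held in a small Python structure. The class name `Segment`, the helper `to_triplets` and the choice to skip `null` rows when listing (mode, attribute, value) triplets are assumptions made for this sketch; they are not the data format of the MEDIA annotation tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """One row of Fig. 1 (hypothetical container, for illustration)."""
    words: str                   # word sequence covered by the segment
    mode: str                    # '+', '-', '?' or '~'
    attribute: str               # 'null' marks a segment without semantic content
    value: Optional[str] = None  # normalized value, absent for null segments


# Rows transcribed from Fig. 1, in utterance order.
FIG1: List[Segment] = [
    Segment("euh", "+", "null"),
    Segment("oui", "+", "response", "yes"),
    Segment("l'", "+", "refLink-coRef", "singular"),
    Segment("hôtel", "+", "BDObject", "hotel"),
    Segment("dont", "+", "null"),
    Segment("le prix", "+", "object", "payment-amount"),
    Segment("ne dépasse pas", "+", "comparative-payment", "less than"),
    Segment("cent dix", "+", "payment-amount-integer-room", "110"),
    Segment("euros", "+", "payment-unit", "euro"),
]


def to_triplets(segments: List[Segment]) -> List[Tuple[str, str, Optional[str]]]:
    """Return (mode, attribute, value) triplets in utterance order.

    Assumption of this sketch: rows whose attribute is 'null' are skipped.
    """
    return [(s.mode, s.attribute, s.value)
            for s in segments if s.attribute != "null"]


triplets = to_triplets(FIG1)
for triplet in triplets:
    print(triplet)
```

Because the list keeps the utterance order, the relations carried by the order of the segments and by the composed attribute names remain readable from the triplets.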

in which conceptual relationships are implicitly represented by the names of the attributes.
    The semantic representation relies on a hierarchy of basic attributes, which are identified in a semantic dictionary jointly developed by the MEDIA consortium. This conceptual hierarchy also provides a set of relationships between semantic units. Each turn of a dialog is segmented into one or more dialogic segments, and each dialogic segment is segmented into one or more semantic segments, with the assumption that a semantic segment corresponds to a single attribute. An example of a semantic representation of a client utterance is given in Figure 1.
    A semantic segment is represented by a 5-tuple which contains:

   • the mode: affirmative ’+’, negative ’-’, interrogative ’?’ or optional ’~’,
   • the name of the attribute representing the meaning of the sequence of words,
   • the value of the attribute,
   • some optional links: pointers to related segments in previous utterances (only useful for contextual semantic representation),
   • an optional comment on the segment.

The order of the 5-tuples in the semantic representation follows their order in the utterance. The attribute values are either numeric units, proper names or semantic classes merging lexical units which are synonyms for the task. The modes are assigned on a per-segment basis. This makes it possible to disambiguate sentences such as “not in Paris in Nancy”, which could otherwise be misleading for the dialog manager.

2.1. Semantic dictionary

The basic attributes can be divided into several classes. The database attributes correspond to the attributes of the database tables (eg BDObject or payment-amount). The modifier attributes (eg comparative) are linked to database attributes and are used to modify the meaning of the database attribute they are linked to (eg in Figure 1 the comparative attribute, whose value is less than, is associated with the payment-amount attribute). General attributes are also defined, such as command-task, which includes the different actions that can be performed on objects of the task, or command-dial, with values cancellation, correction... One of the general attributes, refLink, is dedicated to the annotation of references [8].
    The general and modifier attributes are domain independent and were directly derived from other applications [6, 10], whereas most of the database attributes were derived from the database linked to the system.
    The set of normalized values associated with each attribute is defined in the semantic dictionary, with 3 different possible configurations:

   • a value list (eg comparative with possible values around, less-than, maximum, minimum and [...]),
   • regular expressions (as for dates),
   • open values (i.e. no restrictions, as for client names).

2.2. Semantic Segmentation

In our preceding work, the semantic annotation was keyword-based: the attributes were associated with the words which determine their value. In the chosen annotation scheme, a query is segmented into semantic segments: the attributes are associated with sequences of words - the segments - which better disambiguate their semantic role. An example is given in Figure 2. The segmentation instructions used by the annotators rely on simple empirical rules essentially based on the syntactic structure of the sentence (usage of verbs, determiners...), but also address the case of repetitions, the use of filler words... In terms of development cost it should be
noted that the segment-based annotation appears to be more natural and easier for the annotators than the keyword-based scheme, for which they often had difficulty deciding which keywords to select.

     Keyword       I’d like hum to reserve in the area of Montparnasse
     Segmental     I’d like hum to reserve / in the area of / Montparnasse
Fig. 2. Illustration of the keyword and segmental annotation schemes. Keywords (first row) are given in bold, whereas semantic segments (second row) are separated with slashes.

2.3. From flat annotation to hierarchical representation

Hierarchical semantic representation is powerful as it makes it possible to explicitly represent relationships between segments, possibly non-adjacent in the transcription of the query. On the other hand, a flat representation facilitates the manual annotation. It was therefore decided that the MEDIA annotation scheme would preserve the relationships by defining a set of specifiers which are combined with database or modifier attributes. For example in Figure 1, the attribute comparative-payment is derived from the combination of the comparative attribute and the payment specifier, and the attribute payment-amount-integer-room is derived from the combination of the payment-amount-integer attribute with the specifier room. The combination of the specifiers and the attribute names makes it possible to derive a hierarchical representation of a query from its flat annotation [8].


The aim of stochastic understanding is to find the best sequence of concepts $C = c_1 c_2 \ldots c_N$ that will represent the meaning of the sentence. The assumption is made that there is a sequential correspondence between the concept and word sequences [1].
    Given $W = w_1 w_2 \ldots w_N$ the sequence of words in the sentence, the understanding process consists of finding the sequence of concepts which maximizes the a posteriori probability, rewritten according to the Bayes formula:

\[ \hat{C} = \arg\max_{C} \Pr(C \mid W) = \arg\max_{C} \Pr(W \mid C) \Pr(C) \tag{1} \]

The term $\Pr(W \mid C)$ is estimated by means of n-gram probabilities of words given the concept associated to word $i$:

\[ \Pr(W \mid C) \approx \prod_{i=1}^{N} \Pr(w_i \mid w_{i-1}, \ldots, w_{i-n}, c_i) \tag{2} \]

and $\Pr(C)$ is estimated in terms of m-gram probabilities of concepts:

\[ \Pr(C) \approx \prod_{i=1}^{N} \Pr(c_i \mid c_{i-1}, \ldots, c_{i-m}) \tag{3} \]

    Based on this formulation, several approaches can be considered depending on the orders of the models used to produce the estimates of $\Pr(W \mid C)$ and $\Pr(C)$. Generally, concept bigrams $\Pr(c_i \mid c_{i-1})$ ($m = 1$) are sufficient to model the concept sequences. Depending on the availability of training data with segmental annotation, word bigrams conditioned on concept $\Pr(w_i \mid w_{i-1}, c_i)$ ($n = 1$) can be used. Generalization of the word n-gram models can be improved by the use of a set of lexical classes.
    The first decoding stage aligns an attribute name to each sub-sequence of the query. Segmented word strings then have to be converted to the corresponding normalized form expected in the semantic dictionary.
    In the example of Figure 1, the normalization module translates the sequence ne dépasse pas (isn’t greater than), assigned to the attribute comparative-payment, to the normalized form less than. Several lexical sequences can correspond to the same normalized value.
    This normalization step is generally obtained by means of a set of rules. In this work we propose to extend the stochastic model with an additional level for the value normalization. In this context, a new formulation of the concept decoding is given, where the concept sequence is combined with a value sequence 1:

\[ (\hat{C}, \hat{V}) = \arg\max_{C, V} \Pr(C, V \mid W) = \arg\max_{C, V} \Pr(W \mid C, V) \Pr(C, V) \]

   1 A concept is then a combination of an attribute and a modality.

    This leads to the probabilities being conditioned on the normalized values. Consequently, the number of corresponding states in the conceptual model would be greatly increased. Moreover, this solution is inadequate when an attribute accepts an open set of values (like numbers or client names). For these reasons, the normalization level is not totally embodied in the conceptual model but is applied after the concept decoding.

\[ \hat{C} = \arg\max_{C} \sum_{V} \Pr(W \mid C, V) \Pr(C, V) \tag{4} \]

\[ \hat{V} = \arg\max_{V} \Pr(\hat{C}, V \mid W) = \arg\max_{V} \Pr(W \mid \hat{C}, V) \Pr(\hat{C}, V) \tag{5} \]

Equation 4 allows for a better generalization of the conceptual model. Furthermore the hypothesis that the normalized
values have a slight or no influence on the segmentation process seems reasonable. The latter equation then leads to a separate evaluation of the normalization models associated with every possible concept and its expected values. In this way the system can be considered a 2+1-level system. The conditional independence assumptions on the probabilities are represented in the diagram of Figure 3 in the formalism of dynamic Bayesian networks (DBN).
    Finally, the attribute-value representation (AVR) proposed by the whole understanding process is obtained after removing the lines corresponding to the null attribute. In the example of Figure 1, the resulting AVR is given in the last three columns.

[Figure 3: two time slices, each with a Concept node, a Value node and a Word node (Concept i, Value i, Word i; Concept i+1, Value i+1, Word i+1).]
Fig. 3. DBN representation of the 2+1-level stochastic understanding model.

              4. CORPUS DESCRIPTION

The MEDIA dialog corpus was recorded using a Wizard-of-Oz (WOZ) system simulating a vocal tourist information phone server [7]. 1257 dialogs were recorded from 250 different speakers, where each caller carried out 5 different hotel reservation scenarios. Several starting points were possible for the dialogs, i.e. choice of town, itinerary, touristic event, festival, price, date etc. Eight scenario categories were defined, each with a different level of complexity. The final corpus is on the order of 70 hours of transcribed dialogs.
    In this paper we focus on literal understanding. The corpus (Table 1) consists of a training portion of 10965 requests and a preliminary test portion of 1009 requests. The 676 proper names appearing in the corpus, which correspond mainly to city names (201) and hotel names (548), are highly ambiguous for the task.
    The semantic dictionary defined for the MEDIA project includes 83 basic attributes and 19 specifiers. The combination of the basic attributes and the specifiers - automatically generated by the annotation tool - results in a total of 1121 attributes that can be used during the annotation process. The 83 basic attributes include 73 database attributes, 4 modifiers, and 6 general attributes. The total number of different normalized values in the training corpus is around [...]

                                          train    test
 #utterances                              10965    1009
 mean #words per utterance                  4.8     5.4
 number of different words                 2115     794
 number of observed attributes            29980    3125
 mean #attributes per utterance             2.7     3.1
 number of different attributes             144     106
Table 1. Main characteristics of the client utterances in the training and test corpus.

    Semantic annotation has been done on the transcriptions of the dialogs, using a specific annotation tool 2, by two ELDA annotators under LIMSI supervision. The tool ensures that the provided annotation respects the semantic representation defined in the semantic dictionary. An on-line verification is performed on the attribute value constraints. In order to verify the quality of the annotations, periodic evaluations were performed, showing that the attribute inter-annotator agreement is always greater than 80%, resulting in a kappa [9] of more than 0.8, which is commonly considered in the literature as good.

   2 download at [...]

    Table 1 gives details on both training and test corpus. The most frequent attribute is the yes/no response (17%), followed by reference attributes (6.9%) and command-task (6.8%). It is interesting to note that the most frequently encountered attributes are task-independent (localization, time, ...) and that task-dependent attributes (hotel, room...) represent only 14.1% of the observed attributes. A total of 144 distinct attributes appear in the training corpus. Only one attribute of the test corpus was not observed in the training corpus.

              5. EXPERIMENTS AND RESULTS

The scoring tool developed for the MEDIA project makes it possible to align two semantic representations and to compare them in terms of deletion, insertion, and substitution. The scoring can be done on the whole triplet including [mode, attribute name and attribute value] (full eval). It is also performed by using a simplification which consists in applying at the same time a projection of modes ’~’ and ’?’ to the ’+’ mode (resulting in a mode distinction limited to ’+’ and ’-’), and a relaxation function on attribute names which removes the specifiers (relax eval). The value eval corresponds to a scoring only on the attribute values.

5.1. Baseline rule-based normalization

The baseline experiment relies on the methodology described in [6]. The 2-level conceptual model is trained on the
10,965-utterance training corpus. Thanks to the segmental annotation of the training corpus, no transformation of the annotations (like usage of concept-markers) is needed to estimate the word bigrams on concepts. Several lexical classes are used to generalize the estimate of the bigrams. They are derived from the database attribute values, and consist only of words syntactically and semantically equivalent for the task (such as hotel names, city names, numbers, dates...). The entries in the classes are the base forms of the entities (for instance aeroport charles de gaulle) which could appear in various surface forms in the word strings (Aeroport Charles de Gaulle, aeroport de Gaulle...). In order to deal with ambiguities (the same sequence of words can belong to several classes depending on the underlying concept), the classes are applied selectively to the concepts.
    The segmental annotation of the training data is also used to automatically derive the set of rewriting rules for the normalization process. A rewriting rule is derived from each observation of a concept in the training corpus. Attribute-dependent sets of rules are then obtained. To improve generalization, automatically derived rules are shared between modalities (+/-/?/~) of the same attribute and between attributes which only differ in their specifier part. A specific treatment is applied to the normalization of numbers and dates.
    The 2-level understanding process is performed after applying a transduction, so as to detect class elements in the word strings and convert them to their base form. Filler words (such as “euh”, “ah”...) are also removed before the conceptual decoding. The understanding error rates for the baseline system are shown in Table 2: 37.6% in full mode, down to 23.0% in relax eval and 20.8% for value assignment.

                                #tags   full    relax    values
 baseline                       390     37.6    23.0     20.8
 stochastic norm                390     36.9    21.8     19.6
 stochastic norm+               393     36.9    21.6     19.2
 modality propagation           393     35.6    21.3     19.2
 specifiers post-processing     344     28.8    21.5     19.7
Table 2. Understanding error rates (%) on the test set with different evaluation modes. First column gives the number of concepts in the model. Baseline is a 2-level system with rule-based normalization.

5.2. Stochastic value normalization

The embedded stochastic normalization described in Section 3 is introduced in the baseline system described above. Results are given in Table 2. The stochastic normalization allows a relative improvement of 6% on the value identification (last column) compared to the rule-based normalization. The relative improvement is increased to 7.6% by considering a more elaborate decision setup. Instead of directly comparing the word sequence likelihoods, two kinds of penalty are applied beforehand. Three cases are distinguished:

   1. All the words in the tested sequence are out-of-vocabulary words (OOVs) for the considered value.
   2. The normalized value is present in the word sequence with the form expected in the semantic dictionary.
   3. All other cases.

    The stochastic normalization has also been tested with 3-grams. Another contrastive experiment consisted in varying the out-of-vocabulary (OOV) fraction used during the language model training. In the value normalization models, a fraction of every 2-gram or 3-gram count is subtracted and assigned to words not present in the training data (represented by the unknown token <UNK>). In this context, the models are small-sized and so the fraction amount may have a certain impact on the log-likelihood values for sequences containing OOVs. Our baseline value (0.1) has been increased to 0.5. However, we observed that neither increasing the model order to 3-grams nor making greater provision for OOVs shows any significant gain on the understanding error rate.
    These experiments show that the proposed stochastic normalization modeling can generalize better from the very small set of examples available for each normalized value, and thus represents an efficient alternative to the rule-based normalization. However, the important gap observed between full and relax error rates in Table 2 indicates that mode confusion and specifier misplacement are two main sources of the system errors and should be addressed explicitly.

5.3. Modality detection

Some improvement for the mode detection is obtained by modifying the semantic annotations of the training corpus. As indicated in Section 2, each semantic segment has to be assigned a modality which, according to the annotation guidelines, is not necessarily the same for all attributes of an utterance. For instance, in the sentence “no / Roissy / not Paris / Roissy”, the second and fourth segments (Roissy) are assigned the affirmative ’+’ mode whereas the third segment (not Paris) is assigned the negative ’-’ mode. In “and / what are the prices / for the hotel / near / Montparnasse”, the interrogative mode is assigned to the second segment and the affirmative mode to all other segments: the client interrogation bears on the prices and not on the localization.
    During the manual annotation the positive mode has been attributed by default to the null segments regardless of their context. However, during the conceptual decoding, these non-modal null segments introduce discontinuities
that may stop the propagation of the mode along adjacent segments. An
automatic modification of the annotation is then applied, which
consists in propagating the mode to the null segments enclosed between
two segments of the same mode. As shown in Table 2, this prior
propagation improves modality detection, resulting in a 3.6% relative
improvement of the understanding error rate.

5.4. Hierarchical recomposition

The 2+1-level model has no mechanism to deal with hierarchical and
long-span dependencies. Even though techniques applicable to
sequential stochastic models have been recently proposed [11, 4], a
full tree-based semantic parsing of the sentences seems out of the
scope of models based on bigrams. Taking into account this limit of
our models, the concept specifier identification is improved through a
two-step procedure.
    It has been observed that the most frequent concept specifiers are
activated in specific semantic contexts which can have a long-span
influence but can be easily described in terms of basic concept
presence. For instance, the specifier reservation appears nearly
exclusively when a command-task concept with value reservation has
been decoded in the utterance. As a consequence these contexts can be
retrieved after the concept decoding. This allows decoding with models
trained without specifiers (reducing the total number of concepts from
390 to 344 and leading to a better generalization of the models); the
specifiers are then added in a second step. Several simple hand-made
rules have been elaborated to detect when appropriate specifiers
should be added and to which concepts they should be applied. This
procedure yields a substantial relative improvement of 19% (see last
two rows of Table 2) despite a slight increase of the value error rate
(from 19.2 to 19.7), due to a greater number of possible values per
concept.

                    6. CONCLUSION

An embedded 2+1-level stochastic model for speech understanding has
been proposed and evaluated. Taking advantage of a new segmental
annotation scheme and new specific techniques, this model has been
applied successfully to the MEDIA touristic information retrieval
task. The task is challenging as the conceptual model represents 344
concept tags and includes hierarchical constituents. From the baseline
2-level model, the introduction of the mode propagation technique and
a new stochastic normalization module has led to a global 24% relative
improvement of the understanding error rate. An important part of the
system performance improvement comes from a simple but efficient way
of dealing with the identification of the specifiers used to represent
hierarchical non-adjacent relationships between semantic segments.
However, the remaining errors on specifiers still account for around
25% of all the errors. Therefore an enhancement of the current method
with the introduction of statistical classifiers is planned to reach a
better coverage of all the semantic contexts for a more precise
hierarchical recomposition of the concepts.

                    7. REFERENCES

[1] E. Levin and R. Pieraccini, “Concept-based Spontaneous Speech
    Understanding System”, in ESCA Eurospeech, Madrid, 1995,
    pp. 555-558.

[2] R. Schwartz et al., “Hidden Understanding Models for Statistical
    Sentence Understanding”, in IEEE ICASSP, Munich, 1997.

[3] F. Pla, A. Molina, E. Sanchis, E. Segarra and F. Garcia, “Language
    Understanding using Two-level Stochastic Models with POS and
    Semantic Units”, LNCS series, vol. 2166, pp. 403-409, 2001.

[4] Y. He and S. Young, “Hidden Vector State Models for Hierarchical
    Semantic Parsing”, in IEEE ICASSP, Hong Kong, 2003.

[5] C. Raymond, F. Bechet, N. Camelin, R. de Mori and G. Damnati,
    “Semantic Interpretation with Error Correction”, in IEEE ICASSP,
    Montreal, 2005.

[6] F. Lefevre and H. Bonneau-Maynard, “Issues in the development of a
    stochastic speech understanding system”, in ICSLP, Denver, 2002.

[7] L. Devillers, H. Maynard et al., “The French MEDIA/EVALDA project:
    the evaluation of the understanding capability of Spoken Language
    Dialogue Systems”, in LREC, Lisbon, 2004.

[8] H. Bonneau-Maynard, S. Rosset et al., “Semantic annotation of the
    MEDIA corpus for spoken dialog”, in ISCA Eurospeech, Lisbon, 2005.

[9] J. Carletta, “Assessing agreement on classification tasks: the
    kappa statistic”, Computational Linguistics, 22(2):249-254, 1996.

[10] H. Bonneau-Maynard and S. Rosset, “Semantic representation for
    spoken dialog”, in ISCA Eurospeech, Geneva, 2003.

[11] A. Molina and F. Pla, “Shallow Parsing using Specialized HMMs”,
    Journal of Machine Learning Research (Special Issue on Shallow
    Parsing), vol. 2, pp. 595-613, 2002.
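
The second step of the procedure described above (adding specifiers after decoding, based on the presence of a basic concept) can be sketched as follows. The actual hand-made rules are not given in the paper, so everything specific here is hypothetical: the SpecifierRule structure, the target concept name "time-date", and the convention of appending the specifier to the concept name with a hyphen. Only the trigger (a command-task concept with value reservation activating the specifier reservation) comes from the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Concept = Tuple[str, str]  # (concept name, normalized value)


@dataclass(frozen=True)
class SpecifierRule:
    """Hypothetical rule: if the trigger concept/value pair was decoded
    anywhere in the utterance, attach `specifier` to the target concepts."""
    trigger_concept: str
    trigger_value: str
    specifier: str
    targets: FrozenSet[str]


def add_specifiers(decoded: List[Concept],
                   rules: List[SpecifierRule]) -> List[Concept]:
    """Apply the first matching rule to each specifier-free concept."""
    present = set(decoded)  # utterance-wide context, independent of distance
    output = []
    for name, value in decoded:
        for rule in rules:
            trigger = (rule.trigger_concept, rule.trigger_value)
            if name in rule.targets and trigger in present:
                # Naming convention assumed for the example only.
                name = f"{name}-{rule.specifier}"
                break
        output.append((name, value))
    return output


# Illustrative rule set; the target concept name is an assumption.
rules = [
    SpecifierRule("command-task", "reservation", "reservation",
                  frozenset({"time-date"})),
]
decoded = [("command-task", "reservation"), ("time-date", "tomorrow")]
recomposed = add_specifiers(decoded, rules)
```

Because the trigger is looked up in the whole utterance, such a rule captures the long-span context that a bigram-based decoder cannot model directly.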
