Using Dialog Corpora to Train a Chatbot
Bayan Abu Shawar and Eric Atwell, University of Leeds


The paper presents the following:
ALICE and Elizabeth chatbot systems.
Examples of the Dialogue Diversity Corpus and its problems.
A Java program to convert from dialogue transcript to AIML Format.
 Using Wmatrix to compare human and chatbot dialogue.



A Chatbot


 A chatbot is a conversational agent that interacts with users using
   natural language.


 ALICE and Elizabeth chatbots are presented in this paper.


 Both were adapted from ELIZA (Weizenbaum 1966), which
  emulated a psychotherapist.


ALICE System


ALICE: the Artificial Linguistic Internet Computer Entity; a
software robot that you can chat with using natural language.


ALICE language knowledge is stored in AIML files.


AIML: the Artificial Intelligence Markup Language.



AIML Files are made up of :


 Topics : each Topic file contains a list of categories
 Categories: contain
           Pattern: to match with user input
           Template: represents ALICE output


Patterns can match parts of input: “divide and conquer”




The AIML Format


<aiml version="1.0">
<topic name="the topic">
<category>
 <pattern>PATTERN</pattern>
 <template>Template</template>
</category>
  ..
</topic>
</aiml>
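As a rough illustration of this layout, the sketch below builds the same nesting (aiml, topic, category, pattern, template) with Python's standard library. The function name `build_aiml` and the sample pattern/template pair are assumptions made for the example, not part of ALICE.

```python
import xml.etree.ElementTree as ET

def build_aiml(topic_name, pairs):
    """Build an AIML tree: one topic holding one category per (pattern, template) pair."""
    aiml = ET.Element("aiml", version="1.0")
    topic = ET.SubElement(aiml, "topic", name=topic_name)
    for pattern_text, template_text in pairs:
        category = ET.SubElement(topic, "category")
        ET.SubElement(category, "pattern").text = pattern_text
        ET.SubElement(category, "template").text = template_text
    return aiml

# Hypothetical sample content, chosen only to show the structure.
doc = build_aiml("the topic", [("HELLO", "Hello there!")])
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

Building the tree with an XML library rather than string concatenation means special characters in corpus text (such as & or <) are escaped automatically.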
Example involving <srai> - recursion:
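Input:        Halo, what is 2 and 2 ?
Normalized:   HALO WHAT IS 2 AND 2

The normalized input is split into two parts, each matched in turn:

  Left branch, <sr/>:
      HALO -> HELLO
      Responses: "Well hello there!", "Hi. I was waiting to talk", "Hello there!"

  Right branch, <srai>WHAT IS 2 AND 2 </srai>:
      WHAT IS 2 AND * -> WHAT IS 2 *
      Responses: "Two", "Four", "Six"

Output:       Hello there! Four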
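The recursion in this example can be sketched in a few lines of Python. This is a toy matcher, not ALICE's: the category table, the rule that the longest "PREFIX *" pattern wins, and the terminal category WHAT IS 2 2 are all assumptions made so that the example is self-contained.

```python
import re

# Hypothetical mini knowledge base. <sr/> is treated as shorthand for
# <srai><star/></srai>, i.e. "re-match whatever the wildcard captured".
CATEGORIES = {
    "HALO *": "<srai>HALO</srai> <sr/>",
    "HALO": "<srai>HELLO</srai>",
    "HELLO": "Hello there!",
    "WHAT IS 2 AND *": "<srai>WHAT IS 2 <star/></srai>",
    "WHAT IS 2 2": "Four",
}

def normalize(text):
    """Uppercase the input and replace punctuation with spaces."""
    return " ".join(re.sub(r"[^A-Za-z0-9 ]", " ", text).upper().split())

def match(text):
    """Return (template, star) for an exact match, else the longest 'PREFIX *' match."""
    if text in CATEGORIES:
        return CATEGORIES[text], ""
    best = None
    for pattern, template in CATEGORIES.items():
        if pattern.endswith(" *"):
            prefix = pattern[:-1]  # keeps the trailing space, e.g. "HALO "
            if text.startswith(prefix) and len(text) > len(prefix):
                if best is None or len(prefix) > len(best[0]):
                    best = (prefix, template, text[len(prefix):])
    return (best[1], best[2]) if best else ("", "")

def respond(text):
    """Match the input, then recursively resolve every <srai> in the template."""
    template, star = match(normalize(text))
    template = template.replace("<sr/>", "<srai><star/></srai>")
    template = template.replace("<star/>", star)
    return re.sub(r"<srai>(.*?)</srai>", lambda m: respond(m.group(1)), template)

print(respond("Halo, what is 2 and 2 ?"))
```

Each `<srai>` hands a smaller piece of the input back to the matcher, which is the "divide and conquer" behaviour: the greeting and the question are answered separately and the results are concatenated.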
Elizabeth system (Millican 2002)


Knowledge is stored as a script in a text file.
Each line starts with a script command notation.
These notations are:
W: Welcome message          Q: quitting message
V: Void input               I: Input transformation
K: Key word pattern         R: key word response
N: No match                 O: Output transformation
M: Memorise phrase          &: Action to be performed
/ : Comment
Pattern Matching is more complex in Elizabeth


The matching process involves five phases:


  1. Matching with Input Transformation Rules.
  2. Matching with Keyword patterns.
  3. Matching with Output transformation rules.
  4. Matching with Void or No keyword messages.
  5. Performing any Dynamic processes.


… ALICE categories are simpler, easier to Machine Learn, but
  we can also convert from AIML format to Elizabeth script.
Example:
Input: I think my mum loves my brother more than me
Match Algorithm:
1. I think my mother loves my brother more than me
2. WHY DO YOU THINK [my mother loves my brother more than
   me]?
3. WHY DO YOU THINK YOUR MOTHER LOVES YOUR
   BROTHER MORE THAN YOU?


Response: WHY DO YOU THINK YOUR MOTHER LOVES YOUR
   BROTHER MORE THAN YOU?
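The three steps of this example can be imitated with a small Python sketch. It does not use Elizabeth's actual script syntax: the rule tables, the regular expressions and the fallback message below are assumptions chosen only to reproduce the example.

```python
import re

# Hypothetical rule tables standing in for Elizabeth script lines.
INPUT_TRANSFORMS = [(r"\bmum\b", "mother")]                      # I: input transformation
KEYWORD_RULES = [(r"^i think (.+)$", "WHY DO YOU THINK {0}?")]   # K / R: keyword and response
OUTPUT_TRANSFORMS = {"my": "YOUR", "me": "YOU", "i": "YOU"}      # O: output transformation
NO_MATCH = "TELL ME MORE."                                       # N: no match (made-up text)

def respond(user_input):
    text = user_input.strip().rstrip(".!?")
    # Step 1: input transformation (mum -> mother).
    for pattern, replacement in INPUT_TRANSFORMS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    # Step 2: keyword pattern matching, capturing the phrase after "I think".
    for pattern, response in KEYWORD_RULES:
        found = re.match(pattern, text, flags=re.IGNORECASE)
        if found:
            # Step 3: output transformation, swapping pronouns word by word.
            words = [OUTPUT_TRANSFORMS.get(w.lower(), w.upper())
                     for w in found.group(1).split()]
            return response.format(" ".join(words))
    return NO_MATCH

print(respond("I think my mum loves my brother more than me"))
```

The sketch covers only the input, keyword and output phases; void input, memorised phrases and dynamic actions are left out.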


Machine Learning from the Dialog Diversity Corpus


The DDC is a collection of links to different dialogue corpora in
  different fields.
Examples of these dialogue corpora are:
MICAS Corpus
CIRCLE Corpus
CSPA Corpus
The TRAINS Dialogue Corpus
ICE-Singapore Corpus
Mishler Book Medical Interview
MICAS Corpus
Michigan Corpus of Academic Spoken English: a collection of transcripts of
academic speech events recorded at the University of Michigan.
Astronomy transcript:

S1: circumpolar stars. So if I keep my pointer there, [S2: oh ]
    <ROTATES CEILING> everything else moves and we all get
    sick. <SS LAUGH> and we go backwards in time. And that’s
    even more fun.
S2: make it go really really fast.
Problems:
Long monologues
Overlapping
More than two speakers
Extra annotations recording actions, such as <SS LAUGH>
CIRCLE Corpus
Centre for Interdisciplinary Research on Constructive Learning Environments

A collection of transcripts holding different tutorial sessions on
topics such as physics, algebra and geometry.

Algebra transcript

TUTOR [ Opening remarks and asks student to read out aloud
        and begin]

STUD [Reads problem] Mike starts a job at McDonald’s that will
  pay him 5 dollars and hour, Mike gets dropped off by his parents
  at the start of is shift. Mike works a “h” hour shift. Write an
  expression for how much he makes in one night?

[Writes “h*5 = how much he makes”]
Physics transcripts

T: [student name], I’d like you to read the problem carefully,
   and then tell me your strategy for solving this.

S: ok
       [Pause 17 sec]
   hmm.
       [Pause 6 sec]
T: thinking out loud as much as possible is good


Problem:

 Different format structures were used to distinguish speakers
 and linguistic annotation
CSPA Corpus

Corpus of Spoken Professional American-English
Includes transcripts of conversations of various types.

LANGER: Hello, I’m delighted to be here.
I have carefully read and heard about the University of Albany, the
State University of New York. And I’m also the director of the
National Research Center on English Learning and Achievement.

STRICKLAND: Her mother wrote the stances.
(Laughter)

Problems:
      Long turn monologues.

       The transcripts were not “anonymised”.
The TRAINS Dialogue Corpus

A corpus of task-oriented spoken dialogue that has been
used in several studies of human-human dialogue.

utt10 : what you'll have to do is you'll have to uh pick out
an <sli> uh an engine <sli> and schedule a train to do that
utt11 : u: okay <sli> um <sli> engine <sli> two
utt12 : s: + okay +
utt13 : u: + from + Elmira
utt14 : s: + mm-hm +


Problem

Dealing with extra-linguistic annotation such as “+” and
 <sli>
ICE-Singapore

International Corpus of English, Singapore English

<$B>
<ICE-SIN: S1A-099#33:1:B>
How how are things otherwise
<ICE-SIN:S1A-099#34:1:B>
Are you okay
<$A>
<ICE-SIN:S1A-099#35:1:A>
Uhm okay lah

Problems

       Unconstrained conversations
       A lot of linguistic annotation
       Great variation in turn length
Mishler Book Medical Interviews


A scanned text image, including dialogue between patient and
physician.


Problems


       Scanned image cannot be converted to text format

       Extra linguistic annotation



Desired dialogue corpus characteristics for machine learning


 We developed a Java program to read a transcript from the DDC and
  convert it to AIML format in order to retrain ALICE.


 Problems arise when extracting ALICE categories from the DDC:

      No standard formats to distinguish between speakers.
      Extra-linguistic annotations were used.
      No standard format in using linguistic annotations.
      Long turns and monologues.
      Irregular turn taking (overlapping).
      More than one speaker.
      Scanned text-image not converted to text format.

To extract AIML, corpus data must be “normalized” to make it
look like chatbot transcripts:

  1. Two speakers.

  2. Structured format.

  3. Short, obvious turns without overlapping, and without any
     unnecessary notes, extra-linguistic expressions, etc.
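Part of this normalization is removing annotation from each turn. The Python sketch below handles the annotation styles seen in the sample transcripts (angle-bracket tags, square-bracket notes, parenthesised notes and "+" overlap markers). The function name and the choice of which markers to strip are assumptions; a real corpus would need its own rules.

```python
import re

def clean_turn(text):
    """Strip extra-linguistic annotation from one turn and tidy the whitespace."""
    text = re.sub(r"<[^>]*>", " ", text)      # tags such as <SS LAUGH> or <sli>
    text = re.sub(r"\[[^\]]*\]", " ", text)   # notes such as [S2: oh ] or [Pause 17 sec]
    text = re.sub(r"\([^)]*\)", " ", text)    # notes such as (Laughter)
    text = text.replace("+", " ")             # overlap markers as in the TRAINS sample
    return " ".join(text.split())

print(clean_turn("okay <sli> um <sli> engine <sli> two"))
print(clean_turn("So if I keep my pointer there, [S2: oh ] <ROTATES CEILING> everything else moves"))
```

Note that stripping a bracketed note such as [S2: oh ] also discards the overlapping speaker's words, which is one way of forcing the two-speaker, non-overlapping shape described above.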




The Java Program


Converts the dialogue transcript to AIML format.


The output AIML is used to retrain ALICE.


The first speaker is the pattern, the second is the template.




Example from the MICAS corpus:

S1: circumpolar stars. So if I keep my pointer there, [S2: oh ]
<ROTATES CEILING> everything else moves and we all get sick.
<SS LAUGH> and we go backwards in time. And that’s even more
fun.
S2: make it go really really fast.

The AIML category generated by the program is:

<category>
<pattern> CIRCUMPOLAR STARS SO IF I KEEP MY POINTER
THERE EVERYTHING ELSE MOVES AND WE ALL GET SICK
AND WE GO BACKWARDS IN TIME AND THAT’S EVEN
MORE FUN</pattern>
<template> make it go really really fast.</template>
</category>
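The conversion shown above can be approximated in Python as follows. This is not the authors' Java program: the "SPEAKER: text" line format, the pairing of each turn with the one that follows it, and all function names are assumptions made for the sketch.

```python
import re
from xml.sax.saxutils import escape

def parse_turns(transcript):
    """Split 'SPEAKER: text' lines into [speaker, text] turns; other lines continue a turn."""
    turns = []
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        found = re.match(r"^(\w+):\s*(.*)$", line)
        if found:
            turns.append([found.group(1), found.group(2)])
        elif turns:
            turns[-1][1] += " " + line
    return turns

def strip_annotations(text):
    """Remove <...> and [...] annotations and collapse whitespace."""
    return " ".join(re.sub(r"<[^>]*>|\[[^\]]*\]", " ", text).split())

def to_pattern(text):
    """Uppercase the turn and drop punctuation (apostrophes are kept)."""
    text = re.sub(r"[^\w\s']", " ", strip_annotations(text))
    return " ".join(text.upper().split())

def to_categories(transcript):
    """Emit one category per adjacent pair of turns: first is pattern, second is template."""
    turns = parse_turns(transcript)
    blocks = []
    for (_, first), (_, second) in zip(turns, turns[1:]):
        blocks.append("<category>\n<pattern>%s</pattern>\n<template>%s</template>\n</category>"
                      % (escape(to_pattern(first)), escape(strip_annotations(second))))
    return "\n".join(blocks)

sample = """S1: circumpolar stars. So if I keep my pointer there, [S2: oh ]
<ROTATES CEILING> everything else moves and we all get sick.
<SS LAUGH> and we go backwards in time. And that's even more fun.
S2: make it go really really fast."""
aiml = to_categories(sample)
print(aiml)
```

Run on the sample turn pair, the sketch yields a single category whose pattern is the upper-cased, annotation-free first turn and whose template is the second turn, in the same shape as the category above.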
Other differences we need to “learn” :
Using Wmatrix to compare human and chatbot dialogue

 Wmatrix is a tool that provides a data-driven method for comparing
  corpora at three levels: word, part-of-speech (PoS) and semantic tag.

 The comparison results are viewed as frequency lists ordered by
  log-likelihood ratio (LL).

 LL values indicate the most important differences between
  corpora.

 Wmatrix was used to compare human-to-human dialogues
  extracted from the DDC corpora and human to computer
  dialogues extracted from chatting with ALICE.
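For readers unfamiliar with the LL statistic, the sketch below computes the log-likelihood commonly used for comparing the frequency of one item across two corpora: expected counts are derived from the pooled relative frequency, and LL = 2 * sum(O * ln(O / E)). That Wmatrix uses exactly this form is an assumption here, as are the function name and the corpus sizes in the usage line, which are rough back-calculations from the percentage columns in the word table that follows (they are not stated in the slides).

```python
import math

def log_likelihood(o1, n1, o2, n2):
    """Log-likelihood for an item seen o1 times in n1 words and o2 times in n2 words."""
    total = o1 + o2
    expected1 = n1 * total / (n1 + n2)
    expected2 = n2 * total / (n1 + n2)
    ll = 0.0
    for observed, expected in ((o1, expected1), (o2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Assumed corpus sizes (about 1128 and 5350 words), inferred from the percentages.
print(round(log_likelihood(44, 1128, 35, 5350), 2))
```

With these assumed sizes the function gives a value close to the 58.69 reported for "do"; an item with the same relative frequency in both corpora scores 0.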

ALICE and Astronomy Word Comparison


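Sorted by log-likelihood value
(O1, %1: frequency and relative frequency in the first corpus, ALICE;
 O2, %2: the same for the second corpus, Astronomy;
 +/-: over- or under-use in the first corpus relative to the second)

Item      O1     %1     O2     %2         LL
do        44    3.90    35    0.65   +   58.69
i         54    4.79    67    1.25   +   48.04
we         1    0.09   129    2.41   -   41.15
so         1    0.09   117    2.19   -   36.75
and        8    0.71   195    3.65   -   35.19
Emily      9    0.80     0    0.00   +   31.46
you       72    6.38   151    2.82   +   28.91
this       0    0.00    70    1.31   -   26.80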
ALICE and Astronomy POS Comparison


Sorted by log-likelihood value

Item      O1     %1     O2     %2         LL
PPIS1     55    4.88     0    0.00   +  192.23
VD0       43    3.81    27    0.50   +   67.27
PPIS2      1    0.09   129    2.41   -   41.15
CC        10    0.89   230    4.30   -   39.86
PPY       80    7.09   155    2.90   +   37.52
CS         4    0.35   116    2.17   -   23.31
ZZ1        0    0.00    56    1.05   -   21.44
DD1        9    0.80   151    2.82   -   19.97
ALICE and Astronomy Semantic Comparison


Sorted by log-likelihood value

Item      O1     %1     O2     %2         LL     Semantic field
Z1        34    3.01    22    0.41   +   52.21   Personal names
E2+       16    1.42     6    0.11   +   32.44   Liking
W1         2    0.18   102    1.91   -   26.27   The universe
M6         7    0.62   151    2.82   -   24.95   Location and direction
M1         0    0.00    59    1.10   -   22.59   Moving, coming
H4         8    0.71     1    0.02   +   22.06   Residence
F1         6    0.53     0    0.00   +   20.97   Food
Q2.1      25    2.22    35    0.65   +   19.27   Speech act
French word Comparison between Chatbot and real dialogue


Sorted by log-likelihood value

Item             O1     %1     O2     %2         LL
conversation      3    0.01     6    1.01   -   33.18
euh             662    2.80     0    0.00   +   32.91
danser            0    0.00     4    0.67   -   29.66
fais              0    0.00     4    0.67   -   29.66
de              463    1.96    35    5.88   -   29.16
coucher           0    0.00     3    0.50   -   22.24
football          0    0.00     3    0.50   -   22.24
Conclusions

1. We train ALICE rather than Elizabeth because AIML
   format is closer to the markup language and the simple
   pattern matching technique used by ALICE.

2. The Dialogue Diversity Corpus (DDC) illustrates huge diversity in
    dialogues: genres, speaker background/register, mark-up and
    annotation.

3. It will be useful to agree standards for transcription and
    mark-up format.

4. Wmatrix has shown further differences between chatbot and
    real dialogue.

Future Work


Expanding AIML files using the least frequent word, and
investigating how to incorporate corpus-derived linguistic
annotation into an Elizabeth-style chatbot pattern file.



