

A Task-based Framework to Evaluate Evaluative Arguments

Giuseppe Carenini
Intelligent Systems Program
University of Pittsburgh, Pittsburgh, PA 15260, USA

Abstract
We present an evaluation framework in which the effectiveness of evaluative
arguments can be measured with real users. The framework is based on the
task-efficacy evaluation method. An evaluative argument is presented in the
context of a decision task and measures related to its effectiveness are
assessed. Within this framework, we are currently running a formal
experiment to verify whether argument effectiveness can be increased by
tailoring the argument to the user and by varying the degree of argument
conciseness.

Introduction
Empirical methods are fundamental in any scientific endeavour to assess
progress and to stimulate new research questions. As the field of NLG
matures, we are witnessing a growing interest in studying empirical methods
to evaluate computational models of discourse generation (Dale, Eugenio et
al. 1998). However, with the exception of (Chu-Carroll and Carberry 1998),
little attention has been paid to the evaluation of systems generating
evaluative arguments, communicative acts that attempt to affect the
addressee’s attitudes (i.e., evaluative tendencies typically phrased in
terms of like and dislike or favor and disfavor).
The ability to generate evaluative arguments is critical in an increasing
number of online systems that serve as personal assistants, advisors, or
sales assistants [1]. For instance, a travel assistant may need to compare
two vacation packages and argue that its current user should like one more
than the other.
In this paper, we present an evaluation framework in which the
effectiveness of evaluative arguments can be measured with real users. The
measures of argument effectiveness used in our framework are based on
principles developed in social psychology to study persuasion (Miller and
Levine 1996). We are currently applying the framework to evaluate arguments
generated by an argument generator we have developed (Carenini 2000). To
facilitate the evaluation of specific aspects of the generation process,
the argument generator has been designed so that its functional components
can be easily turned off or changed.
In the remainder of the paper, we first describe our argument generator.
Then, we summarize literature on persuasion from social psychology. Next,
we discuss previous work on evaluating NLG models. Finally, we describe our
evaluation framework and the design of an experiment we are currently
running.

[1] See for instance

1     The Argument Generator
The architecture of the argument generator is a typical pipelined
architecture comprising a discourse planner, a microplanner and a sentence
realizer.
The input to the planning process is an abstract evaluative communicative
action expressing:
• The subject of the evaluation, which can be an entity or a comparison
  between two entities in the domain of interest (e.g., a house or a
  comparison between two houses in the real-estate domain).
• An evaluation, which is a number in the interval [0,1] where, depending
  on the subject, 0 means “terrible” or “much worse” and 1 means
  “excellent” or “much better”.
Given an abstract communicative action, the discourse planner (Young and
Moore 1994) selects and arranges the content of the argument
by decomposing abstract communicative actions into primitive ones and by
imposing appropriate ordering constraints among communicative actions. Two
knowledge sources are involved in this process:
• A complex model of the user’s preferences based on multiattribute
  utility theory (MAUT) (Clemen 1996).
• A set of plan operators, implementing guidelines for content selection
  and organisation from argumentation theory (Carenini and Moore 2000).
By using these two knowledge sources, the discourse planner produces a text
plan for an argument whose content and organization are tailored to the
user according to argumentation theory.
Next, the text plan is passed to the microplanner, which performs
aggregation and pronominalization and makes decisions about cue phrases.
Aggregation is performed according to heuristics similar to the ones
proposed in (Shaw 1998). For pronominalization, simple rules based on
centering are applied (Grosz, Joshi et al. 1995). Finally, decisions about
cue phrases are made according to a decision tree based on suggestions from
(Knott 1996; di Eugenio, Moore et al. 1997). The sentence realizer extends
previous work on realizing evaluative statements (Elhadad 1995).

  Figure 1 Sample arguments in order of decreasing expected effectiveness for the target user SUBJ-26

The argument generator has been designed to facilitate the testing of the
effectiveness of different aspects of the generation process. The
experimenter can easily vary the expected effectiveness of the generated
arguments by controlling whether the generator tailors the argument to the
current user, the degree of conciseness of the generated arguments and what
microplanning tasks are performed. Figure 1 shows three arguments generated
by the argument generator that clearly illustrate this feature. We expect
the first argument to be very effective for the target user. Its content
and organization have been tailored to her preferences. Also, the argument
is reasonably fluent because of aggregation, pronominalization and cue
phrases. In contrast, we expect the second argument to be less effective
with our
target user, because it is not tailored to her preferences [2], and it
appears to be somewhat too verbose [3]. Finally, we expect the third
argument not to be effective at all. It suffers from all the shortcomings
of the second argument, with the additional weakness of not being fluent
(no microplanning tasks were performed).

[2] This argument was tailored to a default average user, for whom all
aspects of a house are equally important. With respect to the first
argument, notice the different evaluation for the location and the
different order between the two text segments about location and quality.
[3] A threshold controlling verbosity was set to its maximum value.

2     Research in Psychology on Persuasion
Arguing an evaluation involves an intentional communicative act that
attempts to affect the current or future behavior of the addressees by
creating, changing or reinforcing the addressees’ attitudes. It follows
that the effectiveness of an evaluative argument can be tested by comparing
measurements of subjects' attitudes or behavior before and after their
exposure to the argument. In many experimental situations, however,
measuring effects on overt behavior can be problematic (Miller and Levine
1996); therefore, most research on persuasion has been based either on
measurements of attitudes or on declarations of behavioral intentions. The
most common technique to measure attitudes is subject self-report (Miller
and Levine 1996). Typically, self-report measurements involve the use of a
scale that consists of two "bipolar" terms (e.g., good-choice vs.
bad-choice), usually separated by seven or nine equal spaces that
participants use to evaluate an attitude or belief statement (see Figure 4
for examples).
Research in persuasion suggests that some individuals may be naturally more
resistant to persuasion than others (Miller and Levine 1996). Individual
features that seem to matter are: argumentativeness (tendency to argue)
(Infante and Rancer 1982), intelligence, self-esteem and need for cognition
(tendency to engage in and to enjoy effortful cognitive endeavours)
(Cacioppo, Petty et al. 1983). Any experiment in persuasion should control
for these variables.
A final note on the evaluation of arguments. An argument can also be
evaluated by the argument addressee with respect to several dimensions of
quality, such as coherence, content, organization, writing style and
convincingness. However, evaluations based on judgements along these
dimensions are clearly weaker than evaluations measuring actual attitudinal
and behavioral changes (Olson and Zanna 1991).

3     Evaluation of NLG Models
Several empirical methods have been proposed and applied in the literature
for evaluating NLG models. We now discuss why, among the three main
evaluation methods (i.e., human judges, corpus-based and task efficacy),
task efficacy appears to be the most appropriate for testing the
effectiveness of evaluative arguments that are tailored to a complex model
of the user’s preferences.
The human judges evaluation method requires a panel of judges to score
outputs of generation models (Chu-Carroll and Carberry 1998; Lester and
Porter 1997). The main limitation of this approach is that the input of the
generation process needs to be simple enough to be easily understood by
judges [4]. Unfortunately, this is not the case for our argument generator,
where the input consists of a possibly complex and novel argument subject
(e.g., a new house with a large number of features), and a complex model of
the user’s preferences.
The corpus-based evaluation method (Robin and McKeown 1996) can be applied
only when a corpus of input/output pairs is available. A portion of the
corpus (the training set) is used to develop a computational model of how
the output can be generated from the input. The rest of the corpus (the
testing set) is used to evaluate the model. Unfortunately, a corpus for our
generator does not exist. Furthermore, it would be difficult and extremely
time-consuming to obtain and analyze such a corpus given the complexity of
our generator input/output pairs.

[4] See (Chu-Carroll and Carberry 1998) for an illustration of how the
specification of the context can become extremely complex when human judges
are used to evaluate content selection strategies for a dialog system.
When a generator is designed to generate output for users engaged in
certain tasks, a natural way to evaluate its effectiveness is by
experimenting with users performing those tasks. For instance, in (Young,
to appear) different models for generating natural language descriptions of
plans are evaluated by measuring how effectively users execute those plans
given the descriptions. This evaluation method, called task efficacy,
allows one to evaluate a generation model without explicitly evaluating its
output but by measuring the output’s effects on users’ behaviors, beliefs
and attitudes in the context of the task. The only requirement for this
method is the specification of a sensible task.
Task efficacy is the method we have adopted in our evaluation framework.

4   The Evaluation Framework

4.1 The task
Aiming at general results, we chose a rather basic and frequent task that
has been extensively studied in decision analysis: the selection of a
subset of preferred objects (e.g., houses) out of a set of possible
alternatives by considering trade-offs among multiple objectives (e.g.,
house location, house quality). The selection is performed by evaluating
objects with respect to their values for a set of primitive attributes
(e.g., house distance from the park, size of the garden). In the evaluation
framework we have developed, the user performs this task by using a
computer environment (shown in Figure 3) that supports interactive data
exploration and analysis (IDEA) (Roth, Chuah et al. 1997). The IDEA
environment provides the user with a set of powerful visualization and
direct manipulation techniques that facilitate the user’s autonomous
exploration of the set of alternatives and the selection of the preferred
alternatives.
Let us now examine how the argument generator described in Section 1 can be
evaluated in the context of the selection task by going through the
architecture of the evaluation framework.

4.2 The framework architecture
Figure 2 shows the architecture of the evaluation framework. The framework
consists of three main sub-systems: the IDEA system, a User Model Refiner
and the Argument Generator. The framework assumes that a model of the
user’s preferences based on MAUT has been previously acquired using
traditional methods from decision theory (Edwards and Barron 1994), to
assure a reliable initial model.
At the outset, the user is assigned the task to select from the dataset the
four most preferred alternatives and to place them in the Hot List (see
Figure 3 upper right corner) ordered by preference. The IDEA system
supports the user in this task (Figure 2 (1)). As the interaction unfolds,
all user actions are monitored and collected in the User’s Action History
(Figure 2 (2a)). Whenever the user feels that she has accomplished the
task, the ordered list of preferred alternatives is saved as her
Preliminary Decision (Figure 2 (2b)). After that, this list, the User’s
Action History and the initial Model of User’s Preferences are analysed by
the User Model Refiner (Figure 2 (3)) to produce a Refined Model of the
User’s Preferences (Figure 2 (4)).
At this point, the stage is set for argument generation. Given the Refined
Model of the User’s Preferences for the target selection task, the Argument
Generator produces an evaluative argument tailored to the user’s model
(Figure 2 (5-6)). Finally, the argument is presented to the user by the
IDEA system (Figure 2 (7)).
The argument goal is to introduce a new alternative (not included in the
dataset initially presented to the user) and to persuade the user that the
alternative is worth being considered. The new alternative is designed on
the fly to be preferable for the user given her preference model. Once the
argument is presented, the user may (a) decide to introduce the new
alternative in her Hot List, or (b) decide to further explore the dataset,
possibly making changes to the Hot List and introducing the new instance in
the Hot List, or (c) do nothing. Figure 3 shows the display at the end of
the interaction, when the user, after reading the argument, has decided to
introduce the new alternative in the first position.
                            Figure 2 The evaluation framework architecture
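
As a rough illustration of the kind of preference model the framework
assumes, the following Python sketch scores an alternative with an additive
multiattribute utility function. The attribute names, weights and component
value functions are hypothetical and chosen only for illustration; the
paper does not specify them, and the model actually used by the argument
generator may be structured differently.

```python
from typing import Callable, Dict

ValueFn = Callable[[float], float]

# Hypothetical component value functions: each maps a raw attribute
# value to a score in [0, 1].
def closer_is_better(max_km: float) -> ValueFn:
    return lambda km: max(0.0, 1.0 - km / max_km)

def larger_is_better(max_size: float) -> ValueFn:
    return lambda size: min(1.0, size / max_size)

def additive_utility(alternative: Dict[str, float],
                     weights: Dict[str, float],
                     value_fns: Dict[str, ValueFn]) -> float:
    """Weighted sum of component values; weights are normalized to sum to 1."""
    total_weight = sum(weights.values())
    return sum(weights[attr] / total_weight * value_fns[attr](alternative[attr])
               for attr in weights)

# Illustrative (assumed) attributes and preference weights.
value_fns = {
    "distance_from_park_km": closer_is_better(5.0),
    "garden_size_m2": larger_is_better(200.0),
}
user_weights = {"distance_from_park_km": 3.0, "garden_size_m2": 1.0}
# A default average user weighs all attributes equally.
default_weights = {"distance_from_park_km": 1.0, "garden_size_m2": 1.0}

new_house = {"distance_from_park_km": 1.0, "garden_size_m2": 50.0}
tailored = additive_utility(new_house, user_weights, value_fns)
default = additive_utility(new_house, default_weights, value_fns)
print(f"tailored: {tailored:.4f}, default: {default:.4f}")
```

Under these assumed weights the same house receives a different overall
evaluation for the specific user than for the default average user, which
is the kind of difference a tailored argument is meant to reflect.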
Whenever the user decides to stop exploring and is satisfied and confident
with her final selections, measures related to the argument’s effectiveness
can be assessed (Figure 2 (8)). These measures are obtained either from the
record of the user interaction with the system or from user self-reports
(see Section 2).
First, and most important, are measures of behavioral intentions and
attitude change: (a) whether or not the user adopts the new proposed
alternative, (b) in which position in the Hot List she places it, (c) how
much she likes it, (d) whether or not the user revises the Hot List and (e)
how much the user likes the objects in the Hot List. Second, a measure can
be obtained of the user’s confidence that she has selected the best for her
in the set of alternatives. Third, a measure of argument effectiveness can
also be derived by explicitly questioning the user at the end of the
interaction about the rationale for her decision. This can be done either
by asking the user to justify her decision in a written paragraph, or by
asking the user to self-report for each attribute of the new house how
important the attribute was in her decision (Olson and Zanna 1991). Both
methods can provide valuable information on what aspects of the argument
were more influential (i.e., better understood and accepted by the user).
A fourth measure of argument effectiveness is to explicitly ask the user at
the end of the interaction to judge the argument with respect to several
dimensions of quality, such as content, organization, writing style and
convincingness. Evaluations based on judgments along these dimensions are
clearly weaker than evaluations measuring actual behavioural and
attitudinal changes (Olson and Zanna 1991). However, these judgments may
provide more information than judgments from independent judges (as in the
“human judges” method discussed in Section 3), because they are performed
by the addressee of the argument, when the experience of the task is still
vivid in her memory.
To summarize, the evaluation framework just described supports users in
performing a realistic task at their own pace by interacting with an IDEA
system. In the context of this task, an evaluative argument is generated
and measurements related to its effectiveness can be performed.
In the next section, we discuss an experiment that we are currently running
by using the evaluation framework.



                  Figure 3 The IDEA environment display at the end of the interaction
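
The behavioral measures (a), (b) and (d) listed in Section 4.2 could be
derived from the recorded Hot Lists roughly as follows. This is a minimal
sketch: the function name, the dictionary keys, the house identifiers and
the representation of a Hot List as an ordered list of identifiers are all
assumptions, as is the particular definition of a "revised" list.

```python
from typing import Dict, List

def behavioral_measures(preliminary: List[str], final: List[str],
                        new_alternative: str) -> Dict[str, object]:
    """Derive measures (a), (b) and (d) from two ordered Hot Lists."""
    # (a) Did the user adopt the new alternative?
    adopted = new_alternative in final
    # (b) 1-based rank of the new alternative in the final list, if adopted.
    position = final.index(new_alternative) + 1 if adopted else None
    # (d) Assumption: the list counts as revised only if the previously
    # chosen alternatives changed or were reordered, beyond making room
    # for the new alternative.
    kept = [item for item in final if item != new_alternative]
    revised = kept != preliminary[:len(kept)]
    return {"adopted": adopted, "position": position, "revised": revised}

# Hypothetical Hot Lists before and after the argument is presented.
preliminary = ["house-12", "house-7", "house-31", "house-4"]
final = ["new-house", "house-12", "house-7", "house-31"]
measures = behavioral_measures(preliminary, final, "new-house")
print(measures)
```

Measures (c) and (e) are self-reported attitudes and would come from the
questionnaire rather than from the interaction record.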

5   The Experiment
As explained in Section 1, the argument generator has been designed to
facilitate testing of the effectiveness of different aspects of the
generation process. The experimenter can easily control whether the
generator tailors the argument to the current user, the degree of
conciseness of the argument, and what microplanning tasks are performed. In
our initial experiment, because of limited financial and human resources,
we focus on the first two aspects for arguments about a single entity. This
is not because we are uninterested in the effectiveness of performing
microplanning tasks, but because we consider the effectiveness of tailoring
and conciseness somewhat more difficult, and therefore more interesting, to
prove.
Thus, we designed a between-subjects experiment with four experimental
conditions:
No-Argument - subjects are simply informed that a new house came on the
market.
Tailored-Concise - subjects are presented with an evaluation of the new
house tailored to their preferences and at a level of conciseness that we
hypothesize to be optimal.
Non-Tailored-Concise - subjects are presented with an evaluation of the new
house which is not tailored to their preferences [5], but is at a level of
conciseness that we hypothesize to be optimal.
Tailored-Verbose - subjects are presented with an evaluation of the new
house tailored to their preferences, but at a level of conciseness that we
hypothesize to be too low.

[5] The evaluative argument is tailored to a default average user, for whom
all aspects of a house are equally important.
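
One way to realize a between-subjects design over these four conditions is
a balanced random assignment, sketched below in Python. The subject
identifiers and the seed are illustrative assumptions, and the sketch is
not the procedure actually used in the experiment; the group size of 10
matches the formal experiment described later in this section.

```python
import random
from collections import Counter
from typing import Dict, List

CONDITIONS = ["No-Argument", "Tailored-Concise",
              "Non-Tailored-Concise", "Tailored-Verbose"]

def assign_conditions(subjects: List[str], per_condition: int,
                      seed: int = 0) -> Dict[str, str]:
    """Randomly assign subjects so that every condition gets the same count."""
    if len(subjects) != per_condition * len(CONDITIONS):
        raise ValueError("number of subjects must equal per_condition * 4")
    # One slot per subject, with each condition repeated per_condition times.
    slots = [cond for cond in CONDITIONS for _ in range(per_condition)]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    rng.shuffle(slots)
    return dict(zip(subjects, slots))

# Hypothetical subject identifiers.
subjects = [f"SUBJ-{i}" for i in range(1, 41)]
assignment = assign_conditions(subjects, per_condition=10)
print(Counter(assignment.values()))
```

Shuffling a fixed pool of condition slots keeps the groups equal in size,
which independent per-subject random draws would not guarantee.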
          a) How would you judge the houses in your Hot List?
          The more you like the house the closer you should put a cross to “good choice”
          1st house
          bad choice : ___ : ___ : ___ : ___ : __ : ___ : ___ : ___ : ___ : good choice
          2nd house
          bad choice : ___ : ___ : ___ : ___ : __ : ___ : ___ : ___ : ___ : good choice
          3rd house
          bad choice : ___ : ___ : ___ : ___ : __ : ___ : ___ : ___ : ___ : good choice
          4th house
          bad choice : ___ : ___ : ___ : ___ : __ : ___ : ___ : ___ : ___ : good choice
          b) How sure are you that you have selected the four best houses among the ones available?
          Unsure : ___ : ___ : ___ : ___ : __ : ___ : ___ : ___ : ___ : Sure

          Figure 4 Excerpt from questionnaire that subjects fill out at the end of the interaction
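
Responses on scales like those in Figure 4 are commonly coded as integers
before analysis. The sketch below assumes a nine-point coding from 1 (the
"bad choice" or "Unsure" end) to 9 (the "good choice" or "Sure" end), and a
response recorded as nine cells with exactly one cross. Both the coding and
the input format are assumptions made for illustration.

```python
from typing import List, Sequence

SCALE_POINTS = 9  # number of spaces between the two bipolar terms in Figure 4

def score_response(cells: Sequence[bool]) -> int:
    """Return the 1-based position of the single cross on the scale."""
    if len(cells) != SCALE_POINTS:
        raise ValueError(f"expected {SCALE_POINTS} cells, got {len(cells)}")
    marked = [i for i, crossed in enumerate(cells) if crossed]
    if len(marked) != 1:
        raise ValueError("a response must contain exactly one cross")
    return marked[0] + 1

def mean_score(responses: List[Sequence[bool]]) -> float:
    """Average coded score, e.g. over the four houses in the Hot List."""
    return sum(score_response(r) for r in responses) / len(responses)

def cross_at(position: int) -> List[bool]:
    """Helper for the example: a response with a cross at a 1-based position."""
    return [i == position - 1 for i in range(SCALE_POINTS)]

# Hypothetical answers to question (a) for the four houses.
hot_list_responses = [cross_at(8), cross_at(7), cross_at(7), cross_at(6)]
print(mean_score(hot_list_responses))
```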
In the four conditions, all the information about the new house is also
presented graphically. Our hypotheses on the outcomes of the experiment can
be summarized as follows. We expect arguments generated for the
Tailored-Concise condition to be more effective than arguments generated
for both the Non-Tailored-Concise and Tailored-Verbose conditions. We also
expect the Tailored-Concise condition to be somewhat better than the
No-Argument condition, but to a lesser extent, because subjects, in the
absence of any argument, may spend more time further exploring the dataset,
therefore reaching a more informed and balanced decision. Finally, we do
not have strong hypotheses on comparisons of argument effectiveness among
the No-Argument, Non-Tailored-Concise and Tailored-Verbose conditions.
The design of our evaluation framework and consequently the design of this
experiment take into account that the effectiveness of arguments is
determined not only by the argument itself, but also by the user’s traits
such as argumentativeness, need for cognition, self-esteem and intelligence
(as described in Section 2). Furthermore, we assume that argument
effectiveness can be measured by means of the behavioral intentions and
self-reports described in Section 4.2.
The experiment is organized in two phases. In the first phase, the subject
fills out three questionnaires on the Web. One questionnaire implements a
method from decision theory to acquire a model of the subject’s preferences
(Edwards and Barron 1994). The second questionnaire assesses the subject’s
argumentativeness (Infante and Rancer 1982). The last one assesses the
subject’s need for cognition (Cacioppo, Petty et al. 1984). In the second
phase of the experiment, to control for other possible confounding
variables (including intelligence and self-esteem), the subject is randomly
assigned to one of the four conditions. Then, the subject interacts with
the evaluation framework and at the end of the interaction measures of the
argument effectiveness are collected. Some details on measures based on
subjects’ self-reports can be examined in Figure 4, which shows an excerpt
from the final questionnaire that subjects are asked to fill out at the end
of the interaction.
After running the experiment with 8 pilot subjects to refine and improve
the experimental procedure, we are currently running a formal experiment
involving 40 subjects, 10 in each experimental condition.

Future Work
In this paper, we propose a task-based framework to evaluate evaluative
arguments. We are currently using this framework to run a formal experiment
to evaluate arguments about a single entity. However, this is only a first
step. The power of the framework is that it enables the design and
execution of many different experiments about evaluative arguments. The
goal of our current experiment is to verify whether tailoring an evaluative
argument to the user and varying the degree of argument conciseness
influence argument effectiveness. We envision further experiments along the
following lines.
In the short term, we plan to study more complex arguments, including
comparisons between two entities, as well as comparisons between mixtures
of entities and sets of entities. One experiment could assess the influence
of tailoring and conciseness on the effectiveness of these more complex
arguments. Another possible experiment could compare different
argumentative strategies for selecting and organizing the content of these
arguments. In the long term, we intend to evaluate techniques to generate
evaluative arguments that combine natural language and information graphics
(e.g., maps, tables, charts).

Acknowledgements
My thanks go to the members of the Autobrief project: J. Moore, S. Roth, N.
Green, S. Kerpedjiev and J. Mattis. I also thank C. Conati for comments on
drafts of this paper. This work was supported by grant number
DAA-1593K0005 from the Advanced Research Projects Agency (ARPA). Its
contents are solely the responsibility of the author.

References
Cacioppo, J. T., R. E. Petty, et al. (1984). The Efficient Assessment of
  Need for Cognition. Journal of Personality Assessment 48(3): 306-307.
Cacioppo, J. T., R. E. Petty, et al. (1983). Effects of Need for Cognition
  on Message Evaluation, Recall, and Persuasion. Journal of Personality and
  Social Psychology 45(4): 805-818.
Carenini, G. (2000). Evaluating Multimedia Interactive Arguments in the
  Context of Data Exploration Tasks. PhD Thesis, Intelligent Systems
  Program, University of Pittsburgh.
Carenini, G. and J. Moore (2000). A Strategy for Evaluating Evaluative
  Arguments. Int. Conference on NLG, Mitzpe Ramon, Israel.
Chu-Carroll, J. and S. Carberry (1998). Collaborative Response Generation
  in Planning Dialogues. Computational Linguistics 24(2): 355-400.
Clemen, R. T. (1996). Making Hard Decisions: An Introduction to Decision
  Analysis. Belmont, California, Duxbury Press.
Dale, R., B. di Eugenio, et al. (1998). Introduction to the Special Issue
  on NLG. Computational Linguistics 24(3): 345-353.
Edwards, W. and F. H. Barron (1994). SMARTS and SMARTER: Improved Simple
  Methods for Multiattribute Utility Measurements. Organizational Behavior
  and Human Decision Processes 60: 306-325.
Elhadad, M. (1995). Using argumentation in text generation. Journal of
  Pragmatics 24: 189-220.
Eugenio, B. D., J. Moore, et al. (1997). Learning Features that Predict Cue
  Usage. ACL97, Madrid, Spain.
Grosz, B. J., A. K. Joshi, et al. (1995). Centering: A Framework for
  Modelling the Local Coherence of Discourse. Computational Linguistics
  21(2): 203-226.
Infante, D. A. and A. S. Rancer (1982). A Conceptualization and Measure of
  Argumentativeness. Journal of Personality Assessment 46: 72-80.
Knott, A. (1996). A Data-Driven Methodology for Motivating a Set of
  Coherence Relations, University of Edinburgh.
Lester, J. C. and B. W. Porter (1997). Developing and Empirically
  Evaluating Robust Explanation Generators: The KNIGHT Experiments.
  Computational Linguistics 23(1): 65-101.
Miller, M. D. and T. R. Levine (1996). Persuasion. An Integrated Approach
  to Communication Theory and Research. M. B. Salwen and D. W. Stack.
  Mahwah, New Jersey: 261-276.
Olson, J. M. and M. P. Zanna (1991). Attitudes and beliefs; Attitude change
  and attitude-behavior consistency. Social Psychology. R. M. Baron and W.
  G. Graziano.
Robin, J. and K. McKeown (1996). Empirically Designing and Evaluating a New
  Revision-Based Model for Summary Generation. Artificial Intelligence
  Journal 85: 135-179.
Roth, S. F., M. C. Chuah, et al. (1997). Towards an Information
  Visualization Workspace: Combining Multiple Means of Expression.
  Human-Computer Interaction Journal 12(1 & 2): 131-185.
Shaw, J. (1998). Clause Aggregation Using Linguistic Knowledge. 9th Int.
  Workshop on NLG, Niagara-on-the-Lake, Canada.
Young, M. R. Using Grice's Maxim of Quantity to Select the Content of Plan
  Descriptions. Artificial Intelligence Journal, to appear.
Young, M. R. and J. D. Moore (1994). Does Discourse Planning Require a
  Special-Purpose Planner? Proceedings of the AAAI-94 Workshop on Planning
  for Interagent Communication. Seattle,
