A Task-based Framework to Evaluate Evaluative Arguments

Giuseppe Carenini
Intelligent Systems Program
University of Pittsburgh, Pittsburgh, PA 15260, USA
firstname.lastname@example.org

Abstract

We present an evaluation framework in which the effectiveness of evaluative arguments can be measured with real users. The framework is based on the task-efficacy evaluation method. An evaluative argument is presented in the context of a decision task and measures related to its effectiveness are assessed. Within this framework, we are currently running a formal experiment to verify whether argument effectiveness can be increased by tailoring the argument to the user and by varying the degree of argument conciseness.

Introduction

Empirical methods are fundamental in any scientific endeavour to assess progress and to stimulate new research questions. As the field of NLG matures, we are witnessing a growing interest in studying empirical methods to evaluate computational models of discourse generation (Dale, di Eugenio et al. 1998). However, with the exception of (Chu-Carroll and Carberry 1998), little attention has been paid to the evaluation of systems generating evaluative arguments, communicative acts that attempt to affect the addressee's attitudes (i.e., evaluative tendencies typically phrased in terms of like and dislike, or favor and disfavor).

The ability to generate evaluative arguments is critical in an increasing number of online systems that serve as personal assistants, advisors, or sales assistants (see, for instance, www.activebuyersguide.com). For example, a travel assistant may need to compare two vacation packages and argue that its current user should like one more than the other.

In this paper, we present an evaluation framework in which the effectiveness of evaluative arguments can be measured with real users. The measures of argument effectiveness used in our framework are based on principles developed in social psychology to study persuasion (Miller and Levine 1996). We are currently applying the framework to evaluate arguments generated by an argument generator we have developed (Carenini 2000). To facilitate the evaluation of specific aspects of the generation process, the argument generator has been designed so that its functional components can be easily turned off or changed.

In the remainder of the paper, we first describe our argument generator. Then, we summarize literature on persuasion from social psychology. Next, we discuss previous work on evaluating NLG models. Finally, we describe our evaluation framework and the design of an experiment we are currently running.

1 The Argument Generator

The architecture of the argument generator is a typical pipelined architecture comprising a discourse planner, a microplanner and a sentence realizer.

The input to the planning process is an abstract evaluative communicative action expressing:
- The subject of the evaluation, which can be an entity or a comparison between two entities in the domain of interest (e.g., a house or a comparison between two houses in the real-estate domain).
- An evaluation, which is a number in the interval [0,1] where, depending on the subject, 0 means "terrible" or "much worse" and 1 means "excellent" or "much better".
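To make the shape of this input concrete, the sketch below shows one possible Python encoding of an abstract evaluative communicative action. The class and field names are our own illustration; the paper does not commit to a concrete data structure.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class EvaluativeAction:
        """Abstract evaluative communicative action (illustrative encoding)."""
        subject: str                       # entity to evaluate, e.g. "house-2-13"
        compared_to: Optional[str] = None  # second entity, if the subject is a comparison
        evaluation: float = 0.5            # in [0,1]: 0 = "terrible"/"much worse",
                                           # 1 = "excellent"/"much better"

        def __post_init__(self):
            if not 0.0 <= self.evaluation <= 1.0:
                raise ValueError("evaluation must lie in the interval [0,1]")

    # e.g., argue that a single house is an excellent choice
    act = EvaluativeAction(subject="house-2-13", evaluation=0.9)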
Given an abstract communicative action, the discourse planner (Young and Moore 1994) selects and arranges the content of the argument by decomposing abstract communicative actions into primitive ones and by imposing appropriate ordering constraints among communicative actions. Two knowledge sources are involved in this process:
- A complex model of the user's preferences based on multiattribute utility theory (MAUT) (Clemen 1996).
- A set of plan operators, implementing guidelines for content selection and organization from argumentation theory (Carenini and Moore 2000).

By using these two knowledge sources, the discourse planner produces a text plan for an argument whose content and organization are tailored to the user according to argumentation theory.

Next, the text plan is passed to the microplanner, which performs aggregation and pronominalization and makes decisions about cue phrases. Aggregation is performed according to heuristics similar to the ones proposed in (Shaw 1998). For pronominalization, simple rules based on centering are applied (Grosz, Joshi et al. 1995). Finally, decisions about cue phrases are made according to a decision tree based on suggestions from (Knott 1996; di Eugenio, Moore et al. 1997). The sentence realizer extends previous work on realizing evaluative statements (Elhadad 1995).

The argument generator has been designed to facilitate the testing of the effectiveness of different aspects of the generation process. The experimenter can easily vary the expected effectiveness of the generated arguments by controlling whether the generator tailors the argument to the current user, the degree of conciseness of the generated arguments, and what microplanning tasks are performed.

[Figure 1: Sample arguments in order of decreasing expected effectiveness for the target user SUBJ-26]

Figure 1 shows three arguments generated by the argument generator that clearly illustrate this feature. We expect the first argument to be very effective for the target user: its content and organization have been tailored to her preferences, and the argument is reasonably fluent because of aggregation, pronominalization and cue phrases. In contrast, we expect the second argument to be less effective with our target user, because it is not tailored to her preferences (it was tailored to a default average user, for whom all aspects of a house are equally important; with respect to the first argument, notice the different evaluation of the location and the different order of the two text segments about location and quality), and because it appears to be somewhat too verbose (a threshold controlling verbosity was set to its maximum value). Finally, we expect the third argument not to be effective at all. It suffers from all the shortcomings of the second argument, with the additional weakness of not being fluent (no microplanning tasks were performed).

A final note on the evaluation of arguments: an argument can also be evaluated by the argument addressee with respect to several dimensions of quality, such as coherence, content, organization, writing style and convincingness. However, evaluations based on judgements along these dimensions are clearly weaker than evaluations measuring actual attitudinal and behavioral changes (Olson and Zanna 1991).
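Since content selection is driven by a MAUT model of the user's preferences, it may help to recall the standard additive form of such a model: the overall value of an alternative is a weighted sum of per-attribute value functions. The sketch below illustrates this computation; the attributes, weights and value functions are illustrative only and are not taken from the paper.

    # Additive MAUT model: overall value = sum of weight_i * v_i(attribute_i).
    def maut_value(house, weights, value_functions):
        """Overall value in [0,1] of one alternative under the user model."""
        return sum(w * value_functions[attr](house[attr])
                   for attr, w in weights.items())

    # Illustrative user model (weights sum to 1).
    weights = {"distance_from_park": 0.2, "garden_size": 0.3, "quality": 0.5}
    value_functions = {
        "distance_from_park": lambda d: max(0.0, 1.0 - d / 5.0),  # miles
        "garden_size": lambda s: min(1.0, s / 1000.0),            # square feet
        "quality": lambda q: q / 10.0,                            # 0-10 rating
    }

    house = {"distance_from_park": 1.0, "garden_size": 600.0, "quality": 8.0}
    print(maut_value(house, weights, value_functions))  # 0.74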
2 Research in Psychology on Persuasion

Arguing an evaluation involves an intentional communicative act that attempts to affect the current or future behavior of the addressees by creating, changing or reinforcing the addressees' attitudes. It follows that the effectiveness of an evaluative argument can be tested by comparing measurements of subjects' attitudes or behavior before and after their exposure to the argument. In many experimental situations, however, measuring effects on overt behavior can be problematic (Miller and Levine 1996); therefore, most research on persuasion has been based either on measurements of attitudes or on declarations of behavioral intentions.

The most common technique to measure attitudes is subject self-report (Miller and Levine 1996). Typically, self-report measurements involve the use of a scale that consists of two "bipolar" terms (e.g., good choice vs. bad choice), usually separated by seven or nine equal spaces that participants use to evaluate an attitude or belief statement (see Figure 4 for examples).

Research in persuasion suggests that some individuals may be naturally more resistant to persuasion than others (Miller and Levine 1996). Individual features that seem to matter are: argumentativeness, the tendency to argue (Infante and Rancer 1982); intelligence; self-esteem; and need for cognition, the tendency to engage in and to enjoy effortful cognitive endeavours (Cacioppo, Petty et al. 1983). Any experiment in persuasion should control for these variables.
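Both the attitude scales and the individual-difference instruments mentioned above are typically scored by mapping a mark on a bipolar scale to a number. As a small illustration, the helper below maps a cross placed in one of nine spaces onto [0,1]; this particular mapping convention is ours, not one prescribed by the persuasion literature.

    def score_bipolar(position: int, n_spaces: int = 9) -> float:
        """Map a cross in space 1..n_spaces (1 = negative pole,
        n_spaces = positive pole) onto the interval [0,1]."""
        if not 1 <= position <= n_spaces:
            raise ValueError("position must be between 1 and n_spaces")
        return (position - 1) / (n_spaces - 1)

    # A cross in the 7th of nine spaces between "bad choice" and "good choice".
    print(score_bipolar(7))  # 0.75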
3 Evaluation of NLG Models

Several empirical methods have been proposed and applied in the literature for evaluating NLG models. We now discuss why, among the three main evaluation methods (i.e., human judges, corpus-based and task efficacy), task efficacy appears to be the most appropriate for testing the effectiveness of evaluative arguments that are tailored to a complex model of the user's preferences.

The human judges evaluation method requires a panel of judges to score outputs of generation models (Chu-Carroll and Carberry 1998; Lester and Porter 1997). The main limitation of this approach is that the input of the generation process needs to be simple enough to be easily understood by judges (see (Chu-Carroll and Carberry 1998) for an illustration of how the specification of the context can become extremely complex when human judges are used to evaluate content selection strategies for a dialog system). Unfortunately, this is not the case for our argument generator, where the input consists of a possibly complex and novel argument subject (e.g., a new house with a large number of features) and a complex model of the user's preferences.

The corpus-based evaluation method (Robin and McKeown 1996) can be applied only when a corpus of input/output pairs is available. A portion of the corpus (the training set) is used to develop a computational model of how the output can be generated from the input. The rest of the corpus (the testing set) is used to evaluate the model. Unfortunately, a corpus for our generator does not exist. Furthermore, it would be difficult and extremely time-consuming to obtain and analyze such a corpus, given the complexity of our generator's input/output pairs.

When a generator is designed to generate output for users engaged in certain tasks, a natural way to evaluate its effectiveness is by experimenting with users performing those tasks. For instance, in (Young, to appear) different models for generating natural language descriptions of plans are evaluated by measuring how effectively users execute those plans given the descriptions. This evaluation method, called task efficacy, allows one to evaluate a generation model without explicitly evaluating its output, but by measuring the output's effects on users' behaviors, beliefs and attitudes in the context of the task. The only requirement for this method is the specification of a sensible task. Task efficacy is the method we have adopted in our evaluation framework.

4 The Evaluation Framework

4.1 The task

Aiming at general results, we chose a rather basic and frequent task that has been extensively studied in decision analysis: the selection of a subset of preferred objects (e.g., houses) out of a set of possible alternatives by considering trade-offs among multiple objectives (e.g., house location, house quality). The selection is performed by evaluating objects with respect to their values for a set of primitive attributes (e.g., house distance from the park, size of the garden).

In the evaluation framework we have developed, the user performs this task by using a computer environment (shown in Figure 3) that supports interactive data exploration and analysis (IDEA) (Roth, Chuah et al. 1997). The IDEA environment provides the user with a set of powerful visualization and direct manipulation techniques that facilitate the user's autonomous exploration of the set of alternatives and the selection of the preferred alternatives.
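To give a concrete feel for the task, the sketch below computes the selection a perfectly informed user would make: the k most preferred alternatives under a given utility function, ordered by preference. This is only our idealization for illustration; real subjects construct their selection interactively in the IDEA environment.

    from typing import Callable, Dict, List

    House = Dict[str, float]  # attribute name -> attribute value

    def ideal_selection(alternatives: List[House],
                        utility: Callable[[House], float],
                        k: int = 4) -> List[House]:
        """Return the k highest-utility alternatives, best first."""
        return sorted(alternatives, key=utility, reverse=True)[:k]

    # e.g., using the illustrative MAUT model sketched in Section 1:
    # ideal_selection(houses, lambda h: maut_value(h, weights, value_functions))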
Let us now examine how the argument generator described in Section 1 can be evaluated in the context of the selection task, by going through the architecture of the evaluation framework.

4.2 The framework architecture

[Figure 2: The evaluation framework architecture]

Figure 2 shows the architecture of the evaluation framework. The framework consists of three main sub-systems: the IDEA system, a User Model Refiner and the Argument Generator. The framework assumes that a model of the user's preferences based on MAUT has been previously acquired using traditional methods from decision theory (Edwards and Barron 1994), to assure a reliable initial model.

At the onset, the user is assigned the task of selecting from the dataset the four most preferred alternatives and placing them in the Hot List (see Figure 3, upper right corner) ordered by preference. The IDEA system supports the user in this task (Figure 2 (1)). As the interaction unfolds, all user actions are monitored and collected in the User's Action History (Figure 2 (2a)). Whenever the user feels that she has accomplished the task, the ordered list of preferred alternatives is saved as her Preliminary Decision (Figure 2 (2b)). After that, this list, the User's Action History and the initial Model of the User's Preferences are analysed by the User Model Refiner (Figure 2 (3)) to produce a Refined Model of the User's Preferences (Figure 2 (4)).

At this point, the stage is set for argument generation. Given the Refined Model of the User's Preferences for the target selection task, the Argument Generator produces an evaluative argument tailored to the user's model (Figure 2 (5-6)). Finally, the argument is presented to the user by the IDEA system (Figure 2 (7)). The argument's goal is to introduce a new alternative (not included in the dataset initially presented to the user) and to persuade the user that the alternative is worth considering. The new alternative is designed on the fly to be preferable for the user given her preference model. Once the argument is presented, the user may (a) decide to introduce the new alternative in her Hot List, (b) decide to further explore the dataset, possibly making changes to the Hot List and introducing the new instance in the Hot List, or (c) do nothing.

[Figure 3: The IDEA environment display at the end of the interaction]

Figure 3 shows the display at the end of the interaction, when the user, after reading the argument, has decided to introduce the new alternative in the first position.

Whenever the user decides to stop exploring and is satisfied and confident with her final selections, measures related to the argument's effectiveness can be assessed (Figure 2 (8)). These measures are obtained either from the record of the user's interaction with the system or from user self-reports (see Section 2).

First, and most important, are measures of behavioral intentions and attitude change: (a) whether or not the user adopts the new proposed alternative, (b) in which position in the Hot List she places it, (c) how much she likes it, (d) whether or not the user revises the Hot List and (e) how much the user likes the objects in the Hot List. Second, a measure can be obtained of the user's confidence that she has selected the best alternatives for her in the set of alternatives. Third, a measure of argument effectiveness can also be derived by explicitly questioning the user at the end of the interaction about the rationale for her decision. This can be done either by asking the user to justify her decision in a written paragraph, or by asking the user to self-report, for each attribute of the new house, how important the attribute was in her decision (Olson and Zanna 1991). Both methods can provide valuable information on what aspects of the argument were more influential (i.e., better understood and accepted by the user). A fourth measure of argument effectiveness is to explicitly ask the user at the end of the interaction to judge the argument with respect to several dimensions of quality, such as content, organization, writing style and convincingness. Evaluations based on judgments along these dimensions are clearly weaker than evaluations measuring actual behavioural and attitudinal changes (Olson and Zanna 1991). However, these judgments may provide more information than judgments from independent judges (as in the "human judges" method discussed in Section 3), because they are performed by the addressee of the argument, when the experience of the task is still vivid in her memory.
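As one illustrative way of organizing these measures, the record below gathers them in a single structure. The field names and types are ours; the paper enumerates the measures but does not describe a concrete logging schema.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class EffectivenessMeasures:
        adopted_new_alternative: bool   # (a) did the user adopt the new house?
        hot_list_position: int          # (b) 1..4 if adopted, else 0
        liking_new_alternative: float   # (c) self-report mapped to [0,1]
        revised_hot_list: bool          # (d) other changes to the Hot List?
        liking_hot_list: List[float]    # (e) one rating per Hot List house
        decision_confidence: float      # "unsure"/"sure" item (Figure 4 (b))
        written_rationale: str          # free-text justification, if collected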
To summarize, the evaluation framework just described supports users in performing a realistic task at their own pace by interacting with an IDEA system. In the context of this task, an evaluative argument is generated and measurements related to its effectiveness can be performed. In the next section, we discuss an experiment that we are currently running by using the evaluation framework.

5 The Experiment

As explained in Section 1, the argument generator has been designed to facilitate testing of the effectiveness of different aspects of the generation process. The experimenter can easily control whether the generator tailors the argument to the current user, the degree of conciseness of the argument, and what microplanning tasks are performed. In our initial experiment, because of limited financial and human resources, we focus on the first two aspects for arguments about a single entity. This is not because we are uninterested in the effectiveness of performing microplanning tasks, but because we consider the effectiveness of tailoring and conciseness somewhat more difficult, and therefore more interesting, to prove.

Thus, we designed a between-subjects experiment with four experimental conditions:
- No-Argument: subjects are simply informed that a new house came on the market.
- Tailored-Concise: subjects are presented with an evaluation of the new house tailored to their preferences and at a level of conciseness that we hypothesize to be optimal.
- Non-Tailored-Concise: subjects are presented with an evaluation of the new house which is not tailored to their preferences (it is tailored to a default average user, for whom all aspects of a house are equally important), but is at a level of conciseness that we hypothesize to be optimal.
- Tailored-Verbose: subjects are presented with an evaluation of the new house tailored to their preferences, but at a level of conciseness that we hypothesize to be too low.

In the four conditions, all the information about the new house is also presented graphically.

Our hypotheses on the outcomes of the experiment can be summarized as follows. We expect arguments generated for the Tailored-Concise condition to be more effective than arguments generated for both the Non-Tailored-Concise and Tailored-Verbose conditions. We also expect the Tailored-Concise condition to be somewhat better than the No-Argument condition, but to a lesser extent, because subjects, in the absence of any argument, may spend more time further exploring the dataset, therefore reaching a more informed and balanced decision. Finally, we do not have strong hypotheses on comparisons of argument effectiveness among the No-Argument, Non-Tailored-Concise and Tailored-Verbose conditions.

The design of our evaluation framework, and consequently the design of this experiment, take into account that the effectiveness of arguments is determined not only by the argument itself, but also by user traits such as argumentativeness, need for cognition, self-esteem and intelligence (as described in Section 2). Furthermore, we assume that argument effectiveness can be measured by means of the behavioral intentions and self-reports described in Section 4.2.

The experiment is organized in two phases. In the first phase, the subject fills out three questionnaires on the Web. One questionnaire implements a method from decision theory to acquire a model of the subject's preferences (Edwards and Barron 1994). The second questionnaire assesses the subject's argumentativeness (Infante and Rancer 1982). The last one assesses the subject's need for cognition (Cacioppo, Petty et al. 1984). In the second phase of the experiment, to control for other possible confounding variables (including intelligence and self-esteem), the subject is randomly assigned to one of the four conditions. Then, the subject interacts with the evaluation framework, and at the end of the interaction measures of argument effectiveness are collected. Some details on measures based on subjects' self-reports can be examined in Figure 4, which shows an excerpt from the final questionnaire that subjects are asked to fill out at the end of the interaction.

[Figure 4: Excerpt from the questionnaire that subjects fill out at the end of the interaction]
a) How would you judge the houses in your Hot List? The more you like the house, the closer you should put a cross to "good choice".
   1st house:  bad choice : ___ : ___ : ___ : ___ : ___ : ___ : ___ : ___ : ___ : good choice
   2nd house:  bad choice : ___ : ___ : ___ : ___ : ___ : ___ : ___ : ___ : ___ : good choice
   3rd house:  bad choice : ___ : ___ : ___ : ___ : ___ : ___ : ___ : ___ : ___ : good choice
   4th house:  bad choice : ___ : ___ : ___ : ___ : ___ : ___ : ___ : ___ : ___ : good choice
b) How sure are you that you have selected the four best houses among the ones available?
   unsure : ___ : ___ : ___ : ___ : ___ : ___ : ___ : ___ : ___ : sure

After running the experiment with 8 pilot subjects to refine and improve the experimental procedure, we are currently running a formal experiment involving 40 subjects, 10 in each experimental condition.
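As an illustration of balanced random assignment for this between-subjects design (40 subjects, 10 per condition), a simple shuffled-slots scheme could look as follows. The paper does not specify the actual randomization mechanism, so this is only a sketch of one standard approach.

    import random

    CONDITIONS = ["No-Argument", "Tailored-Concise",
                  "Non-Tailored-Concise", "Tailored-Verbose"]

    def assign_conditions(n_subjects: int = 40, seed: int = 0) -> list:
        """Randomly assign subjects to conditions, equally many per condition."""
        assert n_subjects % len(CONDITIONS) == 0
        slots = CONDITIONS * (n_subjects // len(CONDITIONS))
        random.Random(seed).shuffle(slots)
        return slots  # slots[i] is the condition of subject i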
Future Work

In this paper, we propose a task-based framework to evaluate evaluative arguments. We are currently using this framework to run a formal experiment to evaluate arguments about a single entity. However, this is only a first step. The power of the framework is that it enables the design and execution of many different experiments about evaluative arguments. The goal of our current experiment is to verify whether tailoring an evaluative argument to the user and varying the degree of argument conciseness influence argument effectiveness. We envision further experiments along the following lines.

In the short term, we plan to study more complex arguments, including comparisons between two entities, as well as comparisons between mixtures of entities and sets of entities. One experiment could assess the influence of tailoring and conciseness on the effectiveness of these more complex arguments. Another possible experiment could compare different argumentative strategies for selecting and organizing the content of these arguments. In the long term, we intend to evaluate techniques to generate evaluative arguments that combine natural language and information graphics (e.g., maps, tables, charts).

Acknowledgements

My thanks go to the members of the Autobrief project: J. Moore, S. Roth, N. Green, S. Kerpedjiev and J. Mattis. I also thank C. Conati for comments on drafts of this paper. This work was supported by grant number DAA-1593K0005 from the Advanced Research Projects Agency (ARPA). Its contents are solely the responsibility of the author.

References

Cacioppo, J. T., R. E. Petty, et al. (1983). Effects of Need for Cognition on Message Evaluation, Recall, and Persuasion. Journal of Personality and Social Psychology 45(4): 805-818.

Cacioppo, J. T., R. E. Petty, et al. (1984). The Efficient Assessment of Need for Cognition. Journal of Personality Assessment 48(3): 306-307.

Carenini, G. (2000). Evaluating Multimedia Interactive Arguments in the Context of Data Exploration Tasks. PhD Thesis, Intelligent Systems Program, University of Pittsburgh.

Carenini, G. and J. Moore (2000). A Strategy for Evaluating Evaluative Arguments. Int. Conference on NLG, Mitzpe Ramon, Israel.

Chu-Carroll, J. and S. Carberry (1998). Collaborative Response Generation in Planning Dialogues. Computational Linguistics 24(2): 355-400.

Clemen, R. T. (1996). Making Hard Decisions: An Introduction to Decision Analysis. Belmont, California, Duxbury Press.

Dale, R., B. di Eugenio, et al. (1998). Introduction to the Special Issue on NLG. Computational Linguistics 24(3): 345-353.

di Eugenio, B., J. Moore, et al. (1997). Learning Features that Predict Cue Usage. Proceedings of ACL97, Madrid, Spain.

Edwards, W. and F. H. Barron (1994). SMARTS and SMARTER: Improved Simple Methods for Multiattribute Utility Measurement. Organizational Behavior and Human Decision Processes 60: 306-325.

Elhadad, M. (1995). Using Argumentation in Text Generation. Journal of Pragmatics 24: 189-220.

Grosz, B. J., A. K. Joshi, et al. (1995). Centering: A Framework for Modelling the Local Coherence of Discourse. Computational Linguistics 21(2): 203-226.

Infante, D. A. and A. S. Rancer (1982). A Conceptualization and Measure of Argumentativeness. Journal of Personality Assessment 46: 72-80.

Knott, A. (1996). A Data-Driven Methodology for Motivating a Set of Coherence Relations. PhD Thesis, University of Edinburgh.

Lester, J. C. and B. W. Porter (1997). Developing and Empirically Evaluating Robust Explanation Generators: The KNIGHT Experiments. Computational Linguistics 23(1): 65-101.

Miller, M. D. and T. R. Levine (1996). Persuasion. In M. B. Salwen and D. W. Stacks (eds.), An Integrated Approach to Communication Theory and Research. Mahwah, New Jersey: 261-276.
Olson, J. M. and M. P. Zanna (1991). Attitudes and Beliefs; Attitude Change and Attitude-Behavior Consistency. In R. M. Baron and W. G. Graziano (eds.), Social Psychology.

Robin, J. and K. McKeown (1996). Empirically Designing and Evaluating a New Revision-Based Model for Summary Generation. Artificial Intelligence Journal 85: 135-179.

Roth, S. F., M. C. Chuah, et al. (1997). Towards an Information Visualization Workspace: Combining Multiple Means of Expression. Human-Computer Interaction Journal 12(1-2): 131-185.

Shaw, J. (1998). Clause Aggregation Using Linguistic Knowledge. 9th Int. Workshop on NLG, Niagara-on-the-Lake, Canada.

Young, M. R. (to appear). Using Grice's Maxim of Quantity to Select the Content of Plan Descriptions. Artificial Intelligence Journal.

Young, M. R. and J. D. Moore (1994). Does Discourse Planning Require a Special-Purpose Planner? Proceedings of the AAAI-94 Workshop on Planning for Interagent Communication, Seattle, WA.
"Travel Demand Model umentation"