Evaluating Natural Language Generated Database Records

Rita McCardell

Department of Defense
Fort Meade, Maryland 20755

ABSTRACT
With the onslaught of various natural language processing (NLP) systems and
their respective applications comes the inevitable task of determining a way
in which to compare and thus evaluate the output of these systems. This paper
focuses on one such evaluation technique that originated from the text
understanding system called Project MURASAKI. This evaluation technique
quantitatively and qualitatively measures the match (or distance) from the
output of one text understanding system to the expected output of another.

Introduction
Project MURASAKI
The purpose of Project MURASAKI is to develop a foreign language text
understanding system that will demonstrate the extensibility of message
understanding technology.¹ In its current design, Project MURASAKI will
process Spanish and Japanese text and extract information in order to
generate records in a Spanish and a Japanese natural language database,
respectively. The fields within these database records will contain a natural
language phrase or expression in that respective language.

   ¹ Thus, it is not to be confused as a message understanding project, but
   rather a multi-paragraph (i.e., text) understanding project [5].

The domain of Project MURASAKI is the disease AIDS. The associated software
system will include a general domain model of AIDS in the knowledge base.
Within this model, there will be five subdomains:

incidence reports   records the occurrence of AIDS and HIV infection in
                    countries and regions, among various populations,

testing policies    covers measures to test groups for AIDS,

campaigns           describes measures adopted to combat AIDS,

new technologies    lists new equipment and material used in detecting and
                    preventing AIDS, and

AIDS research       details the various vaccines and treatments that are
                    being developed to prevent AIDS.

The subdomains of incidence reports, testing policies and campaigns are found
in the Spanish text while the topics of incidence reports, new technologies
and AIDS research are covered in the Japanese text.

Project MURASAKI will demonstrate a sufficient level of full text
understanding to be able to identify the existence of factual information
within either a given Spanish or Japanese text that belongs within a
particular Spanish or Japanese language database. Then, it will determine
what information in that text constitutes a single record in the selected
database.

The balance of this paper will focus on the evaluation technique: why it was
chosen, some basic assumptions underlying it, as well as the design and
application of this technique. To illustrate various technical points of this
technique, examples will be given using text excerpted from the Spanish AIDS
corpus and its associated (generated) Spanish database records. Appendix A
contains a sample Spanish AIDS text (Text #124) and its English translation.²
Appendix B contains a record from the Incidence Reporting database that was
generated from Text #124. Similarly, Appendix C contains a record from the
Testing Policies database that was also generated from Text #124.

   ² In the MURASAKI text corpus, there do not exist any English translations
   for any of the text.

The Need for a Black Box
Given the overall design of this foreign language text understanding program,
there arose the need for developing a general purpose evaluation technique
[1]. This technique would compare the actual, computer generated output of
one such system to the expected, human generated output of another. That is
to say, given some sample piece of (foreign language) text as input, some
predefined system output (namely, for Project MURASAKI, the generation of a
finite number of database records) could be manually generated so that a
determination as to the correct performance of the computer system could be
made. Given this type of "correct" output, it could
therefore be possible to measure the performance of an automated system based
on this type of well-defined input/output pairs. It was precisely this type
of rationale that led to the development of a black box evaluation:
evaluation primarily focused on what a system produces externally rather than
what a system does internally. In direct contrast to this type of evaluation
is glass box evaluation: "looking inside the system and finding ways of
measuring how well it does something, rather than whether or not it does it"
[5].

With the development of the MURASAKI evaluation technique comes the notion of
two types of measures: a quantitative measure and a qualitative measure. The
quantitative measure determines the number of correct (and/or incorrect)
records that have been generated in any one database while the qualitative
measure evaluates the "correctness" of any database record field.

Background
Some Assumptions
Given the overall design of Project MURASAKI, there are a few assumptions, or
rather, some groundwork that needs to be laid, in order to proceed in the
development of this evaluation technique. These assumptions are explained as
follows:

    • Given the nature of the AIDS text corpus, any one text could possibly
      generate one or more records in one or more databases. This fact is
      loosely referred to as domain complexity. (Furthermore, for any record,
      all fields may not be filled.)

    • Given the structure of the AIDS domain model, it is just as easy (or
      hard) to distinguish one subdomain from another. That is, each database
      is as likely to have a record generated in it as another. This
      hypothesis is known as subdomain differentiation.

    • Upon the determination of what the expected output of Project MURASAKI
      should resemble, a correct record (in any database) is uniquely
      identified by the contents of its key fields plus the contents of one
      or more non-key fields. This statement constitutes the definition of a
      correct record.³

   ³ All appropriate information should be extracted from the text and placed
   in the correct database. A change in any of the key fields will result in
   the generation of a new record. For example, if data from a different time
   period is presented in the text, a key field change is required, and a new
   record is generated. If data from a new region is presented, a new record
   is generated. Examples of key and non-key fields are found in Appendices B
   and C. Key fields, which are found in the thick, darkened boxes, are the
   same throughout each database.

Generated Output: What Could Go Wrong
After a thorough analysis of the system flow for Project MURASAKI and given a
typical AIDS text as system input, the following list represents all possible
undesirable situations that could arise:

 1. Generate one or more records in the wrong database.

 2. Not generate one or more records in the correct database.

 3. Generate too many records in the correct database, i.e., over-generate.

 4. Generate too few records in the correct database, i.e., under-generate.

 5. Generate too many fields in the correct record.

 6. Generate too few fields in the correct record.

 7. Generate the wrong answer in the fields.

Situations 1 and 2 illustrate what could go wrong at the database level while
scenarios 3 and 4 depict possible problems arising at the database record
level. The remaining criteria (namely 5, 6 and 7) show what could happen at
the database record field level. However, the more crucial way of viewing
these problems is not so much in where (i.e., at what level) these events
occur, but rather in how these problems can be detected and thus measured for
evaluation purposes. It is with this motivation that the following
categorization was derived: a quantitative measure could be designed to
account for the problems that could arise at both the database and database
record levels while a qualitative measure could comparably be designed for
evaluation at the database record field level.

In the next section, two examples are given depicting how the quantitative
measure accounts for problems arising at the first two levels. (Note: 'rec.'
is the abbreviation for record in these examples.)

A Quantitative Measure
Background
A scoring function is used for the quantitative measure to calculate an
aggregate score for the number of correct records (as defined previously)
generated ('gen.' in the following examples) for a given MURASAKI text. This
scoring function assigns one point for the generation of a correct record
('cor.') and -p points, where 0 < p < 1, for the generation of an incorrect
record ('inc.').

Some Questions
Given the two examples in Table 1, the following questions come to mind:

    • What should be the value of p? 1/2? 1/3? 1/4? Does bounding it between
      0 and 1 imply any linguistic restrictions on focus or coverage of the
      text? Or rather, should these bounds become parameters of this measure?
    Ex. #1:        DB #1        DB #2      DB #3      TOTAL
    Text xxx       3 rec.       2 rec.     1 rec.     6
    what if,       2 gen.       2 gen.     2 gen.
    where          1 cor.       2 cor.     2 inc.
                   1 inc.
                   (1 inc.)
    score          1-2p         2          -2p        (3-4p)/6

    Ex. #2:        DB #1        DB #2      DB #3      TOTAL
    Text 124       3 rec.       1 rec.     0 rec.     4
    what if,       4 gen.       0 gen.     1 gen.
    where          3 cor.       1 inc.     1 inc.
                   1 inc.
    score          3-p          -p         -p         (3-3p)/4

    Table 1: Examples of How the Quantitative Measure Works
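
The arithmetic behind Table 1 can be reproduced with a short sketch. It is an
illustration only, not the project's implementation: the function name
text_score and the (expected, correct, incorrect) triples are hypothetical,
p = 1/2 is an arbitrary choice inside the stated bounds, and an expected
record that was never generated is counted as incorrect, following the
parenthesized "(1 inc.)" entry in Example #1.

```python
from fractions import Fraction


def text_score(databases, p):
    """Quantitative score for one text.

    databases: one (expected, correct, incorrect) triple per database.
    Each correct record earns 1 point, each incorrect record -p points,
    and the sum is divided by the total number of expected records.
    """
    points = sum(correct - p * incorrect for _, correct, incorrect in databases)
    expected_total = sum(expected for expected, _, _ in databases)
    return points / expected_total


p = Fraction(1, 2)  # illustrative value only; the paper leaves p open

# Example #1 (Text xxx): per-database points 1-2p, 2, -2p
example_1 = [(3, 1, 2), (2, 2, 0), (1, 0, 2)]
# Example #2 (Text 124): per-database points 3-p, -p, -p
example_2 = [(3, 3, 1), (1, 0, 1), (0, 0, 1)]

print(text_score(example_1, p))  # (3 - 4p)/6 with p = 1/2 -> 1/6
print(text_score(example_2, p))  # (3 - 3p)/4 with p = 1/2 -> 3/8
```

Exact fractions are used so that the results can be checked by hand against
the symbolic totals in the table.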

    • Which is worse: to over-generate or under-generate? That is, should we
      have one penalty for one and another penalty for the other? (In Example
      #1 of Table 1, the extra, or over-generated, record is also penalized
      by -p points.)

    • What happens if the numerator is negative? Or equal to 0? Should the
      score in these cases be 0?

    • If the score for a single text is $\mathrm{Text}_i$, then should the
      scoring algorithm for the overall (average) Quantitative Score be
      $\left(\sum_{i=1}^{N} \mathrm{Text}_i\right) / N$, where i = 1, 2, ..., N
      and N is the total number of texts?

A Qualitative Measure
Background
Before proceeding into the design of the qualitative measure, some background
is needed in order to motivate this measure. For Project MURASAKI, a database
field is defined to be logically equivalent to that of a SLOT while the
contents of that field is equivalent to its FILLER.⁴ The slots define three
types of DOMAINS: (1) unordered, e.g., OCCUPATIONS, (2) ordered, e.g.,
MONTHS-OF-THE-YEAR and (3) continuous, e.g., HEIGHT. The slot fillers have
three types of ATTRIBUTES: (1) symbolic, e.g., (temperature(value tepid)),
(2) numeric, e.g., (weight(value 141.3)) and (3) hybrid, e.g.,
(test_results(value(1,000 people were deported))). Also, the slot fillers
have three types of CARDINALITY: (1) single, e.g., (sex(value male)), (2)
enumerated, e.g., (subjects(value(math physics art))) and (3) range, e.g.,
(age(value(0 100))).

   ⁴ The origination of this knowledge representation scheme (KRS) was taken
   from [4]. The application of this KRS to Project MURASAKI was taken from
   [1].

The notion of IMPORTANCE VALUES (IVs) is introduced here and is used to
numerically describe how easy/hard it was (is) to extract a particular
field's (or slot's) information from the text. These importance values are
assigned to both the key and the non-key fields of a database record for each
of the five databases.⁵ Importance values are integers from 1 to 10,
inclusive, and are interpreted as follows:

    IV     Interpretation
    10     very easy to extract
     :
     5     moderately easy/hard to extract
     :
     1     very hard to extract

   ⁵ Recall that each database, for both Spanish and Japanese, corresponds to
   one of the five different subdomains within the AIDS domain model.

With this view of importance values,⁶ the extraction process for Project
MURASAKI may now be considered as two subprocesses; that is, extraction plus
deduction. For example, the key field fuente (meaning "source") may be filled
with OMS or any one of the other periodicals and technical papers that are
listed in the header line of each text (reference Appendix A, where the
fuente is El País). Since the fuente field is constrained to only a few
possible fillers, an importance value of 9 has been assigned to it.⁷

   ⁶ Informal feedback thus far has indicated that these values are geared to
   having more emphasis placed on the records that contain easier fields and
   less on the harder ones, thus not rewarding those who perform well on the
   harder fields.

   ⁷ An importance value of 10 would have been assigned had it not been for
   the fact that in some instances, the "deduction" portion of the extraction
   process for this field specifies the conversion of some sources to their
   respective acronym, e.g., OMS is Organización Mundial de la Salud (WHO).

Scoring Functions & Algorithm
Scoring functions are also used for the qualitative measure to calculate an
aggregate penalty for the fields (both key and non-key) in a database record.
There are three types of scoring functions based upon the cardinality of the
slot fillers: (1) single, (2) enumerated and (3) range.⁸

   ⁸ In Project MURASAKI, only slots that contain single fillers have been
   identified thus far.

An example of an ordered domain with single fillers is that of TEMPERATURE:

        (make-frame TEMPERATURE
              (instance-of (value field))
              (database-in (value x))
              (element-type (value symbol))
              (domain-type (value ordered))
              (cardinality (value single))
              (elements (value cold cool tepid
                          lukewarm warm hot scalding)))
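
For readers more used to record types than to frames, the TEMPERATURE
definition can be mirrored roughly as follows. This is a sketch under stated
assumptions rather than the project's representation: the class and field
names are hypothetical, storing the importance value on the frame is an
assumption, and its default of 5 is a placeholder.

```python
from dataclasses import dataclass
from enum import Enum


class DomainType(Enum):
    UNORDERED = "unordered"
    ORDERED = "ordered"
    CONTINUOUS = "continuous"


class Cardinality(Enum):
    SINGLE = "single"
    ENUMERATED = "enumerated"
    RANGE = "range"


@dataclass(frozen=True)
class Frame:
    """Hypothetical record type mirroring one make-frame definition."""
    name: str
    database_in: str           # single-character database identifier
    element_type: str          # e.g. "symbol" or "number"
    domain_type: DomainType
    cardinality: Cardinality
    elements: tuple
    importance_value: int = 5  # assumption: the IV is kept on the frame

    def __post_init__(self):
        # Importance values are integers from 1 to 10, inclusive.
        if not 1 <= self.importance_value <= 10:
            raise ValueError("importance value must lie between 1 and 10")

    def position(self, filler):
        """Index of a filler within an ordered domain."""
        if self.domain_type is not DomainType.ORDERED:
            raise ValueError("position is only defined for ordered domains")
        return self.elements.index(filler)


TEMPERATURE = Frame(
    name="TEMPERATURE",
    database_in="x",
    element_type="symbol",
    domain_type=DomainType.ORDERED,
    cardinality=Cardinality.SINGLE,
    elements=("cold", "cool", "tepid", "lukewarm", "warm", "hot", "scalding"),
)

print(TEMPERATURE.position("tepid"))  # 2
print(len(TEMPERATURE.elements))      # 7
```

Keeping the elements of an ordered domain in a tuple preserves their order,
so the position of each filler and the size of the domain are both directly
available.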

(The filler x in the database-in slot represents the single character
identification value for a particular AIDS database.) Continuing with this
example, if the following actual output (AO) were to be matched against what
was expected (EO, expected output),

AO:   (temperature (value cool))
EO:   (temperature (value lukewarm))

the penalty assigned to this mismatch would depend on two variables: (1) D,
the distance between the fillers in the ordered set of values and (2) C, the
size of the domain. The scoring function that relates these two variables is

    $$P = \frac{W \times D}{\mathcal{F}(C)} \qquad (1)$$

where W is the numerical weight on the distance between the fillers and
$\mathcal{F}$ is a damping function on the size of the domain.

As mentioned before, an example of an unordered domain with single fillers is
OCCUPATIONS. Since the distance, D, is not meaningful for this example, the
penalty assigned to the match becomes a function merely of the size of the
domain (and hence the probability of the correct filler appearing):

    $$P = \mathcal{F}(C) \qquad (2)$$

Consider the slot CASOS_NOTIFICADOS from the Incidence (I) Reporting
database. It is a continuous domain with (single) numeric fillers and its
attribute entry is the following:

        (make-frame CASOS_NOTIFICADOS
              (instance-of (value field))
              (database-in (value I))
              (element-type (value number))
              (domain-type (value continuous))
              (cardinality (value single))
              (unit-size (value 1))
              (elements (value (0 1200.000))))

As before, suppose we are trying to match the CASOS_NOTIFICADOS slots between
the actual output and the expected output:

AO:    (casos_notificados (value 2.700))
EO:    (casos_notificados (value 2.781))

Since only numbers can be represented in a continuous domain, the elements of
the domain are defined by giving the endpoints of the domain (or closed
interval) and the unit size of representation is used in computing the
distance between fillers. When defined in this manner, the same scoring
function that was used for an ordered domain with single fillers (namely
Equation 1) can be used to compute the penalty for continuous domain sets as
well.

The overall Score for a single database record is

    $$\mathrm{Score} = \sum_{i} (IV_i \times P_i) \qquad (3)$$

for i = 1, 2, ..., (number of fields in that database record). The $P_i$'s
are the computed penalties between each field of the actual output and the
expected output for that particular database record. The $IV_i$'s are the
importance values for the corresponding fields of that database record.

The Scoring Algorithm that computes the overall qualitative measure for the
entire text corpus is given below:

for each TEXT
    for each DB RECORD
        for each DB RECORD FIELD
            if EO_field and AO_field are equal
                then no penalty
                else
                    begin
                        compute penalty ;;; based on appropriate scoring function
                        weight penalty  ;;; according to the IV of that field
                        add weighted penalty to total record penalty
                    end

Some Unresolved Issues
So far, fields that contain either numeric fillers or single word fillers
(fillers that are both easily "distanceable") have been discussed. However,
one would think that the more linguistically complex fields, i.e., those
containing generated natural language phrases, would be more of a true test
for the qualitative measure of this evaluation technique. Consider, for
example, a non-key field like población ("population") (from Appendix C):

AO: población   inmigrantes
EO: población   personas que pretendían entrar en el país ("people who try
                to enter the country")

How should one extend the current notion of the qualitative measure to
include evaluating the distance between natural language phrases of this
kind? It would appear that población would be an unordered domain containing
symbolic information. However, what are the elements of this domain? Should
they have cardinality single? Should they include only those phrases that
were generated from the expected output or should they additionally include
all semantically equivalent phrases, i.e., those containing a common set of
semantic primitives or attributes, as well? If the latter situation were to
prevail, then, in the example listed above, should a penalty be assessed? If
so, by how much? Or rather, should one group together all semantically
equivalent phrases and then determine the distance between these classes?

Consider another example of an unordered domain field from the Testing
Policies Database:
AO:    resultados   han deportado a 1000 personas que resultaron
EO:    resultados   desde 1985, han deportado a 1000 personas que resultaron

Should this non-key field be defined as having both a symbolic and numeric,
i.e., hybrid, attribute? If so, should a scoring function based on symbolic
and numeric text be designed? Given the example above, should a penalty be
assigned for lack of a specific time element (in the actual output) or are
these phrases semantically equivalent?

A possible algorithmic extension to the current qualitative measure is
outlined as follows:

 1. for a given database field, obtain and examine all possible fillers,

 2. group/classify semantically equivalent phrases (by those that share
    common semantic primitives/attributes, e.g., theme, agent, actor, time,
    etc.) and then

 3. calculate the distance between each group/class (through determining by
    just how many semantic primitives/attributes they differ from each
    other).

If this approach were taken, the scoring function of Equation 1 would be
applicable where D would be the distance between classes of fillers rather
than just between the fillers themselves.

Conclusion
It is hoped that this evaluation technique will prove effective for Project
MURASAKI and thus become the basis on which to develop a general purpose
evaluation tool. Research continues on answering those quantitative questions
and on resolving those qualitative issues.

I would like to thank Roberta Merchant, Mary Ellen Okurowski and John Prange
for their assistance and support with this work. Also, I would like to thank
Tom Do-

[3] Merchant, R. and M. E. Okurowski. Personal Communication. January -
    February, 1990.

[4] Nirenburg, S., R. McCardell, E. Nyberg, P. Werner, S. Huffman,
    E. Kenschaft and I. Nirenburg. 1988. DIOGENES-88, CMU Technical Report
    CMU-CMT-88-107, Center for Machine Translation, Carnegie Mellon
    University.

[5] Palmer, M., T. Finin, and S. M. Walter. 1989. "Workshop on the Evaluation
    of Natural Language Processing Systems". RADC-TR-89-302, Final Technical
    Report, Unisys Paoli Research Center.

Appendix A: Sample Spanish AIDS Text and Translation
#124 08jul89 El País Madrid palabras 899

Los Emiratos Arabes Unidos han deportado, desde 1985, a 1.000
The United Arab Emirates has deported, since 1985,

personas que resultaron positivas en las pruebas de detección del SIDA y
people who tested positive on AIDS screening tests and

que pretendían entrar en el país. Un portavoz de su embajada en
who tried to enter the country. An embassy spokesperson in

España manifestó que "es las solución menos mala", ya que la nación "es
Spain said that "it is the less harmful solution", because the nation "is

muy pequeña, tiene menos de medio millón de habitantes y no puede
very small, it has less than half a million inhabitants, and it cannot

hacer frente a los enfermos". La Organización Mundial de la Salud ha
care for the patients". The World Health Organization
err who was instrumental with the preparation of this
document. But most of all, I would like to thank my                           r e g i s t r a d o 10.000 n u e v o s casos d e S I D A e n el
morn for everything. It is in her memory that this paper                      pasado mes de junio,
will be presented.                                                            registered 10,000 new cases of A I D S last June,

                                                                              a s c e n d i e n d o el n d m e r o t o t a l a 167.373. E s p a f i a
                                                                              t l e n e 2.781 casos
References                                                                    raising the total number to 167,373. Spain has 2,781
 [1] MeCardell, R. 1990. "An Evaluation Technique for                         cases
     STUP Database Records". An unpublished docu-
     ment.                                                                    registrados.
 [2] McCardell, R. 1988. "Lexical Selection for Natural
     Language Generation". Thesis Proposal, Computer
                                                                                 9This is the header line for Text #124. This article was re-
     Science Department, University of Maryland Balti-                        ported in the El Pais newspaper, located in Madrid, on July 8,
     more County.                                                             1989 and contains 89 words.
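
The three-step extension to the qualitative measure described before the Conclusion (obtain the fillers, group semantically equivalent phrases, calculate the distance between groups) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not part of Project MURASAKI: the primitive set PRIMITIVES, the hand-annotated sample fillers, the helper names, and the choice to treat two phrases as semantically equivalent only when all of their primitive values are identical. The distance computed between two classes is simply the number of primitives on which they differ, which is one possible reading of step 3 and could stand in for D in the scoring function.

```python
from collections import defaultdict
from itertools import combinations

# Assumed primitive set; the text names theme, agent, actor and time as examples.
PRIMITIVES = ("theme", "agent", "time")


def group_fillers(fillers):
    """Step 2: put phrases with identical primitive values into the same class."""
    classes = defaultdict(list)
    for phrase, attrs in fillers.items():
        key = tuple(attrs.get(name) for name in PRIMITIVES)
        classes[key].append(phrase)
    return dict(classes)


def class_distance(key_a, key_b):
    """Step 3: count the primitives on which two classes differ."""
    return sum(a != b for a, b in zip(key_a, key_b))


def distance_table(classes):
    """Distance for every unordered pair of classes."""
    return {pair: class_distance(*pair) for pair in combinations(classes, 2)}


# Step 1: hypothetical fillers for a single field, annotated by hand.
fillers = {
    "since 1985, has deported 1,000 people": {
        "theme": "deport", "agent": "UAE", "time": "since 1985"},
    "has deported 1,000 people since 1985": {
        "theme": "deport", "agent": "UAE", "time": "since 1985"},
    "has deported 1,000 people": {
        "theme": "deport", "agent": "UAE", "time": None},
}

classes = group_fillers(fillers)
distances = distance_table(classes)
for (key_a, key_b), d in distances.items():
    print(key_a, key_b, d)
```

With these sample annotations the first two phrases fall into one class, and the phrase without a time element forms a second class at distance 1 from it. Whether a missing time element should count as a difference at all is exactly the open qualitative question raised in the text, so the annotation scheme would have to be settled before such distances are used for scoring.
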

Appendix B: An Incidence Reporting Database Record

                                          INCIDENCIA DEL SIDA

          artículo 124-021        fecha 00jun89    fuente El País
          región todo el mundo
          fuente de la información OMS

          VIH:      varones                 mujeres                  niños
                    infectados por VIH (porcentaje)
                    infectados por VIH (estimados)
                    infectados por VIH (notificados)
                    modo de transmisión
                    prevalencia:            % de populación de
                    tasa de progresión al SIDA:                      % para            años
                    tasa de progresión al SIDA:                      % para            años
                    tasa de progresión al SIDA:                      % para            años
                    tasa de progresión al SIDA:                      % para            años
                    período de duplicación                           meses
                    incremento mensual                               %

          SIDA:     varones                        mujeres                    niños
                    casos notificados 10.000 nuevos casos en junio 1989
                    casos estimados                para año(s)
                    prevalencia:                   % de populación de
                    tasa de letalidad              % / casos notificados en
                    tasa de letalidad              % / casos notificados antes de
                    fallecidos                     (número)
                    fallecidos                     % de los casos notificados
                    relación m:f
                    período de duplicación                                    meses
Appendix C: A Testing Policies Database Record

                                PRUEBAS CONTRA EL SIDA

      artículo 124-01T      fecha 08jul89   fuente El País
      región Los Emiratos Árabes Unidos
      fuente de la información portavoz de Los Emiratos Árabes Unidos en España

      autoridad de acción

      nivel de acción


      población personas que pretendían entrar en el país

      local de la prueba

      tipo de prueba
      tipo de prueba
      tipo de prueba

      resultados desde 1985, han deportado a 1.000 personas que resultaron
