					           Attacking the Data Sparseness Problem




     Team: Louise Guthrie, Roberto Basili, Fabio Zanzotto, Hamish
     Cunningham, Kalina Bontcheva, Jia Cui, Klaus Macherey, David
     Guthrie, Martin Holub, Marco Cammisa, Cassia Martin, Jerry Liu,
                      Kris Haralambiev, Fred Jelinek




Sunday, 27 February 2011                            JHU WORKSHOP - 2003
                           Motivation for the project

  Texts for text extraction contain sentences like:

      The IRA bombed a family owned shop in Belfast
      yesterday.



      FMLN            set off a series of explosions in central Bogota
      today.



Sunday, 27 February 2011                               JHU WORKSHOP - 2003
                           Motivation for the project

  We'd like to automatically recognize that both are of the form:

    The IRA [ORGANIZATION] bombed [ATTACKED] a family owned shop in
    Belfast [LOCATION] yesterday [DATE].




    FMLN [ORGANIZATION] set off [ATTACKED] a series of explosions in
    central Bogota [LOCATION] today [DATE].




Sunday, 27 February 2011                                  JHU WORKSHOP - 2003
                           Our Hypotheses


   A transformation of a corpus that replaces words and phrases
      with coarse semantic categories will help overcome the data
      sparseness problem encountered in language modeling and
      text extraction.

   Semantic category information might also help improve
      machine translation

   An initially noun-centric approach will allow bootstrapping to
      other syntactic categories


Sunday, 27 February 2011                          JHU WORKSHOP - 2003
          A six week goal – Labeling noun phrases


   Astronauts aboard the space shuttle Endeavor were forced to
      dodge a derelict Air Force satellite Friday




   Humans aboard space_vehicle dodge satellite timeref.




Sunday, 27 February 2011                            JHU WORKSHOP - 2003
               Preparing the data – Pre-Workshop


   Identify a tag set
   Create a human-annotated corpus
   Create a double-annotated corpus
   Process all data for named entity and noun phrase recognition
      using GATE tools (26 million words)
   Parse the data (about 26 million words)
   Develop algorithms for mapping target categories to Wordnet
      synsets to support the tag set assessment


Sunday, 27 February 2011                          JHU WORKSHOP - 2003
             The Semantic Classes and the Corpus


   A subset of classes available in the Longman Dictionary of
      Contemporary English (LDOCE), electronic version

   Rationale:
           The number of semantic classes was small
           The classes are somewhat reliable, since they were used by a
           team of lexicographers to code noun senses, adjective
           preferences and verb preferences

           Many words have subject area information, which might be
           useful


Sunday, 27 February 2011                                JHU WORKSHOP - 2003
                   The Semantic Classes

  [Tree diagram, built up over four slides; final version:]

  Concrete
      Animate
          Plant
          Animal        (MaleAnim., FemaleAnim.)
          Human         (Male, Female, Collective)
      Inanimate
          Liquid    Gas
          Solid         (Movable, Non-movable)
  Abstract
      Organic
      Physical Qualities


  Sunday, 27 February 2011                             JHU WORKSHOP - 2003
                   The human annotated statistics

   Inter-annotator agreement is 94%, so that is the upper limit of our
      task.

   214,446 total annotated noun phrases (262,683 including “None
      of the Above”)

   29,071 unique vocabulary items (Unlemmatized)

   25 semantic categories
       (162 associated subject areas were identified)

   127,569 instances (59%) have the semantic category Abstract



Sunday, 27 February 2011                                JHU WORKSHOP - 2003
                           The experimental setup


           Human Annotated with semantic tags – Noun Phrases Only

                                 220,000 instances
                                  2 million words




                             BNC (Science, Politics, Business)
                                   26 million words


Sunday, 27 February 2011                                   JHU WORKSHOP - 2003
                  The main development set (dev)



                Training                                   Held out
                113,000 instances                       85,000 instances

                                        Blind portion


                           Machine Learning to improve this




Sunday, 27 February 2011                                          JHU WORKSHOP - 2003
     A challenging development set for experiments
             on unseen words (Hard data set)


             Training – all unambiguous words    Held out – ambiguous words
             125,000 instances                   73,000 instances


                                        Blind portion


                           Machine Learning to improve this




Sunday, 27 February 2011                                      JHU WORKSHOP - 2003
                           Our Experiments include:

   Supervised Approaches (Learning from Human
      Annotated data)

   Unsupervised approaches

   Using outside evidence (the dictionary or wordnet)

   Syntactic information from parsing or pattern matching

   Context words, the use of preferences, the use of topical
      information

Sunday, 27 February 2011                          JHU WORKSHOP - 2003
     Experiments on unseen words - Hard data set


   Training corpus has only words with unambiguous
      annotations
         125,000 training instances
         73,000 instances held-out

   Perplexity – 21

   Baseline – Accuracy 45%

   Improvement – Accuracy 68.5 %

   Context can contribute greatly in unsupervised experiments

Sunday, 27 February 2011                       JHU WORKSHOP - 2003
                           Results on the dev set


   Random with some frequent ambiguous words moved into
      testing

           113,000 training instances
           85,000 instances held-out

   Perplexity – 3.44

   Baseline – Accuracy 80%

   Improvement – Accuracy 87 %

Sunday, 27 February 2011                            JHU WORKSHOP - 2003
       The scheme for annotating the large corpus


   After experimenting with the development sets, we need
      a scheme for making use of all of the dev corpus to tag
      the blind corpus.

   We developed an incremental scheme within the
      maximum entropy framework

   Several talks have to do with re-estimation techniques
      useful to the bootstrapping process.

Sunday, 27 February 2011                        JHU WORKSHOP - 2003
                           Terminology

   Seen words – words seen in the human annotated data
      (new instances of known words)

     Unseen words – not in the training material but in the
      dictionary

     Novel words – not in the training material nor in the
      dictionary/Wordnet



Sunday, 27 February 2011                         JHU WORKSHOP - 2003
                           Bootstrapping



                              Human Annotated


                               Blind portion



                              Unannotated Data




Sunday, 27 February 2011                         JHU WORKSHOP - 2003
              The Unannotated Data – Four types

  [Built up over four slides]

                                   Human Annotated
                                    Blind portion

               Unambiguous                   515,000 instances
               Seen in training              550,000 instances
               Unseen but in dictionary        9,000 instances
               Novel                          20,000 instances




Sunday, 27 February 2011                              JHU WORKSHOP - 2003
     Data sizes (instances):
         Annotated        201K
         Unambiguous      515K
         Seen             550K
         Unseen             9K
         Novel             20K

     Each portion is used in turn for training, then the test data is tagged.

     Unambiguous/Annotated instances are marked as <0, 0, ..., 0, 1>
     Ambiguous instances are marked with appropriate probabilities,
     e.g. a seen word w is marked <p(C1|w), ..., p(Cn|w)>
Sunday, 27 February 2011                                          JHU WORKSHOP - 2003
                           Results on the Blind Data


   We set aside one tenth of the annotated corpus

   Randomly selected within each of the domains

   It contained 13,000 annotated instances

   The baseline here was very high - 90% with simple techniques

   We were able to achieve 93.5% accuracy




Sunday, 27 February 2011                             JHU WORKSHOP - 2003
                                 Overview
    Bag of words (Kalina)

    Evaluation (Kris)

    Supervised methods using maximum entropy (Klaus)

    Incorporating context preferences (Jerry)

    Experiments with Adjective Classes and Subject (David, Jia, Martin)

    Structuring the context using syntax and semantics (Cassia, Fabio)

    Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)

    Unsupervised Re-estimation (Roberto)

    Student Proposals (Jia, Dave, Marco)

    Conclusion
Sunday, 27 February 2011                                     JHU WORKSHOP - 2003
              Semantic Categories and MT

   10 test words – high, medium, and low frequency

   Collected their target translations using
      EuroWordNet (e.g. Dutch)

   Crane:
           [lifts and moves heavy objects] – hijskraan, kraan

           [large long-necked wading bird] - kraanvogel



Sunday, 27 February 2011                                  JHU WORKSHOP - 2003
                           SemCats and MT (2)

   Manually mapped synonym sets to semantic
      categories

           automatic mapping will be presented later


   Studied how many synonym sets are ruled out as
      translations by the semantic category



Sunday, 27 February 2011                         JHU WORKSHOP - 2003
                          Some Results
 3 words – full disambiguation

     crane (Mov.Solid/Animal), medicine (Abstract/Liquid), plant
     (Plant/Solid)

 7 words – the categories substantially reduce the possible
  translations

     club - [Abstr/an association of people...], [Mov.Solid/stout stick...],
     [Mov.Solid/ an implement used by a golfer...], [Mov.Solid/a playing
     card...], [NonMov.Solid/a building …]

        club/NonMov.Solid – [clubgebouw, clubhuis, …]
        club/Abstr. – [bevolkingsgroep, broederschap, …]
        club/Mov.Solid – [knots, kolf, malie], [kolf, malie], [club]
Sunday, 27 February 2011                                  JHU WORKSHOP - 2003
                           The architecture
   The “multiple-knowledge sources” WSD architecture
      (Stevenson 03)

   Allow use of multiple taggers and combine their results
      through a weighted function

   Weights can be learned from a corpus

   All taggers implemented as GATE components and
      combined in applications


Sunday, 27 February 2011                      JHU WORKSHOP - 2003
                            The Bag-of-Words Tagger

 The bag-of-words tagger is an Information Retrieval-inspired
  tagger with parameters:
       Window size: 50 default value

       What POS to put in the content vectors (default: nouns and verbs)

       Which similarity measure to use

 Used in WSD (Leacock et al 92)

 Crane/Animal={species, captivity, disease…}

 Crane/Mov.Solid={worker, disaster, machinery…}
 Sunday, 27 February 2011                               JHU WORKSHOP - 2003
                             BoW classifier (2)

 Seen words classified by calculating the inner product
    between their context vector and the vectors for each
    possible category

 Inner product calculated as:
     Binary vectors – number of matching terms

        Weighted vectors:
             Leacock's measure – favour concepts that occur frequently in exactly one
             category

             Take into account the polysemy of concepts in the vectors
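
  A minimal sketch of the binary-vector scoring described above; the category
  vectors, the example window and the helper name are illustrative, not taken
  from the workshop code:

    # Minimal sketch of the bag-of-words tagger's scoring step (binary vectors):
    # the category whose context vector shares the most terms with the target
    # word's context window wins.
    def bow_classify(context_terms, category_vectors):
        """context_terms: set of nouns/verbs in the window around the target word.
        category_vectors: dict mapping category -> set of terms seen with it."""
        scores = {cat: len(context_terms & terms)           # binary inner product =
                  for cat, terms in category_vectors.items()}  # number of matching terms
        return max(scores, key=scores.get)

    category_vectors = {
        "Animal":    {"species", "captivity", "disease"},
        "Mov.Solid": {"worker", "disaster", "machinery"},
    }
    print(bow_classify({"machinery", "worker", "site"}, category_vectors))  # -> Mov.Solid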



Sunday, 27 February 2011                                         JHU WORKSHOP - 2003
                   Current performance measures

   The baseline frequency tagger on its own – 91% on the
      test (blind) set

   Bag-of-words tagger on its own – 92.7%

   Combined architecture –93.2% (window size 50, using
      only nouns, binary vectors)




Sunday, 27 February 2011                     JHU WORKSHOP - 2003
                  Future work on the architecture

 Integrate syntactic information, subject codes, and document
    topics

 Experiment with cosine similarity

 Implement [Yarowsky'92] WSD algorithm

 Implement the weighted function module

 Experiment with integrating the ME tools as one of the
    taggers supplying preferences for the weighting module

Sunday, 27 February 2011                      JHU WORKSHOP - 2003
                                 Overview
    Bag of words (Kalina)

    Evaluation (Kris)

    Supervised methods using maximum entropy (Klaus)

    Incorporating context preferences (Jerry)

    Experiments with Adjective Classes and Subject (David, Jia, Martin)

    Structuring the context using syntax and semantics (Cassia, Fabio)

    Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)

    Unsupervised Re-estimation (Roberto)

    Student Proposals (Jia, Dave, Marco)

    Conclusion
Sunday, 27 February 2011                                     JHU WORKSHOP - 2003
                           Accuracy Measurements



                                Kris Haralambiev




• How to measure the accuracy
• How to distinguish “correct”, “almost correct” and “wrong”



Sunday, 27 February 2011                           JHU WORKSHOP - 2003
                      Exact Match Measurements

  W = (w1, w2, …, wn) – vector of the annotated words

  X = (x1, x2, …, xn) – categories assigned by the annotators

  Y = (y1, y2, …, yn) – categories assigned by a program

  Exact match (default) measurement – 1 for match and 0

     for mismatch of each (xi,yi) pair:

                       accuracy(X,Y) = |{i : xi = yi}|


Sunday, 27 February 2011                                 JHU WORKSHOP - 2003
                                 The Hierarchy


                               Concrete C                             Abstract T



              Animate Q                         Inanimate I      Organic 5   PhysQual 4

Plant P       Animal A     Human H          Liquid L    Gas G   Solid S


                                                         Movable N    Non-movable J
          B     D          F    M




Sunday, 27 February 2011                                         JHU WORKSHOP - 2003
                   Ancestor Relation Measurement

• The exact match will assign 0 for the pairs (H,M), (H,F), (A,Q), …
  (sub-hierarchy used in the examples: Animate Q → Animal A, Human H;
   Human H → F, M)

• Give a partial score for two categories in an ancestor relation

 weight(Cat) = |{i : xi is in the tree with root Cat}|

 score(xi, yi) = min( weight(xi)/weight(yi), weight(yi)/weight(xi) )

 accuracy(X,Y) = Σi score(xi, yi)

 Sunday, 27 February 2011                        JHU WORKSHOP - 2003
                     Edge Distance Measurement

• The ancestor relation will assign some score for pairs like (H,M), (A,Q), but
  will assign 0 for pairs like (M,F), (A,H)

• Going further, we want to compute the similarity (distance) between X and Y

 distance(xi, yi) = the length of the simple path from xi to yi

 each edge can be given an individual length, or all edges have length 1
 (we prefer the latter)

Sunday, 27 February 2011                         JHU WORKSHOP - 2003
             Edge Distance Measurement (cont' d)

  distance(X,Y) = Σi distance(xi, yi)

  Accuracy is interpolated between the two extremes:
           100%           corresponds to distance 0
           0%             corresponds to max_possible_distance
  i.e. accuracy(X,Y) = 1 − distance(X,Y) / max_possible_distance

  max_possible_distance = Σi maxcat distance(xi, cat)

  It might be reasonable to use the average instead of the max
  (all three measures are sketched in code below)
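
  A small sketch of the three measures over a toy fragment of the hierarchy
  (letter codes as on the hierarchy slide: Q = Animate, A = Animal, H = Human,
  F and M below Human); the data and function names are illustrative:

    # Toy fragment of the category tree: child -> parent.
    PARENT = {"A": "Q", "H": "Q", "F": "H", "M": "H"}

    def ancestors(c):                            # c itself, then its ancestors up to the root
        chain = [c]
        while c in PARENT:
            c = PARENT[c]
            chain.append(c)
        return chain

    def weight(cat, gold):                       # weight(Cat) = |{i : x_i lies in the tree rooted at Cat}|
        return sum(cat in ancestors(x) for x in gold)

    def exact_match(x, y):
        return 1.0 if x == y else 0.0

    def ancestor_score(x, y, gold):              # partial credit when x and y are in ancestor relation
        if x in ancestors(y) or y in ancestors(x):
            wx, wy = weight(x, gold), weight(y, gold)
            return min(wx / wy, wy / wx)
        return 0.0

    def edge_distance(x, y):                     # length of the simple path from x to y (all edges = 1)
        ax, ay = ancestors(x), ancestors(y)
        lca = next(c for c in ax if c in ay)
        return ax.index(lca) + ay.index(lca)

    X = ["H", "A", "M", "F"]                     # categories assigned by the annotators
    Y = ["M", "Q", "M", "H"]                     # categories assigned by a program
    n = len(X)
    print("exact match   :", sum(map(exact_match, X, Y)) / n)
    print("ancestor score:", sum(ancestor_score(x, y, X) for x, y in zip(X, Y)) / n)
    print("edge distance :", sum(edge_distance(x, y) for x, y in zip(X, Y)))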

Sunday, 27 February 2011                             JHU WORKSHOP - 2003
                               Some Baselines


  • Training + held-out data
                                     Exact Match EdgeDist.(aver) EdgeDist.(max)
              1st Wordnet sense
              & random for unknown      51,24         49,30          65,58
              1st Wordnet sense
              & 'T' for unknown         64,40         61,61          73,94
              1st Long.Dict. Sense
              & 'T' for unknown         69,77         65,15          76,33

  Blind data
                                     Exact Match EdgeDist.(aver) EdgeDist.(max)
              1st Wordnet sense
              & random for unknown      51,21         51,86          67,45
              1st Wordnet sense
              & 'T' for unknown         62,72         60,07          73,01
              1st Long.Dict. Sense
              & 'T' for unknown         69,75         64,31          75,86

Sunday, 27 February 2011                                           JHU WORKSHOP - 2003
                                 Overview
    Bag of words (Kalina)

    Evaluation (Kris)

    Supervised methods using maximum entropy (Klaus)

    Incorporating context preferences (Jerry)

    Experiments with Adjective Classes and Subject (David, Jia, Martin)

    Structuring the context using syntax and semantics (Cassia, Fabio)

    Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)

    Unsupervised Re-estimation (Roberto)

    Student Proposals (Jia, Dave, Marco)

    Conclusion
Sunday, 27 February 2011                                     JHU WORKSHOP - 2003
               Supervised Methods using Maximum Entropy


  Jia Cui, David Guthrie, Martin Holub, Jerry Liu, Klaus Macherey




Sunday, 27 February 2011                         JHU WORKSHOP - 2003
 Overview

 • Maximum Entropy Approach

 • Feature Functions

 • Word Classes

 • Experimental Results




Sunday, 27 February 2011      JHU WORKSHOP - 2003
Maximum Entropy Approach

Principle:

• Define suitable features (constraints) on training data

• Find maximum entropy distribution that satisfies constraints (GIS)




Properties :

• Easy to integrate information from several knowledge sources

• Always converges to the global optimum on training data

Usage of: YASMET toolkit (by F. J. Och) & JME (by J. Cui)
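
Neither YASMET nor JME is reproduced here; as an illustrative stand-in, the same
kind of conditional maximum-entropy model can be sketched as multinomial logistic
regression over indicator features (the feature names and training data below are
invented):

    # Illustrative maximum-entropy classifier over indicator features
    # (multinomial logistic regression is the same conditional ME model;
    # the real experiments used YASMET / JME, not scikit-learn).
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Each instance: indicator features for the head word, a context adjective,
    # and a Longman subject code (all values here are made up).
    train = [
        ({"hw=shuttle": 1, "adj=derelict": 1, "subj=SPACE": 1}, "Mov.Solid"),
        ({"hw=astronaut": 1, "adj=veteran": 1, "subj=SPACE": 1}, "Human"),
        ({"hw=satellite": 1, "adj=derelict": 1, "subj=SPACE": 1}, "Mov.Solid"),
        ({"hw=idea": 1, "adj=terrific": 1, "subj=NONE": 1}, "Abstract"),
    ]
    vec = DictVectorizer()
    X = vec.fit_transform([f for f, _ in train])
    y = [c for _, c in train]

    model = LogisticRegression(max_iter=1000)   # convex training objective, so it
    model.fit(X, y)                             # converges to the global optimum, as GIS does

    test = vec.transform([{"hw=satellite": 1, "adj=veteran": 1, "subj=SPACE": 1}])
    print(dict(zip(model.classes_, model.predict_proba(test)[0].round(3))))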

Sunday, 27 February 2011                                    JHU WORKSHOP - 2003
Feature Functions

• Prior Features
  Use Unigram probabilities P(c) for semantic categories c as feature




• Lexical Features
  Use the lexical information directly as a feature
  Reduce number of features by using the following definition




Sunday, 27 February 2011                                JHU WORKSHOP - 2003
Feature Functions (cont’d)
• Longman Preference Features
 Longman Dictionary provides subject codes for nouns
 Use frequency of preferences as additional features




• Unknown Word Features
 - Prefix features
 - Suffix features
 - Human-IST feature


Sunday, 27 February 2011                               JHU WORKSHOP - 2003
Word Classes

• Lemmatization
  - Eliminate inflections and reduce words to their base form
  - Assumption: different cases of one word have the same semantic
classes

• Mutual Information
  - Measures the amount of information one random variable contains about
    another




- Applied for nouns and adjectives
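
For illustration, mutual information between an adjective and the semantic
category of the noun it modifies can be computed from a co-occurrence table as
sketched below (the counts are invented):

    # Sketch: mutual information between adjective A and semantic category C,
    # I(A; C) = sum_{a,c} p(a,c) * log( p(a,c) / (p(a) p(c)) ).
    import math
    from collections import Counter

    pairs = [("angry", "H"), ("angry", "T"), ("angry", "H"),      # invented counts
             ("derelict", "N"), ("derelict", "N"), ("angry", "H")]

    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pc = Counter(c for _, c in pairs)
    n = len(pairs)

    mi = sum((nac / n) * math.log((nac / n) / ((pa[a] / n) * (pc[c] / n)))
             for (a, c), nac in joint.items())
    print(round(mi, 3))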


Sunday, 27 February 2011                             JHU WORKSHOP - 2003
                                 Overview
    Bag of words (Kalina)

    Evaluation (Kris)

    Supervised methods using maximum entropy (Klaus)

    Incorporating context preferences (Jerry)

    Experiments with Adjective Classes and Subject (David, Jia, Martin)

    Structuring the context using syntax and semantics (Cassia, Fabio)

    Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)

    Unsupervised Re-estimation (Roberto)

    Student Proposals (Jia, Dave, Marco)

    Conclusion
Sunday, 27 February 2011                                     JHU WORKSHOP - 2003
                           Incorporating Context Features

  Jia Cui, David Guthrie, Martin Holub, Klaus Macherey, Jerry Liu




Sunday, 27 February 2011                                    JHU WORKSHOP - 2003
Overview

• Result Analysis

• Rewind: Encoding Feature Functions

• Incorporating Context Features

• Clustering Methods

• Experimental Results




Sunday, 27 February 2011               JHU WORKSHOP - 2003
Adjectives

Continuing the example "angry kid":

Describe adjectives by the categories of nouns that they prefer to modify, to
avoid sparseness.

• Obtain a set of categories for both "kid" and "angry"

 - kid: { A, S, H, H, H }

 - angry: { T, H }

• We can concatenate them (merging): { A, S, H, H, H, T, H }

• Or do some kind of component-wise multiplication (pruning): { H, H, H }

• Simply merging introduces irrelevant categories - increases entropy
  (both schemes are sketched below)
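
In code, the two combination schemes might look like this (a sketch using the
category multisets from this slide):

    # Two ways to combine the category multisets of "kid" and "angry":
    # merging keeps everything, pruning keeps only categories shared by both.
    kid   = ["A", "S", "H", "H", "H"]
    angry = ["T", "H"]

    merged = kid + angry                                  # { A, S, H, H, H, T, H }
    pruned = [c for c in kid if c in set(angry)]          # { H, H, H }
    print(merged, pruned)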

Sunday, 27 February 2011                                  JHU WORKSHOP - 2003
Clustering Methods

The Longman dictionary contains such adjective preferences, but we can also
generate preferences from the corpus.

• Measure the entropy of each adjective, by getting frequency of each
  adjective modifying a noun of a particular category

 - The lower the entropy, the more contextually useful the adjective

 - Measure confidence of adjective by frequency




• Example : “angry”

 - adj: angry, entropy: 2.18, freqs: [155, 55, 9, 7, 0 ....]

 - pick: angry: { T, H }

Sunday, 27 February 2011                                       JHU WORKSHOP - 2003
                                 Overview
    Bag of words (Kalina)

    Evaluation (Kris)

    Supervised methods using maximum entropy (Klaus)

    Incorporating context preferences (Jerry)

    Experiments with Adjective Classes and Subject (David, Jia, Martin)

    Structuring the context using syntax and semantics (Cassia, Fabio)

    Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)

    Unsupervised Re-estimation (Roberto)

    Student Proposals (Jia, Dave, Marco)

    Conclusion
Sunday, 27 February 2011                                     JHU WORKSHOP - 2003
                      Hard vs Soft Word Clusters


 Words as features are sparse, so we need to cluster them

     Hard clusters
           A feature is assigned to one and only one cluster. (The cluster
           for which there exists the strongest evidence.)
     Soft clusters
           A feature is assigned to as many clusters as there is evidence
           for.
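
  A minimal sketch of the two assignment policies (the evidence counts and the
  threshold are invented):

    # Hard vs. soft cluster membership for a word feature, from invented
    # category-evidence counts.
    evidence = {"angry": {"T": 155, "H": 55, "S": 9}}     # counts of categories the feature supports

    def hard_cluster(word):
        counts = evidence[word]
        return max(counts, key=counts.get)                # the single strongest category

    def soft_clusters(word, min_count=10):
        counts = evidence[word]
        return [c for c, n in counts.items() if n >= min_count]   # every category with enough evidence

    print(hard_cluster("angry"), soft_clusters("angry"))  # T  ['T', 'H']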




Sunday, 27 February 2011                                 JHU WORKSHOP - 2003
          Using clustering and contextual features

  Baseline – prior + most frequent semantic category
  All words within the target noun phrase (with a threshold
   of 10 occurrences)
  Adjective hard clusters
         Clusters are defined by most frequent semantic category
  Noun soft clusters
     Clusters are defined by all semantic categories
  Combined adjective hard clusters and noun soft clusters




Sunday, 27 February 2011                               JHU WORKSHOP - 2003
                 Results with clusters and context


                                        Accuracy   Edge Distance
            Baseline                     90,52
            Noun Soft Clusters           92,23
            Noun Phrase Words            90,56
            Adjective Hard Clusters      90,68
            Combined (Adjective+Noun)    92,30



  Training on Training+Held-out

  Testing on Blind Data


                                               ME tool: Jia's MaxEnt toolkit

Sunday, 27 February 2011                                 JHU WORKSHOP - 2003
 Measuring Usefulness of Adjectives in Context


  We have a huge number of nouns that are assigned a
  semantic tag from
                   A. Training Data
                   B. The BNC corpus when the noun is unambiguous with
                   regard to the possible semantic category.
  Using the adjectives that modify these nouns we are able to
  compute the entropy:
                     H(C | a) = − Σc P(c | a) log P(c | a)

       a is an adjective, C is the set of semantic categories




Sunday, 27 February 2011                                        JHU WORKSHOP - 2003
                           Clustering Adjectives

• We take adjectives with low H(T | a) and make clusters
  from them depending on which semantic category they
  predict

    adjcluster(t) = { a : H(T | a) < θ1 , f(a) > θ2 , t = argmaxt' P(t' | a) }

                 θ1 and θ2 are thresholds

• Then use each cluster of adjectives as a context feature
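
A sketch of the clustering rule above; the thresholds, counts and the second
adjective are invented:

    # adjcluster(t) = { a : H(T|a) < theta1, f(a) > theta2, t = argmax_t' P(t'|a) }
    import math
    from collections import defaultdict

    # freq[a][t] = how often adjective a modifies a noun of category t (invented counts)
    freq = {"angry":  {"T": 155, "H": 55, "S": 9, "N": 7},
            "wooden": {"N": 210, "J": 40}}

    THETA1, THETA2 = 1.5, 50          # entropy and frequency thresholds (illustrative)

    def entropy(counts):
        total = sum(counts.values())
        return -sum((n / total) * math.log2(n / total) for n in counts.values() if n)

    clusters = defaultdict(list)
    for a, counts in freq.items():
        if entropy(counts) < THETA1 and sum(counts.values()) > THETA2:
            best = max(counts, key=counts.get)            # t = argmax_t' P(t'|a)
            clusters[best].append(a)
    print(dict(clusters))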




Sunday, 27 February 2011                                         JHU WORKSHOP - 2003
                                 Overview
    Bag of words (Kalina)

    Evaluation (Kris)

    Supervised methods using maximum entropy (Klaus)

    Incorporating context preferences (Jerry)

    Experiments with Adjective Classes and Subject (David, Jia, Martin)

    Structuring the context using syntax and semantics (Cassia, Fabio)

    Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)

    Unsupervised Re-estimation (Roberto)

    Student Proposals (Jia, Dave, Marco)

    Conclusion
Sunday, 27 February 2011                                     JHU WORKSHOP - 2003
              Structuring the context using syntax

Syntactic Model: eXtended Dependency Graph
Syntactic Relations considered: V_Obj, V_Sog, V_PP, NP_PP
Results
                   Held-out
                   Model          Accuracy
                   Baseline         81,00
                   +CONTEXT         79,49

                                         Used Tools:
                                         Syntactic Parser “Chaos” (Basili & Zanzotto)
                                         Max Entropy Toolkit (implemented by Jia Cui)
Observations:
 Features are too scarce
 We're overfitting! We need more intelligent methods.

Sunday, 27 February 2011                                    JHU WORKSHOP - 2003
                     Semantic Fingerprint:
             Generalizing nouns using EuroWordnet

 Top level generalizations



  Base concepts (tree structure)


  Bottom level – synonym sets
  (directed graph)




Sunday, 27 February 2011                 JHU WORKSHOP - 2003
          Noun semantic fingerprints: an example

  Words in the events are replaced by "basic concepts"

  [Figure: for the sentence "the CEO drove into the city with his own car",
   nouns are mapped to basic concepts (person, social group, assemblage;
   object, location, area, district, geographic area, administrative district,
   urban center), yielding a HierarchicalFingerprint and a FlatSemanticFingerprint]

Sunday, 27 February 2011                                        JHU WORKSHOP - 2003
           Verb semantic fingerprints: an example

       Generalized Features              Lexicalized Features
       drive_V_PP_into                     drive_V_PP_into
       travel _V_PP_into
       move_V_PP_into




         drive travel          move




         [the CEO] [drove] [into the city] [with his own car]

                   V_Subj   V_PP
Sunday, 27 February 2011                              JHU WORKSHOP - 2003
                    How to exploit the word context?

    [Semantic Category]-Subj-to_think
         Positive observations
           his wife thought he should eat more
           the waitress thought that Italians leave tiny tips
    Our conceptual hierarchy contains FemaleHuman and MaleHuman...



       P(FemaleHuman|Subj-to_think) is high and reliable

? Fabio thought he has had a terrific idea before looking at the results

                     Fabio is a FemaleHuman !
   Sunday, 27 February 2011                                  JHU WORKSHOP - 2003
                         How to exploit the word context?


        “one-semantic-class-per-discourse” hypothesis

   [Figure: occurrences of a word W scattered through a document, each
    tagged with a semantic class (mostly H, occasionally T)]

                                              Verifying the hypothesis
                                              #instances                 200K
                                              #instances (w,T,Doc)        60K
                                              #errors                 1K/200K




Sunday, 27 February 2011                                      JHU WORKSHOP - 2003
                     Syntactic slots and slot fillers



   [Figure: occurrences of a word W in the training corpus are mapped to
    slot-filler feature vectors (VSubj, VObj, Modifier, Modified, Other)]

                                   MI(SemTag, Feature) as feature ranker

Sunday, 27 February 2011                                       JHU WORKSHOP - 2003
                 How to exploit the word context?

 Using...
  a revised hierarchy
         Female animal and male animal → Animal
         Female human and male human → Human
         Female and male → Animate
  the "one-semantic-class-per-discourse" hypothesis
  the "semantic fingerprint": generalising nouns to the basic
     concepts of EuroWordnet and verbs to the topmost synsets
     in Wordnet




Sunday, 27 February 2011                          JHU WORKSHOP - 2003
                                               Results


     Held-out combined
                          Method                                Accuracy
                          Baseline                                81,75
                          SemanticFingerprint(HW)                 83,30
                          SemanticFingerprint(HW,CONTEXT)         83,44
                          HierarchicalFingerprint(HW,CONTEXT)     83,95




                                                                   Test bed characteristics
Used Tools:                                                                           Train   Test
Syntactic Parser “Chaos” (Basili & Zanzotto)                      Held-out Combined   30K     30K
Max Entropy Toolkit (implemented by Jia Cui)                      Held-out Plain      100K    100K



   Sunday, 27 February 2011                                             JHU WORKSHOP - 2003
                                    Results: a closer look




    Held-out combined
                                                  All    Unambiguous Ambiguous
           Baseline                              81,75      93,45      70,55
           HierarchicalFingerprint(HW,CONTEXT)   83,95      96,08      72,34




Used Tools:
Syntactic Parser “Chaos” (Basili & Zanzotto)
Max Entropy Toolkit (implemented by Jia Cui)

  Sunday, 27 February 2011                                       JHU WORKSHOP - 2003
                                 Overview
    Bag of words (Kalina)

    Evaluation (Kris)

    Supervised methods using maximum entropy (Klaus)

    Incorporating context preferences (Jerry)

    Experiments with Adjective Classes and Subject (David, Jia, Martin)

    Structuring the context using syntax and semantics (Cassia, Fabio)

    Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)

    Unsupervised Re-estimation (Roberto)

    Student Proposals (Jia, Dave, Marco)

    Conclusion
Sunday, 27 February 2011                                     JHU WORKSHOP - 2003
 Unsupervised Semantic Labeling of Nouns using
                     ME



                                  Frederick Jelinek
                           Semantic Analysis for Sparse Data




Sunday, 27 February 2011                                       JHU WORKSHOP - 2003
                           Motivation

 • Base ME features on lexical and grammatical
   relationships found in the context of nouns to be labeled
 • Hand-labeled data too sparse to allow using powerful ME
   compound features
 • Wish to utilize large unlabeled British National Corpus
   (and internet, etc.) for training
 • Will use dictionary and initialization by statistics from
   smaller hand-labeled corpus




Sunday, 27 February 2011                     JHU WORKSHOP - 2003
                 Format of Labeled Training Data

  w is the noun to be labeled
  r1, r2…, rm are the relationships in the context of w which
   correlate with the label appropriate to w
  C is the label denoting semantic class
  f1, f2,…, fK are the label counts, i.e., fC = 1 and fi = 0 for i ≠ C
  Then the event file format is
     (f1, f2,…, fK , w, r1, r2…, rm )




Sunday, 27 February 2011                         JHU WORKSHOP - 2003
                    Format of BNC Training Data

  The label counts fi will be fractional, with fi = 0 if the
   dictionary does not allow noun w to have the ith label.
  Always fi ≥ 0 and Σi fi = 1
  The problem is the initial selection of the values of fi
  Suggestion: let fi = Q(C = i | w), where Q denotes the
   empirical distribution from the hand-labeled data (sketched below).
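
  A sketch of that initialisation; the category codes and probability values
  are invented, and the renormalisation step is an assumption consistent with
  Σi fi = 1:

    # Initial fractional label counts f_i for a BNC noun w: start from the
    # empirical distribution Q(C=i | w) of the hand-labelled data, zero out
    # labels the dictionary does not allow for w, and renormalise so that
    # every f_i >= 0 and sum_i f_i = 1.
    def initial_counts(word, q_given_word, allowed_labels):
        f = {c: (p if c in allowed_labels[word] else 0.0)
             for c, p in q_given_word[word].items()}
        z = sum(f.values())
        if z == 0:                               # no usable evidence: uniform over allowed labels
            allowed = allowed_labels[word]
            return {c: 1.0 / len(allowed) for c in allowed}
        return {c: p / z for c, p in f.items()}

    q_given_word = {"stock": {"T": 0.5, "A": 0.2, "L": 0.2, "H": 0.1}}    # invented Q(C|w)
    allowed_labels = {"stock": {"T", "A", "L", "N", "P", "J"}}            # labels LDOCE allows for "stock"
    print(initial_counts("stock", q_given_word, allowed_labels))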




Sunday, 27 February 2011                        JHU WORKSHOP - 2003
   [Figure: data portions used in the bootstrap – Annotated, Unambiguous,
    BNC Seen, Unseen, Novel – each used in turn for training before the
    held-out data is tagged]

              Unambiguous       Marked as {0, 0, .., 1, .., 0}
                                Marked with appropriate probabilities,
               Ambiguous        e.g. Novel is initially {1/25, …, 1/25}
   Sunday, 27 February 2011                                JHU WORKSHOP - 2003
                    Inner Loop ME Re-estimation

 The empirical distribution used in the ME iterations is
  obtained from sums of values of fi found in both the
  labeled and BNC data sets.
 These counts determine
     which of the potential features will be selected as
     actual features
      the values of the λ parameters in the ME model




Sunday, 27 February 2011                        JHU WORKSHOP - 2003
          Constraints and Equations

        Σc,w,R P(c, w, R) ki(c, w, R)  =  Σc,w,R Q(c, w, R) ki(c, w, R)

             (a feature ki is selected only if
              N · Σc,w,R Q(c, w, R) ki(c, w, R) exceeds a threshold)

                            1   if c = c*, w = w*, rj = r*
            ki(c, w, R)  =
                            0   otherwise

                 P(c, w, R) = (1/Z) exp{ Σi λi ki(c, w, R) }
Sunday, 27 February 2011                                    JHU WORKSHOP - 2003
                   Outer Loop Re-scaling of Data

  Once the ME model P(C = c | w,r1,…,rm) is estimated, the
     fi values in event files of the BNC portion of data are re-
     scaled.
  fi values in the hand labeled portion remain unchanged

  New empirical counts are thus available
          to determine the identity of new actual features
          the parameters of a new ME probability model
  Etc.


Sunday, 27 February 2011                                 JHU WORKSHOP - 2003
                   Preliminary Results by Jia Cui

 Sheffield annotated corpus and BNC unambiguous nouns provide
 initial statistics.
 Label instances of BNC corpus whose headwords are seen in
 unambiguous data but are ambiguous according to the Longman
 dictionary.
   [Chart: "Error rates in iteration: sheffield + sheffield' with BNC seen" –
    error rate over 10 bootstrap iterations, on a scale from about 0.073 to 0.082]



Sunday, 27 February 2011                                                               JHU WORKSHOP - 2003
                           Concluding Thoughts

 • Preliminary results are promising
 • Method requires theoretical and practical exploration
 • Changing of features and feature targets is a new
   phenomenon in ME estimation
 • Careful selection of relationships, and basing them on
   clusters where required, will lead to effective features
 • See proposal by Jia Cui and David Guthrie



Sunday, 27 February 2011                         JHU WORKSHOP - 2003
                                 Overview
    Bag of words (Kalina)

    Evaluation (Kris)

    Supervised methods using maximum entropy (Klaus)

    Incorporating context preferences (Jerry)

    Experiments with Adjective Classes and Subject (David, Jia, Martin)

    Structuring the context using syntax and semantics (Cassia, Fabio)

    Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)

    Unsupervised Re-estimation (Roberto)

    Student Proposals (Jia, Dave, Marco)

    Conclusion
Sunday, 27 February 2011                                     JHU WORKSHOP - 2003
 Unsupervised Semantic Tagging


   Roberto Basili, Fabio Zanzotto, Marco Cammisa, Martin Holub, Kris
   Haralambiev, Cassia Martin, Jia Cui, David Guthrie,




   JHU Summer Workshop 2003
   August, 22nd 2003 – Baltimore
Sunday, 27 February 2011                                JHU WORKSHOP - 2003
                                       Summary
  Motivations
               Lexical Information for Semantic Tagging
               Unsupervised Natural Language Learning
               Empirical Estimation for ME bootstrapping
  Weakly Supervised BNC Tagging through Wordnet
               A semantic similarity metric over Wordnet
               Experiments and Results
                    Mapping LDOCE to Wordnet
                    Bootstrapping over an untagged corpus
                     Re-estimation through Wordnet




Sunday, 27 February 2011                                    JHU WORKSHOP - 2003
                           Motivations

   All experiments show that lexical information is crucial
    for semantic tagging, …
   but data sparseness seems to limit the effect of the context

   The contribution of different resources needs to be
      exploited (as in WSD)

   In applications, hand-tagging should be applied in a
    cost-effective way
   Good results also need to scale up to technological
    scenarios where poorer (or no) resources are available

Sunday, 27 February 2011                        JHU WORKSHOP - 2003
                              Motivations (cont’d)
  Wordnet contribution to semantic tagging
          A source of evidence for a larger set of lexicals (unseen words)
          A consistent way to generalize single observations
          (hierarchical) constraints over word uses statistics

  Similarity of word uses suggests semantic similarity:
          Corpus-driven syntactic similarity is one possible choice
          Domain or topical similarity is also relevant

  Semantic similarity in the Wordnet hierarchy suggests useful levels
     of generalization
          Specific hypernyms, i.e. able to separate different senses
          General hypernyms, i.e. help to reduce the number of word classes to model


Sunday, 27 February 2011                                               JHU WORKSHOP - 2003
                    Learning Contextual Evidence

   Each syntactic relation provides a “view” on a word
      usage, i.e. suggests a set of nouns with common
      behaviour(s)

   Semantic similarity among nouns is a model of local
      semantic preference


        to drink {beer, water, …, cocoa/L, stock/L, …}
        The {…, president, director, boy, ace/H, brain/H, …} succeeds




Sunday, 27 February 2011                              JHU WORKSHOP - 2003
             Semantic classes vs. language models

   The role of p( C | v, d)
     e.g.
            p( n | v, d)  ≈  ΣC p( n | C) p( C | v, d)

   Implications

           p(n | C) gives a lexical semantic model that …
           is likely to depend on the corpus and not on the
           individual context

           p(C | v, d) models selectional preferences and …
           provides disambiguation cues for contexts (v d X)

Sunday, 27 February 2011                            JHU WORKSHOP - 2003
             Semantic classes vs. language models

  Lexical evidence: p(n | C) (or also p(C|n) )
  Contextual evidence: p( C | v, d)

  The idea:
     Contextual evidence can be collected from the corpus by
     involving the lexical knowledge base
     The modeling of lexical evidence can be seen as a side effect of
     the context (p(C | n)  p(n|C) )

  Implied approach
     Learn the second as an estimate for the first and then combine
     for bootstrapping to unseen words


Sunday, 27 February 2011                            JHU WORKSHOP - 2003
                           Conceptual Density

  Basic terminology

          Target noun set T (e.g. {beer, water, stock} nouns in relation
          r=VDirobj with a given verb)

          (Branching Factor) Average number m of children of a node s, i.e.
          the average number of children of any node subsumed by s

          (Marks) Set of marks M, i.e. the subset of nouns in T that are
          subsumed within the WN subhierarchy rooted in s. N = |M|

          (Area) area(s), total number of nodes of the subhierarchy rooted at
          s
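
 A sketch of the density computation for a single candidate synset, using NLTK's
 WordNet interface; the formula cd(s) = (Σ over i=0..N−1 of m^i) / area(s) is the
 standard Agirre & Rigau formulation, which the terminology above (branching
 factor m, N marks, area) matches, but it is assumed here rather than taken from
 the workshop code. Requires nltk with the WordNet corpus downloaded
 (nltk.download('wordnet')):

    from nltk.corpus import wordnet as wn

    def subhierarchy(s):                     # all synsets in the tree rooted at s (incl. s)
        return {s} | set(s.closure(lambda x: x.hyponyms()))

    def conceptual_density(s, target_nouns):
        nodes = subhierarchy(s)
        area = len(nodes)
        m = sum(len(x.hyponyms()) for x in nodes) / area   # average branching factor under s
        covered = {w for w in target_nouns
                   if any(syn in nodes for syn in wn.synsets(w, pos=wn.NOUN))}
        n_marks = len(covered)
        return sum(m ** i for i in range(n_marks)) / area, covered

    T = ["horse", "ant", "troop", "division", "elephant"]   # subjects of "to march" (next slide)
    s = wn.synset("placental.n.01")
    cd, covered = conceptual_density(s, T)
    print(round(cd, 3), covered)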


Sunday, 27 February 2011                                 JHU WORKSHOP - 2003
                            Conceptual Density (cont’d)

   [Figure: an example Wordnet sub-hierarchy of numbered synset nodes;
    Word1, Word2 and Word3 attach at the leaves, and candidate covering
    synsets higher in the tree are scored by conceptual density]
Sunday, 27 February 2011                                       JHU WORKSHOP - 2003
                           Using Conceptual Density
  Target noun set T (e.g. subjects of the verb to march):

          horse (6 senses in WN1.6)
          ant (1 sense in WN1.6)
          troop (4 senses in WN1.6)
          division (12 senses in WN1.6)
          elephant (2 senses in WN1.6)

     FIND the smallest set of synsets {s} that "covers" T and maximizes
                CD = Σs cd(r)(s)

     (1) organization organisation :          horse, troops, divisions
     (2) placental placental_mammal ... :     horse, elephant
     (3) animal animate_being … :             horse, elephant, ant
     (4) army_unit :                          troop, division
Sunday, 27 February 2011                                        JHU WORKSHOP - 2003
                                     Summary
  Motivations
        Lexical Information for Semantic Tagging
        Unsupervised Natural Language Learning
        Empirical Estimation for ME bootstrapping

  Weakly Supervised BNC Tagging through Wordnet
        A semantic similarity metric over Wordnet

               Experiments and Results
                    Mapping LDOCE to Wordnet

                    Bootstrapping over an untagged corpus

                    Re-estimation through Wordnet


Sunday, 27 February 2011                                    JHU WORKSHOP - 2003
                   Results: Mapping LDOCE classes
 Lexical Entries in LDOCE are defined in terms of a Semantic Class
    and a topical tag (Subject Codes), e.g. stock ('L','FO')

 The semantic similarity metric has been used to derive the WN synset(s)
    that represent <SemClass,SubjCode> pairs
       A WN explanation of the lexical entries in an LM class (lexical mapping)
       The position(s) in the WN noun hierarchy of each LM class (category mapping)

 Semantic preference of synsets given words, LM classes (and Subject
    Codes) can be mapped into probabilities, e.g.
                      p( WN_syns | n LM_class )
      and then
    p(LM_class | n WN_syns ), p(LM_class | n), p(LM_class | WN_syns )

Sunday, 27 February 2011                                      JHU WORKSHOP - 2003
                      Mapping LDOCE classes (cont’d)


   Example Cluster: 2---EDZI
         '2'        'Abstract and solid'
         'ED'-'ZI'  'education – institutions, academic name of'

        T = {nursery_school, polytechnic, school, seminary, senate}

           Synset "school",
           cd = 0.580, coverage: 60%
           Synset "educational_institution",
           cd = 0.527, coverage: 80%
           Synset "gathering assemblage",
           cd = 0.028, coverage: 40%


Sunday, 27 February 2011                                     JHU WORKSHOP - 2003
                Case study: the word stock in LDOCE
   stock T   a supply (of something) for use
   stock J   goods for sale
   stock N   the thick part of a tree trunk
   stock A   a group of animals used for breeding
   stock A   farm animals usu . cattle ; LIVESTOCK
   stock T   a family line , esp . of the stated character
   stock T   money lent to a government at a fixed rate of interest
   stock T   the money (CAPITAL) owned by a company, divided into SHAREs
   stock P   a type of garden flower with a sweet smell
   stock L   a liquid made from the juices of meat, bones , etc . , used in cooking
   stock J   (in former times) a stiff cloth worn by men round the neck of a shirt
             compare TIE
   stock N   a piece of wood used as a support or handle, as for a gun or tool
   stock N   the piece which goes across the top of an ANCHOR_1_1 from side
             to side
   stock P   a plant from which CUTTINGs are grown
   stock P   a stem onto which another plant is GRAFTed

Sunday, 27 February 2011                                           JHU WORKSHOP - 2003
                  Case study: stock as Animal (A)
   stock A a group of animals used for breeding
   stock A farm animals usu . cattle ; LIVESTOCK

    Word w        Wordnet Synset s                             P( s | w LDClass)

      stock       'breed strain stock variety '                      0.63

      stock       'placental placental_mammal eutherian              0.18
                  eutherian_mammal '

      stock       'animal animate_being beast brute creature         0.10
                  fauna '

      stock       'organism being '                                  0.06

      stock       'object physical_object '                          0.02



Sunday, 27 February 2011                                       JHU WORKSHOP - 2003
                           Case Study: stock (N - P)

     stock N a piece of wood used as a support or handle , as for a gun or tool
     stock N the piece which goes across the top of an ANCHOR_1_1 from side to side
     stock N the thick part of a tree trunk

      Word w       Wordnet Synset s                                  P( s | w LDClass)
       stock       'artifact artefact '                                    0.991
       stock       'object physical_object '                               0.008
       stock       'social_group '                                       0.000001

     stock P a plant from which CUTTINGs are grown
     stock P a stem onto which another plant is GRAFTed
     stock P a type of garden flower with a sweet smell

      Word w       Wordnet Synset s                                  P( s | w LDClass)
       stock       'plant flora plant_life '                                0.595
       stock       'plant_organ '                                           0.219
       stock       'organism being '                                        0.185

Sunday, 27 February 2011                                           JHU WORKSHOP - 2003
                           LM Category Mapping

Longman Class C             prob( s | C)                      Wordnet syns s
       'H'                     0.764       person individual someone somebody mortal human soul

       'H'                     0.036                      'object physical_object '
       'H'                     0.016                         'organism being '


       'I'                     0.237                        'substance matter '
       'I'                     0.139                         'artifact artefact '
       'I'                     0.109                   'cognition knowledge noesis '
       'I'                     0.105               'happening occurrence natural_event '


       'F'                     0.468       person individual someone somebody mortal human soul

       'F'                     0.285                      'woman adult_female '
       'F'                     0.019                      'object physical_object '



Sunday, 27 February 2011                                           JHU WORKSHOP - 2003
                 Results: A Simple (Unsupervised) Tagger
        Estimate over the parsed corpus+Wordnet and by mapping into
        LD categories, the following quantities:
                 P( C | hw r),   P( C | r),   P( C | hw)

          (r ranges over: SubjV, DirObj,       N_P_hw, hw_P_N)

        Apply a simple Bayesian model to any incoming contexts
                                 <hw r1, …, rk>

         and Select argmaxC( p(C| hw) p(C| r1) … p(C | rk))

          (OBS: p(C | rj) is the back-off of p(C | hw rj))
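
 A sketch of the selection rule above; the probability tables are invented
 (stock/L as the direct object of "to drink" follows the earlier example):

    # Simple Bayesian tagger sketch: pick argmax_C p(C | hw) * p(C | r1) * ... * p(C | rk),
    # where p(C | r) backs off p(C | hw, r).  All probability tables below are invented.
    p_c_given_hw  = {"stock": {"T": 0.45, "A": 0.15, "L": 0.15, "N": 0.15, "P": 0.10}}
    p_c_given_rel = {"DirObj:drink": {"L": 0.70, "T": 0.20, "A": 0.05, "N": 0.03, "P": 0.02}}

    def tag(hw, relations):
        scores = {}
        for c, p in p_c_given_hw[hw].items():
            score = p
            for r in relations:
                score *= p_c_given_rel.get(r, {}).get(c, 1e-6)   # small floor for unseen pairs
            scores[c] = score
        return max(scores, key=scores.get)

    print(tag("stock", ["DirObj:drink"]))    # -> "L"  (stock as a cooking liquid)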

Sunday, 27 February 2011                                     JHU WORKSHOP - 2003
                      Unsupervised Tagger: Evaluation
                            Hard_Set         Hard_Set       Held-out            Blind
                            (Training)


Sup ME Baselines                 -             55%            80%                90%

Pick 1st Longman           76.86 % (82.19)       _           69.77%          69.17% (75.86)

Pick 1st WN                69.50 % (78.34)       -             64.4          62.76% (73.01)

SimpleTagger                  81,24%          54,08%         70,25%             75,23%

SimpleTagger                  86,36%          67.02%         78,24%             81,57%
(ED)



              10% Coverage of the Training Corpus (2K vs. 20K words)
              less than 50% Coverage of the Test Set (1.2K vs. 2.8K)


Sunday, 27 February 2011                                              JHU WORKSHOP - 2003
                Results: Re-estimate probs for a ME model

   Use sentences in training data for learning lexical and
      contextual preferences of nouns and relations
   Use lexical preferences to pre-estimate the empirical
      distributions over unseen data (see constraints Q(c,w,R)
      in Fred's section)
   Train the ME model over all available data

   Tag held-out and blind data




Sunday, 27 February 2011                           JHU WORKSHOP - 2003
                                         Results

     Features: All syntactic Features     Features: All syntactic Features
     ME Tra: Training Data                ME Tra: Training Data+WN Held Out
     ME Test: Held-out
                                           ME Test: Held-out
     Result: 78-79%
                                           Result: 79-80%
     Features: Only head words
     ME Tra: Training Data                Features: Only head words
     ME Test: Held-out                    ME Tra: Training Data+WN Held Out
     Result: 80.76%
                                           ME Test: Held-out
                                           Result: 81.80%

                                           Features: All synt Features
                                           ME Tra: Training Data+WN Held Out
                                           ME Test: Blind Data
                                           Result: 86,03%

Sunday, 27 February 2011                                       JHU WORKSHOP - 2003
                             Conclusions

  A robust parameter estimation method for semantic tagging
         Less prone to sparse data
         Generalize to meaningful noun classes
         Develop “lexicalized” contextual cues and a semantic dictionary

  A natural and viable way to integrate corpus-driven
    evidence with a general-purpose lexicon
  Results are consistent with fully supervised methods

  Open perspectives for effective estimation of unseen
    empirical distributions


Sunday, 27 February 2011                               JHU WORKSHOP - 2003
                           Open Issue

   Estimate contextual and lexical probabilities from the
      28M portion of the BNC (already parsed here)

   Alternative formulations of similarity metrics

   Experiment with a bootstrapping method by imposing the
      proposed estimates (i.e. p(C | w, SubjV)) as constraints
      on Q(C, w, SubjV)

   Manually assess and measure the automatically derived
      Longman-Wordnet mapping


Sunday, 27 February 2011                        JHU WORKSHOP - 2003
                                 Summary Slide

     IR-inspired approaches (Kalina)

     Evaluation (Kris)

     Supervised methods using maximum entropy
           Incorporating context preferences (Jerry)

           Adjective Classes and Subject markings (David)

           Structuring the context using syntax and semantics (Cassia, Fabio)

     Re-estimation techniques for Maximum Entropy Experiments (Fred)

     Unsupervised Re-estimation (Roberto)




Sunday, 27 February 2011                                               JHU WORKSHOP - 2003
                           Our Accomplishments

   Developed a method for bootstrapping using maximum
      entropy

   More than 300 experiments with features
   Integrated dictionary and syntactic information
   Integrated dictionary, Wordnet, syntactic and topic
      information in experiments which gave us
      significant improvements
   Developed a system for unsupervised tagging


Sunday, 27 February 2011                         JHU WORKSHOP - 2003
                            Lessons learned

   Semantic Tagging has an intermediate complexity
    between the rather successful NE recognition and Word
    Sense Disambiguation
   Semantic tagging over BNC is viable with high accuracy

   Accuracy reached by most of the proposed methods:
      94%
   This task stimulates cross-fertilization between statistical
      and symbolic knowledge grounded on solid linguistic
      principles and resources

                           … is this the end of the story?
Sunday, 27 February 2011                             JHU WORKSHOP - 2003
              NO! The near future at a glance …

   Availability of semantic information for head nouns is
      critical to a variety of linguistic tasks
           IR and CLIR, Information Extraction and Question Answering
           Machine Translation and Language Modeling
   Annotated resources can provide a significant stimulus to
      machine learning of linguistic patterns (e.g. QA answer
      structures)
   Open possibilities for corpus-driven learning of other
      semantic phenomena (e.g. verb argument structures) and
      incremental learning methods
Sunday, 27 February 2011                               JHU WORKSHOP - 2003
                       … and a quick look further

   Unseen phenomena still represent hard cases for any probabilistic
      model (rare vs. impossible labels for unseen/novel words)
   Integration of external resources is problematic
           Projecting observed empirical distributions may lead to overfitting the data
           Lexical information (e.g. Wordnet) does not have a clear probabilistic
           interpretation
           Soft Features (Jia Cui) seem a promising model

   Better use of the context:
           Design and derivation of class-based contextual features (David Guthrie)

   Existing lexical resources provide large scale and effective
      information for bootstrapping

Sunday, 27 February 2011                                          JHU WORKSHOP - 2003
                           A Final thought

   Thanks to the Johns Hopkins faculty and staff for their
      availability and helpfulness during the workshop.

   Special thanks to Fred Jelinek for answering endless
      questions about maximum entropy and helping to model
      our problem.




Sunday, 27 February 2011                        JHU WORKSHOP - 2003

				