
Unsupervised Disambiguation

					                          Syntax
 • The study of how words are ordered and
   grouped together
 • Key concept: constituent = a sequence of
   words that acts as a unit


      { he                               }        { home           }
      { the man                          }  went  { to his house   }
      { the short man                    }        { out of the car }
      { the short man with the large hat }        { with her       }
      Phrase Structure
   S
   ├─ NP
   │   └─ PN    She
   └─ VP
       ├─ VBD   saw
       ├─ NP    a tall man
       └─ PP
           ├─ PRP   with
           └─ NP    a telescope
                       Noun Phrases
• Contains a noun plus descriptors, including:
       –   Determiner: the, a, this, that
       –   Adjective phrases: green, very tall
       –   Head: the main noun in the phrase
       –   Post-modifiers: prepositional phrases or relative
           clauses

 det   adj  adj    head   PP        relative clause
 That  old  green  couch  of yours  that I want to throw out
                     Verb Phrases
• Contains a verb (the head) with modifiers
  and other elements that depend on the verb
    head  PP
    want  to throw out

    adv         head  direct object        PP
    previously  saw   the man in the park  with her telescope

    modal  aux   head    indirect object  DObj      adverb
    might  have  showed  his boss         the code  yesterday
       Prepositional Phrases
• Preposition as head and NP as complement
                head   complement

                with her grey poodle


           Adjective Phrases
• Adjective as head with modifiers
        adv     head   relative clause

      extremely sure that he would win
              Shallow Parsing
•   Extract phrases from text as “chunks”
•   Flat, no tree structures
•   Usually based on patterns of POS tags
•   Full parsing can be conceived of as two steps:
    – Chunking / shallow parsing
    – Attachment of chunks to each other
              Noun Phrases
• Base Noun Phrase: A noun phrase that
  does not contain other noun phrases as a
  component
• Or, no modification to the right of the head
     a large green cow
     The United States Government
     every poor shop-owner’s dream     ?
     other methods and techniques      ?
        Manual Methodology
• Build a regular expression over POS tags
• E.g:
  DT? (ADJ | VBG)* (NN)+

• Very hard to do accurately
• Lots of manual labor
• Cannot be easily tuned to a specific corpus
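The manual approach can be sketched in Python as follows. This is only an illustration: the function name `chunk_base_nps`, the space-delimited encoding of the tag sequence, and the choice to let `NN\w*` match tag variants such as NN0 and NN1 are assumptions, not part of the original slides.

```python
import re

# Slide pattern DT? (ADJ | VBG)* (NN)+ written over space-terminated tags.
# (?<!\S) forces a match to begin at the start of a tag (assumption).
NP_PATTERN = re.compile(r"(?<!\S)(?:DT )?(?:(?:ADJ|VBG) )*(?:NN\w* )+")

def chunk_base_nps(tagged):
    """Return (start, end) token spans, end exclusive, of base NPs.

    `tagged` is a list of (word, pos) pairs.
    """
    tags = "".join(pos + " " for _, pos in tagged)
    # Map character offsets in `tags` back to token indices.
    token_at = {}
    offset = 0
    for i, (_, pos) in enumerate(tagged):
        token_at[offset] = i
        offset += len(pos) + 1
    token_at[offset] = len(tagged)
    return [(token_at[m.start()], token_at[m.end()])
            for m in NP_PATTERN.finditer(tags)]

sentence = [("the", "DT"), ("tall", "ADJ"), ("man", "NN1"), ("ran", "VBD"),
            ("with", "PRP"), ("blinding", "VBG"), ("speed", "NN0")]
spans = chunk_base_nps(sentence)
```

Even this small pattern shows why the approach is brittle: every tagset variation or new construction needs another hand-written alternative.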
             Chunk Tags
• Represent NPs by tags:
   [the  tall  man]  ran  with  [blinding  speed]
    DT   ADJ   NN1   VBD  PRP    VBG       NN0
    I    I     I     O    O      I         I
• Need B tag for adjacent NPs:
   On  [Tuesday]  [the  company]  went  bankrupt
   O    I          B    I         O     O
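A small Python sketch of this encoding, assuming the bracketed sentence is given as a token list in which "[" and "]" are separate tokens. The function name `chunk_tags` is hypothetical.

```python
def chunk_tags(tokens):
    """Convert a bracketed token list into I/O/B chunk tags.

    Words inside brackets get I, words outside get O, and the first word
    of an NP that directly follows another NP gets B.
    """
    tags = []
    inside = False        # currently inside an NP
    first = False         # next word is the first word of its NP
    prev_closed = False   # the previous word ended an NP
    for tok in tokens:
        if tok == "[":
            inside, first = True, True
        elif tok == "]":
            inside, prev_closed = False, True
        else:
            if inside:
                tags.append("B" if first and prev_closed else "I")
                first = False
            else:
                tags.append("O")
            prev_closed = False
    return tags

example_1 = chunk_tags("[ the tall man ] ran with [ blinding speed ]".split())
example_2 = chunk_tags("On [ Tuesday ] [ the company ] went bankrupt".split())
```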
     Transformational Learning
• Baseline tagger:
   – Most frequent chunk tag for POS or word
• Rule templates (100 total):
   Word/POS templates                     Chunk-tag (ctag) templates
   current word/POS                       current ctag
   word/POS 1 on left/right               current and left ctag
   current and left/right word/POS        current and right ctag
   word/POS on left and on right          in two ctags to left
   in two words/POSs on left/right        in two ctags to right
   in three words/POSs on left/right
         Some Rules Learned
(T = chunk tag, P = POS tag, W = word; the index is the offset from the current token)
1.   (T1=O, P0=JJ)             I → O
2.   (T-2=I, T-1=I, P0=DT)       → B
3.   (T-2=O, T-1=I, P-1=DT)      → I
4.   (T-1=I, P0=WDT)           I → B
5.   (T-1=I, P0=PRP)           I → B
6.   (T-1=I, W0=who)           I → B
7.   (T-1=I, P0=CC, P1=NN)     O → I
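To make the rule notation concrete, here is a hedged Python sketch of applying an ordered list of learned rules to a baseline tagging. The rule encoding (a context dictionary keyed by feature kind and offset, an optional old tag, and a new tag) and the choice to test each rule against a snapshot of the tags taken before that rule fires are assumptions made for illustration.

```python
def feature(seq, j):
    """Value at position j, or None when j falls outside the sentence."""
    return seq[j] if 0 <= j < len(seq) else None

def apply_rules(words, pos, tags, rules):
    """Apply transformation rules in order and return the new chunk tags.

    Each rule is (context, old, new); context maps (kind, offset) to a
    required value, with kind in {"W", "P", "T"}. old=None matches any tag.
    """
    tags = list(tags)
    for context, old, new in rules:
        snapshot = list(tags)  # assumption: contexts see tags before this rule
        feats = {"W": words, "P": pos, "T": snapshot}
        for i in range(len(tags)):
            if old is not None and snapshot[i] != old:
                continue
            if all(feature(feats[kind], i + off) == val
                   for (kind, off), val in context.items()):
                tags[i] = new
    return tags

# Rules 1 and 2 from the slide.
rules = [
    ({("T", 1): "O", ("P", 0): "JJ"}, "I", "O"),
    ({("T", -2): "I", ("T", -1): "I", ("P", 0): "DT"}, None, "B"),
]
pos = ["DT", "NN", "DT", "NN", "JJ", "VBD"]
words = ["w"] * len(pos)           # placeholder words; the rules here use only P and T
baseline = ["I", "I", "I", "I", "I", "O"]
result = apply_rules(words, pos, baseline, rules)
```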
                    Results
 Training         Prec.      Recall        Tag Acc.
 Baseline         78.2       81.9          94.5
 50K              89.8       90.4          96.9
 100K             91.3       91.8          97.2
 200K             91.8       92.3          97.4
 200K nolex       90.5       90.7          97.0
 950K             93.1       93.5          97.8

• Precision = fraction of NPs predicted that are correct
• Recall = fraction of actual NPs that are found
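The two definitions can be computed directly from chunk spans. A minimal sketch, assuming each NP is represented as a (start, end) token span and that a predicted NP counts as correct only on an exact span match; the spans below are made up for illustration.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted NP spans against gold NP spans."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical spans: one of three predictions matches one of two gold NPs.
p, r = precision_recall({(0, 3), (5, 7), (8, 9)}, {(0, 3), (5, 6)})
```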
     Memory-Based Learning
• Match test data to previously seen data and
  classify based on the most similar
  previously seen instances
• E.g.:

   Test instance:     boy saw the
   Seen instances:    boy saw three
                      she saw the
                      the saw was
                      boy ate the
     k-Nearest Neighbor (kNN)
• Find k most similar training examples
• Let them “vote” on the correct class for the
  test example
  – Weight neighbors by distance from test
• Main problem: defining “similar”
  – Shallow parsing – overlap of words and POS
  – Use feature weighting...
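A minimal Python sketch of kNN with a weighted-overlap similarity. The class labels attached to the memory instances and the feature weights are hypothetical, chosen only to make the example run; votes are weighted by similarity as a stand-in for distance weighting.

```python
from collections import Counter

def weighted_overlap(a, b, weights):
    """Sum of the weights of the features on which a and b agree."""
    return sum(w for x, y, w in zip(a, b, weights) if x == y)

def knn_classify(instance, memory, weights, k=3):
    """Let the k most similar stored examples vote, weighted by similarity."""
    ranked = sorted(memory,
                    key=lambda ex: weighted_overlap(instance, ex[0], weights),
                    reverse=True)
    votes = Counter()
    for feats, label in ranked[:k]:
        votes[label] += weighted_overlap(instance, feats, weights)
    return votes.most_common(1)[0][0]

# Memory from the earlier slide; labels (tag of the middle word) are assumed.
memory = [
    (("boy", "saw", "three"), "VBD"),
    (("she", "saw", "the"), "VBD"),
    (("the", "saw", "was"), "NN"),
    (("boy", "ate", "the"), "VBD"),
]
weights = [1.0, 2.0, 1.0]  # hypothetical: the middle word counts double
label = knn_classify(("boy", "saw", "the"), memory, weights, k=3)
```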
                Information Gain
• Not all features are created equal (e.g. saw
  in previous example is more important)
• Weight the features by information gain
   = how much does f distinguish different classes

  \[
    w(f_i) = \frac{H(C) - \sum_{v \in V(f_i)} P(f_i = v)\, H(C \mid f_i = v)}{H(V(f_i))}
  \]

  \[
    H(X) = -\sum_{x \in X} P(x) \log_2 P(x)
  \]
[Figure: classes C1, C2, C3, C4, contrasting a feature with low information gain and a feature with high information gain]
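The feature weight above (information gain normalized by the entropy of the feature's values) can be estimated from counts. A hedged Python sketch, assuming each training instance contributes one feature value and one class label; the function names are illustrative.

```python
import math
from collections import Counter, defaultdict

def entropy(items):
    """H(X) = -sum_x P(x) log2 P(x), estimated from a list of outcomes."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(items).values())

def feature_weight(values, classes):
    """w(f) = (H(C) - sum_v P(f=v) H(C | f=v)) / H(V(f))."""
    by_value = defaultdict(list)
    for v, c in zip(values, classes):
        by_value[v].append(c)
    n = len(classes)
    conditional = sum(len(cs) / n * entropy(cs) for cs in by_value.values())
    split = entropy(values)
    # A feature with a single value cannot distinguish classes.
    return (entropy(classes) - conditional) / split if split > 0 else 0.0

classes = ["x", "x", "y", "y"]
informative = feature_weight(["a", "a", "b", "b"], classes)    # predicts the class
uninformative = feature_weight(["a", "b", "a", "b"], classes)  # independent of it
```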
           Base Verb Phrase
• Verb phrase not including NPs or PPs

[NP Pierre Vinken NP] , [NP 61 years NP] old ,
[VP will soon be joining VP] [NP the board NP]
as [NP a nonexecutive director NP] .
                   Results
• Context:
  2 words and POS on left and 1 word and POS on right

      Task   Context      Prec.   Recall   Acc.
      bNP    curr. word   76      80       93
      bNP    curr. POS    80      82       95
      bNP    2–1          94      94       98
      bVP    curr. word   68      73       96
      bVP    curr. POS    75      89       97
      bVP    2–1          94      96       99
          Efficiency of MBL
• Finding the neighbors can be costly
• Possibility:
  Build a decision tree, ordered by the information gain of the
   features, to index the data = approximate kNN

                         W0
               saw       the       boy
                |         |         |
               W-1       P-1       P-2
                      MBSL
• Memory-based technique relying on
  sequential nature of the data
  – Use “tiles” of phrases in memory to “cover” a
    new candidate (and context), and compute a
    tiling score
                       ADJ    NN1    NP]
        PRP  [NP  DT   ADJ
             [NP  DT   ADJ    NN1
        PRP  [NP  DT          NN1    NP]  PRP
 VBD    PRP  [[   DT   ADJ    NN1    ]]   PRP  NN1
 went   to        the  white  house       for  dinner
                    Tile Evidence
• Memory:
      [NP DT NN1 NP] VBD [NP DT NN1 NN1 NP] [NP NN2 NP] .
      [NP ADJ NN2 NP] AUX VBG PRP [NP DT ADJ NN1 NP] .

• Some tiles:
       [NP DT                  pos=3   neg=0
       [NP DT NN1              pos=2   neg=0
       DT NN1 NP]              pos=1   neg=1
       NN1 NP]                 pos=3   neg=1
       NN1 NP] VBD             pos=1   neg=0

• Score tile t by ft(t) = pos / total, where total = pos + neg
    Only keep tiles that pass a threshold ft(t) > θ
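The counts in the tile list can be reproduced with a short Python sketch. It assumes one particular reading of the evidence: pos is the number of times the tile occurs in the bracketed memory exactly as written, total is the number of times its POS sequence occurs once brackets are ignored, and neg = total − pos. The function names are hypothetical.

```python
BRACKETS = {"[NP", "NP]"}

def count_occurrences(needle, haystack):
    """Number of positions where `needle` occurs contiguously in `haystack`."""
    n = len(needle)
    return sum(1 for i in range(len(haystack) - n + 1)
               if haystack[i:i + n] == needle)

def tile_evidence(tile, memory):
    """Return (pos, neg) for a tile given bracketed memory sentences."""
    tile = tile.split()
    bare_tile = [t for t in tile if t not in BRACKETS]
    pos = total = 0
    for sentence in memory:
        tokens = sentence.split()
        bare = [t for t in tokens if t not in BRACKETS]
        pos += count_occurrences(tile, tokens)        # brackets must agree
        total += count_occurrences(bare_tile, bare)   # POS sequence only
    return pos, total - pos

def tile_score(tile, memory):
    pos, neg = tile_evidence(tile, memory)
    return pos / (pos + neg) if pos + neg else 0.0

memory = [
    "[NP DT NN1 NP] VBD [NP DT NN1 NN1 NP] [NP NN2 NP] .",
    "[NP ADJ NN2 NP] AUX VBG PRP [NP DT ADJ NN1 NP] .",
]
```

Under this reading, `tile_score("NN1 NP]", memory)` is 3 / 4 = 0.75, so that tile would survive a threshold of 0.6 but not one of 0.8.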
                           Covers
• Tile t1 connects to t2 in a candidate if:
   – t2 starts after t1
   – there is no gap between them (may be overlap)
   – t2 ends after t1
• A sequence of tiles covers a candidate if
   – each tile connects to the next
   – the tiles collectively match the entire candidate, including
     brackets and maybe some context

             [NP  DT   ADJ
        PRP  [NP  DT          NN1    NP]  PRP
 VBD    PRP  [[   DT   ADJ    NN1    ]]   PRP  NN1
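The connection and cover conditions translate into simple span checks. In this sketch a tile is represented only by the (start, end) positions it matches in the candidate, with end exclusive; that representation, the position numbering, and the function names are assumptions for illustration.

```python
def connects(t1, t2):
    """t2 starts after t1, leaves no gap, and ends after t1."""
    (s1, e1), (s2, e2) = t1, t2
    return s1 < s2 <= e1 and e2 > e1

def is_cover(tiles, left_bracket, right_bracket):
    """True if consecutive tiles connect and together span both brackets."""
    if not tiles:
        return False
    chained = all(connects(a, b) for a, b in zip(tiles, tiles[1:]))
    spans_candidate = (tiles[0][0] <= left_bracket
                       and tiles[-1][1] > right_bracket)
    return chained and spans_candidate

# Positions: VBD=0 PRP=1 [[=2 DT=3 ADJ=4 NN1=5 ]]=6 PRP=7 NN1=8
np_dt_adj = (2, 5)        # [NP DT ADJ
adj_nn1_np = (4, 7)       # ADJ NN1 NP]
prp_np_dt = (1, 4)        # PRP [NP DT
nn1_np_prp = (5, 8)       # NN1 NP] PRP
np_dt_adj_nn1 = (2, 6)    # [NP DT ADJ NN1
```

Here `prp_np_dt` followed directly by `nn1_np_prp` is not a cover because ADJ is left uncovered, while inserting `np_dt_adj_nn1` between them closes the gap.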
                   Cover Graph
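[Figure: cover graph. Paths run from START to END through tile nodes;
 an edge joins each tile to the tiles it connects to.]

 START                                               END
                       ADJ    NN1    NP]
        PRP  [NP  DT   ADJ
             [NP  DT   ADJ    NN1
        PRP  [NP  DT          NN1    NP]  PRP
 VBD    PRP  [[   DT   ADJ    NN1    ]]   PRP  NN1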
        Measures of “Goodness”
•   Number of different covers
•   Size of smallest cover (fewest tiles)
•   Maximum context in any cover (left + right)
•   Maximum overlap of tiles in any cover
•   Grand total positive evidence divided by
    grand total positive+negative evidence

Combine these measures by linear weighting
           Scoring a Candidate
CandidateScore(candidate, T)
•  G  CoverGraph(candidate, T)
•  Compute statistics by DFS on G
•  Compute candidate score as linear          function
   of statistics

Complexity (O(l) tiles in candidate of length l):
   –   Creating the cover graph is O(l2)
   –   DFS is O(V+E)=O(l2)
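A hedged Python sketch of this procedure, again treating each tile as a (start, end) span with end exclusive. It computes only two of the statistics (number of covers and size of the smallest cover) with a memoized DFS, so each node is expanded once; the weights in `candidate_score` are placeholders, not values from the slides.

```python
from functools import lru_cache

def connects(t1, t2):
    (s1, e1), (s2, e2) = t1, t2
    return s1 < s2 <= e1 and e2 > e1

def cover_graph(tiles, left, right):
    """Adjacency lists over tiles plus START and END nodes; O(n^2) pairs."""
    graph = {"START": [t for t in tiles if t[0] <= left], "END": []}
    for t in tiles:
        graph[t] = [u for u in tiles if connects(t, u)]
        if t[1] > right:          # tile reaches past the closing bracket
            graph[t].append("END")
    return graph

def cover_statistics(tiles, left, right):
    """Return (number of covers, fewest tiles in any cover)."""
    graph = cover_graph(tiles, left, right)

    @lru_cache(maxsize=None)
    def visit(node):
        if node == "END":
            return 1, 0
        cost = 0 if node == "START" else 1   # START is not a tile
        count, smallest = 0, float("inf")
        for nxt in graph[node]:
            c, s = visit(nxt)
            count += c
            smallest = min(smallest, s + cost)
        return count, smallest

    return visit("START")

def candidate_score(stats, weights=(0.1, -0.5)):
    """Linear combination of the statistics; weights are hypothetical."""
    return sum(w * s for w, s in zip(weights, stats))

# Tiles from the cover-graph slide, as spans over
# VBD=0 PRP=1 [[=2 DT=3 ADJ=4 NN1=5 ]]=6 PRP=7 NN1=8
tiles = [(4, 7), (1, 5), (2, 6), (1, 4), (5, 8)]
stats = cover_statistics(tiles, left=2, right=6)
```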
                Full Algorithm
MBSL(sent, C, T)
1. For each subsequence of sent, do:
   1. Construct a candidate s by adding brackets [[ and ]]
      before and after the subsequence
   2. fC(s) ← CandidateScore(s, T)
   3. If fC(s) > C, then add s to candidate-set
2. For each c in candidate-set in decreasing order
   of fC(c), do:
   1. Remove all candidates overlapping with c from
      candidate-set
3. Return candidate-set as target instances
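The outer loop can be sketched in Python as below. The scoring function is passed in as a parameter so the sketch stays self-contained; the scores used in the example are invented. Keeping a candidate only if it does not overlap a higher-scoring one already kept is equivalent to step 2's removal of overlapping candidates.

```python
def mbsl(sent, threshold, score):
    """Greedy selection of non-overlapping, high-scoring subsequences.

    sent: list of POS tags; score(start, end) scores sent[start:end]
    (a stand-in for CandidateScore). Returns sorted (start, end) spans.
    """
    candidates = []
    n = len(sent)
    for start in range(n):
        for end in range(start + 1, n + 1):
            s = score(start, end)
            if s > threshold:
                candidates.append((s, start, end))
    candidates.sort(key=lambda c: c[0], reverse=True)
    selected = []
    for s, start, end in candidates:
        # Keep the candidate only if it overlaps nothing already kept.
        if all(end <= a or start >= b for _, a, b in selected):
            selected.append((s, start, end))
    return sorted((a, b) for _, a, b in selected)

sent = ["VBD", "PRP", "DT", "ADJ", "NN1", "PRP", "NN1"]
# Hypothetical candidate scores for illustration only.
scores = {(2, 5): 0.9, (3, 5): 0.7, (6, 7): 0.8, (4, 7): 0.65}
chunks = mbsl(sent, 0.6, lambda s, e: scores.get((s, e), 0.0))
```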
            Results
Target Type   Context size   T     Prec.   Recall
NP            3              0.6   92      92
SV            3              0.6   89      85
VO            2              0.5   77      90

				