Docstoc

Unsupervised Disambiguation.ppt

Document Sample
Unsupervised Disambiguation.ppt Powered By Docstoc
					                    Bootstrapping
•    Goals:
    –    Utilize a minimal amount of (initial) supervision
    –    Obtain learning from many unlabeled examples (vs.
         selective sampling)
•    General scheme:
    1.   Initial supervision – seed examples for training an
         initial model or seed model itself (indicative features)
    2.   Classify corpus with seed model
    3.   Add most confident classifications to training and
         iterate.
•    Exploits feature redundancy in examples
    Bootstrapping Decision List for
     Word Sense Disambiguation
• Yarowsky (1995) scheme
• Relies on “one sense per collocation”
1. Initialize seed model with all collocations in the sense
   dictionary definition – an initial decision list for each
   sense.
   – Alternatively – train from seed examples
2. Classify all examples with current model
   – Any example containing a feature for some sense
3. Compute odds ratio for each feature-sense combination; if
   above threshold then add the feature to the decision list
4. Optional: add the “one sense per discourse constraint” to
   add/filter examples
5. Iterate (2)
“One Sense per Collocation”
                        Results
• Applies the “one sense per discourse” constraint after final
  classification – classify all word occurrences in a
  document by the majority sense
• Evaluation on ~37,000 examples:
    Co-training for Name Classification
•     Collins and Singer, EMNLP 1999
•     A bootstrapping approach that relies on a
      (natural) partition of the feature space to two
      different sub-spaces
•     The bootstrapping process iterates by switching
      sub-space in each iteration
•     Name classification task: person, organization,
      location
•     Features for decision list:
     1. Spelling features (full, contains, cap, non-alpha)
     2. Context features (words, syntactic type)
                Seed Features




• Initialized with 0.9999 score for the decision list
         DL-CoTrain Algorithm
1. Initialize n=5 (max. #rules learned in an iteration)
2. Initialize spelling decision list with seed
3. Classify corpus by spelling rules (where applicable)
4. Induce contextual rule list from classified
   examples; keep at most the n most frequent rules
   whose score>θ for each class
5. Classify corpus by contextual rules
6. Induce spelling rules, as in step 4, and add to list
7. If n<2500 then n n+5, and return to step 3.
   Otherwise, classify corpus by combined list and
   induce a final classifier from these labeled examples
Co-Train vs. Yarowsky’s Algorithm
• Collins & Singer implemented a cautious
  version of Yarowsky’s algorithm, where n
  rules are added at each iteration
• Same accuracy (~91%) (as well as a
  boosting-based bootstrapping algorithm)
• Abney (ACL 2002) provides some analysis
  for both algorithm, showing that they are
  based on different independence assmptions
     Clustering and Unsupervised
           Disambiguation
• Word sense disambiguation without labeled
  training or other information sources
• Cannot label to predefined senses (there are
  none), so try to find “natural” senses
• Use clustering methods to divide different
  contexts of a word into “sensible” classes
• Other applications of clustering:
  – Thesaurus construction, document clustering
  – Forming word classes for generalization in
    language modeling and disambiguation
Clustering
       Clustering Techniques
• Hierarchical:
  – Bottom-up (agglomerative)
  – Top-down (divisive)
• Flat (non-hierarchical)
  – Usually iterative
  – Soft (vs. hard) clustering
     • “Degree” of membership
                 Comparison
Hierarchical:                Non-hierarchical:
• Preferable for detailed    • Preferable if efficiency
  data analysis                is important or lots of
• More information             data
  provided                   • K-means is the
• No clearly preferred         simplest method and
  algorithm                    often good enough
• Less efficient (at least   • If no Euclidean space,
  O(n2))                       can use EM instead
            Hierarchical Clustering




and   but    in   on   with   for   at   from   of   to   as
           Similarity Functions
• sim(c1,c2) = similarity between clusters
   – Defined over pairs of individual elements, and
     (possibly inductively) over cluster pairs
• Values typically between 0 and 1
For hierarchical clustering, require:
• Monotonicity:
  For all possible clusters,
   min[sim(c1,c2),sim(c1,c3)]  sim(c1,c2  c3)
  Merging two clusters cannot increase similarity!
              Bottom-Up Clustering
Input: Set X = {x1,…,xn} of objects
       Similarity function sim: 2X  2X
for i 1 to n do:
  ci {xi}
C {ci, ..., cn}
j n + 1
while |C| > 1 do:
  (ca, cb)  arg max(cu,cv) sim(cu,cv)
  cj ca  cb; Trace merge to construct cluster hierarchy
  C  (C \ {ca,cb})  {cj}
  j++
 Types of Cluster Pair Similarity
single link     similarity of most similar
                (nearest) members
complete link   similarity of most dissimilar
                (remote) members
                - property of unified cluster
group average average pairwise similarity of
              members in unified cluster
                - property of unified cluster
Single Link Clustering (“chaining” risk)
        Complete Link Clustering
• “Global view” yields tighter clusters – usually
more suitable in NLP
      Efficiency of Merge Step
• Maintain pairwise similarities between
  clusters, and maximal similarity per cluster:
• Single link update - O(n):
  – sim(c1  c2, c3) = max(sim(c1,c3),sim(c2,c3))
• Complete link update - O(n log(n)):
  – Have to compare all pairwise similarities with
    all points in merged cluster to find minimum
       Group Average Clustering
  • A “mid-point” between single and complete link –
    produces relatively tight clusters
  • Efficiency depends on choice of similarity metric
  • Computing average similarity for a newly formed
    cluster from scratch is O(n2)
     – Goal – incremental computation
  • Represent objects as m-dimensional unit vectors
     (length-normalized), and use cosine for similarity:
                    
                                 
                                x y
sim ( x , y )  cos( x , y )    
                                      i xi yi  x  y 
                               x  y i xi2 i yi2
     Cosine Similarity


               cos 1 > cos 2 > cos 3




3
     2   1
     Efficient Cosine Similarity
• Average cosine similarity within a cluster c (to be
  maximized after merge):
                  1                   
       A(c)              xcsim ( x, y)
              c ( c  1) xc  y
                              

                                          
• Define sum vector for cluster: s (c)   x
                                                      
                                                  xc
      s (c )  s (c )     x  y
                             
                          xc yc

                                               
                         c ( c  1) A(c)   x  x
                                           
                                           xc


                         c ( c  1) A(c)  c
                                  
                          s (c )  s (c )  c
       A(c)          
                             c ( c  1)
                                
                  ( s ( c ))  ( s ( c ))  c
        A(c) 
                          c ( c  1)
                                                     
                  ( s ( ca )  s ( cb ))  ( s ( ca )  s ( cb ))  ( ca  c2 )
  A(ca  cb ) 
                                  ( ca  cb )( ca  cb  1)

• On merge, update sum:
                                       
      s (c j )  s (ca  cb )  s (ca )  s (cb )
• Then update A for all new pairs of clusters
• Complexity: O(n) per iteration (assuming constant
  dimensionality); Overall for algorithm: O(n2)
             Top-Down Clustering
Input: Set X = {x1,…,xn} of objects
       Coherence measure         coh: 2X
       Splitting function        split: 2X 2X  2X
C { X }
j 1
while C contains a cluster larger than 1 element do:
  ca  arg mincu coh(cu)
  (cj+1,cj+2)  split(ca)
  C  (C \ ca)  {cj+1,cj+2}
  j += 2
   Top-Down Clustering (cont.)
• Coherence measure – can be any type of cluster
  quality function, including those used for
  agglomerative clustering
   – Single link – maximal similarity in cluster
   – Complete link – minimal similarity in cluster
   – Group average – average pairwise similarity
• Split – can be handled as a clustering problem on
  its own, aiming to form two clusters
   – Can use hierarchical clustering, but often more natural
     to use non-hierarchical clustering (see next)
• May provide a more “global” view of the data (vs.
  the locally greedy bottom-up procedure)
    Non-Hierarchical Clustering
• Iterative clustering:
   – Start with initial (random) set of clusters
   – Assign each object to a cluster (or clusters)
   – Re-compute cluster parameters
                                   
      • E.g. centroids:           s (c ) 1         
                           (c )              
                                    | c | | c | xc
                                                
                                                    x

   – Stop when clustering is “good”


• Q: How many clusters?
              K-means Algorithm
Input: Set X = {x1,…,xn} of objects
        Distance measure d: X  X 
        Mean function : 2XX
Select k initial cluster centers f1, ..., fk
while not finished do:
  for all clusters cj do:
        cj  { xi | fj = arg minf d(xi, f) }
  for all means fj do:
        fj  (cj)

Complexity: O(n), assuming constant number of iterations
K-means Clustering
                   Example
Clustering of words from NY Times using
   cooccurring words
1.   ballot, polls, Gov, seats
2.   profit, finance, payments
3.   NFL, Reds, Sox, inning, quarterback, score
4.   researchers, science
5.   Scott, Mary, Barbara, Edward
           Buckshot Algorithm
• Often, random initialization for K-means works
  well
• If not:
   – Randomly choose n points
   – Run hierarchical group average clustering:
     O(( n)2)=O(n)
   – Use those cluster means as starting points for K-means
   – O(n) total complexity
• Scatter/Gather (Cutting, Karger, Pedersen, 1993)
   – An interactive clustering-based browsing scheme for
     text collections and search results (constant time
     improvements)
            The EM Algorithm
• Soft clustering method to solve
  * = arg max P(X | m())
   – Soft clustering – probabilistic association between
     clusters and objects ( - the model parameters)
Note: Any occurrence of the data consists of:
   – Observable variables: The objects we see
      • Words in each context
      • Word sequence in POS tagging
   – Hidden variables: Which cluster generated which
     object
      • Which sense generates each context
      • Underlying POS sequence
• Soft clustering can capture ambiguity (words) and
  multiple topics (documents)
             Two Principles
Expectation: If we knew  we could
 compute the expectation of the hidden
 variables (e.g, probability of x belonging to some
 cluster)


Maximization: If we knew the hidden
 structure, we could compute the maximum
 likelihood value of 
  EM for Word-sense “Discrimination”
Cluster the contexts (occurrences) of an ambiguous word,
  where each cluster constitutes a “sense”:

• Initialize randomly model parameters P(vj | sk) and P(sk)

• E-step: Compute P(sk | ci) for each context ci and sense sk
• M-step: Re-estimate the model parameters P(vj | sk) and
  P(sk), for context words and senses

• Continue as long as log-likelihood of all corpus contexts
  1 ≤ i ≤ I increases (EM – guaranteed to increase in each
  step till convergence to a maximum, possibly local):
 log  P(ci | sk ) P( sk )  log  P(ci | sk ) P( sk )
       i   k                      i      k
                         E-Step
• To compute P(ci | sk), use naive Bayes assumption:
               P(ci | sk )     P (v
                               v j sk
                                         j   | sk )


• For each context i & each sense sk, estimate the
  posterior probability hik = P(sk | ci) (an expected
  “count” of the sense for ci), using Bayes rule:
                     P(ci | sk ) P( sk )
              hik 
                    k ' P(ci | sk ' ) P(sk ' )
                      M-Step
Re-estimate parameters using maximum-
 likelihood estimation:


                        h      ci :v j ci   ik
        P (v j   |s )
                        h
                  k
                                              ik
                              j ci :v j ci



        P( s ) 
                 h       i    ik
                                         
                                           h      i   ik

                  h
            k
                                    ik             I
                      k   i
            Decision Procedure
Assign senses by (same as Naïve Bayes):
                                                         
    s( wi )  arg max sk log P( sk )   log P(v j | sk )
                         
                                       v j ci           
                                                          
• Can adjust the pre-determined number of senses k
  to get finer or coarser distinctions
   – Bank as physical location vs. abstract corporation

• If adding more senses doesn’t increase log-
  likelihood much, then stop
         Results (Schütze 1998)
Word        Sense                    Accuracy
suit        lawsuit                  95  0
            suit you wear            96  0
motion      physical movement        85  1
            proposal for action      88  13
train       line of railroad cars    79  19
            to teach                 55  33

• Works better for topic-related senses (given the
  broad-context features used)
• Improved IR performance by 14% - representing
  both query and document as senses, and
  combining results with word-based retrieval

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:6
posted:5/16/2012
language:English
pages:37
censhunay censhunay http://
About