

Focused Named Entity Recognition
Using Machine Learning

Li Zhang, Yue Pan, Tong Zhang
IBM China Research Laboratory
What is Focused Named Entity?
   The most “topical” named entities in a document
   Human agreement on which NEs are focused is
    about 60%~80% (F1-measure, discussed later)
   Focused NEs are helpful in:
       Summarization
       Search result ranking
       Topic detection and tracking
Purpose of This Paper
   To propose a machine learning
    method to recognize focused named
    entities by converting the
    recognition problem into a
    classification problem.
   To propose an effective feature for
    automatic summarization, search
    result ranking, and topic detection
    and tracking.
Assumptions and Preprocessing
   Documents with titles
   Named entities in documents should be annotated in advance
   Coreferences of named entities should be resolved
       A simple rule-based approach is taken in this
        paper to resolve coreferences:
        1.   Partitioning
        2.   Pair-wise comparisons
        3.   Clustering
   Named entities in documents are either focused
    or not. (Binary classification)
   Nine features are considered and
    encoded in this system (discussed later)
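The three-step coreference procedure above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact rules: the substring-matching heuristic and the data layout are assumptions.

```python
# Hypothetical sketch of the three-step rule-based coreference resolution:
# 1. partition mentions by NE type, 2. compare pairs with a simple
# substring heuristic, 3. merge matching mentions into clusters.

def resolve_coreferences(mentions):
    """mentions: list of (surface_string, ne_type) tuples."""
    # 1. Partitioning: group mentions by entity type.
    by_type = {}
    for surface, ne_type in mentions:
        by_type.setdefault(ne_type, []).append(surface)

    clusters = []
    for surfaces in by_type.values():
        # 2. Pair-wise comparisons + 3. Clustering: a mention joins an
        # existing cluster if it is a substring of (or contains) a member.
        type_clusters = []
        for s in surfaces:
            for cluster in type_clusters:
                if any(s in m or m in s for m in cluster):
                    cluster.append(s)
                    break
            else:
                type_clusters.append([s])
        clusters.extend(type_clusters)
    return clusters
```

For example, "IBM" and "IBM Corp" (both ORG) end up in one cluster, while "Beijing" (LOC) forms its own.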
Classification Methods 1
1.   Decision Tree
        Tree Growing:
           Greedily splitting each tree node based on a
            certain figure of merit
           Similar to the C4.5 program described in C4.5:
            Programs for Machine Learning (Quinlan, 1993)
        Tree Pruning:
            Removing over-fitted branches so that the
             remaining portion would have a better prediction
            Using a Bayesian model combination approach
            See A Decision-Tree-Based Symbolic Rule Induction System
             for Text Categorization (Johnson et al.) for details
Classification Methods 2
2.   Naïve Bayes
        A linear classification method:
         ω: weight vector
         θ: threshold
         x: feature vector
         y: classification label

         $\omega^T x \ge \theta \Rightarrow y = +1$
         $\omega^T x < \theta \Rightarrow y = -1$

        Training (for a class $c$):

         $\omega_j = \log \frac{\lambda + \sum_{i:\, y_i = c} x_{i,j}}{\lambda d + \sum_{j'=1}^{d} \sum_{i:\, y_i = c} x_{i,j'}}$

         with the class prior $\log \frac{|\{i : y_i = c\}|}{n}$
         absorbed into the threshold θ

        λ is fixed to 1, which corresponds to
         Laplacian smoothing.
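A minimal sketch of the training step for one class, assuming x holds non-negative integer feature counts and λ = 1 (Laplacian smoothing); the function name is illustrative:

```python
import math

# Multinomial naive Bayes weights for one class c, following the slide's
# formula: omega_j = log((lam + sum_{i: y_i=c} x_ij) / (lam*d + total)),
# plus the log class prior log(|{i: y_i = c}| / n).

def train_naive_bayes_class(x, y, c, lam=1.0):
    d = len(x[0])
    rows = [xi for xi, yi in zip(x, y) if yi == c]   # examples of class c
    total = sum(sum(xi) for xi in rows)              # all counts in class c
    omega = [math.log((lam + sum(xi[j] for xi in rows)) / (lam * d + total))
             for j in range(d)]
    prior = math.log(len(rows) / len(y))             # log class prior
    return omega, prior
```

With x = [[2, 0], [0, 1], [1, 1]] and y = [+1, +1, -1], class +1 gets weights log(3/5) and log(2/5) and prior log(2/3).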
Classification Methods 3
3.   Robust Risk Minimization Method
        Also a linear prediction method
         $p(x) = \omega^T x + b, \quad \begin{cases} p(x) \ge 0 \Rightarrow y = +1 \\ p(x) < 0 \Rightarrow y = -1 \end{cases}$
        The classification error is defined as:
         $I(p(x), y) = \begin{cases} 1 & \text{if } p(x)\,y \le 0 \\ 0 & \text{if } p(x)\,y > 0 \end{cases}$
        A very natural approach is to find weights $(\hat{\omega}, \hat{b})$ that
         minimize the average classification error on the training set:
         $(\hat{\omega}, \hat{b}) = \arg\min_{\omega, b} \frac{1}{n} \sum_{i=1}^{n} I(\omega^T x_i + b,\; y_i)$
        However, this is a typical NP-hard problem.
Classification Methods 3 (cont.)
     A more practical loss function should be used
      instead of I(p, y).
     Many loss functions work well for related classification
      problems. (Zhang & Oles, 2001; Li & Yang, 2003)
     The specific loss function considered in this paper is:

      $h(p, y) = \begin{cases} -2py & py < -1 \\ \tfrac{1}{2}(py - 1)^2 & py \in [-1, 1] \\ 0 & py > 1 \end{cases}$

     The linear weights are computed by:

      $(\hat{\omega}, \hat{b}) = \arg\min_{\omega, b} \frac{1}{n} \sum_{i=1}^{n} h(\omega^T x_i + b,\; y_i)$
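The piecewise loss above translates directly into code; the `empirical_risk` helper for the averaged objective is an illustrative name, not from the paper:

```python
def h(p, y):
    """Robust classification loss: a smooth surrogate for the 0-1 error."""
    py = p * y
    if py < -1:
        return -2.0 * py          # linear penalty for badly misclassified points
    if py <= 1:
        return (py - 1) ** 2 / 2  # quadratic near the decision boundary
    return 0.0                    # no penalty for confident correct predictions

def empirical_risk(w, b, xs, ys):
    """Average loss (1/n) * sum_i h(w.x_i + b, y_i) over the training set."""
    n = len(xs)
    return sum(h(sum(wj * xj for wj, xj in zip(w, x)) + b, y)
               for x, y in zip(xs, ys)) / n
```

Note h is continuous and differentiable at py = -1 (both branches give value 2 and slope -2), which is what makes it minimizable by standard numerical methods, unlike I(p, y).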
Features
   Entity Type
        Each NE type corresponds to a binary feature component.
        For example, a person type is encoded as [1 0 0 0].
   In Title or Not
        A binary feature (0 or 1)
   Entity Frequency
        The number of times that the NE occurs in the document.
   Entity Distribution
        If an NE occurs in many different parts of a document, then it is more likely
         to be important.
        The entropy of the probability distribution that measures how evenly an NE
         is distributed in a document is exploited.
        Suppose that each NE’s probability distribution is given by:
         $\{p_1, \ldots, p_i, \ldots, p_m\}$ where $p_i = \frac{\text{occurrences in the } i\text{th section}}{\text{total occurrences in the doc.}}$
        Entropy: $-\sum_{i=1}^{m} p_i \log p_i$, where m is set to 10
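A small sketch of the entropy feature, assuming each occurrence is given as a token offset and the document is split into m equal sections:

```python
import math

def distribution_entropy(positions, doc_len, m=10):
    # Split the document into m equal sections, form the NE's occurrence
    # distribution over sections, and return its entropy.
    counts = [0] * m
    for pos in positions:                          # token offsets of the NE
        counts[min(pos * m // doc_len, m - 1)] += 1
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)
```

An NE occurring once has entropy 0; one spread evenly over all ten sections reaches the maximum log 10, matching the intuition that evenly distributed entities are more likely important.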
Features (cont.)
   Entity Neighbor
        Context window=1
        There are five types of neighboring words: PER, LOC, ORG, other
         NEs, and normal words.
        Ex. [1, 0, 0, 0, 0,   0, 0, 0, 0, 1]
             (LEFT five)      (RIGHT five)
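The ten-component neighbor vector can be built as follows; the category names and their ordering are assumptions for illustration. The example reproduces [1, 0, 0, 0, 0, 0, 0, 0, 0, 1] for a PER left neighbor and a normal-word right neighbor:

```python
# Hypothetical encoding of the entity-neighbor feature: a 10-dimensional
# binary vector, 5 slots for the left neighbor's category and 5 for the
# right neighbor's.
CATEGORIES = ["PER", "LOC", "ORG", "OTHER_NE", "NORMAL"]

def encode_neighbors(left, right):
    vec = [0] * 10
    vec[CATEGORIES.index(left)] = 1        # left neighbor, slots 0-4
    vec[5 + CATEGORIES.index(right)] = 1   # right neighbor, slots 5-9
    return vec
```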
   First Sentence Occurrence
        A binary feature (0 or 1)
   Document Has Entity in Title or Not
        A binary feature (0 or 1)
   Total Entity Count
        Total number of NEs in the document (Integer)
   Document Frequency in the Corpus
        When this feature is used, the entity frequency feature will be
         computed as (tf/docsize) ∗ log(N/df), where df is the number of
         documents that an NE occurs in and N is the total number of
         documents in the corpus.
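The corpus-weighted frequency above is a TF-IDF-style score; as a one-liner (the function name is hypothetical):

```python
import math

def weighted_entity_frequency(tf, doc_size, df, n_docs):
    # (tf / docsize) * log(N / df): term frequency normalized by document
    # length, scaled by inverse document frequency over the corpus.
    return (tf / doc_size) * math.log(n_docs / df)
```

E.g. an NE occurring twice in a 100-token document, and in 10 of 1,000 corpus documents, scores 0.02 · log 100.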
Features (cont.)
   Since the decision tree and naïve Bayes
    methods only take integer features,
    floating-point features are encoded as integer
    values using a simple equal-interval
    binning method.
   If a feature x is observed to have values
    bounded by xmin and xmax, then the bin
    width is computed by δ = (xmax−xmin)/10
   The method is applied to each continuous
    feature independently.
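Equal-interval binning as described above, with ten bins of width δ = (xmax − xmin)/10; clipping the maximum value into the last bin is an implementation assumption:

```python
def equal_interval_bin(values, n_bins=10):
    # Split [min, max] into n_bins bins of equal width and map each value
    # to its bin index, applied to one continuous feature at a time.
    lo, hi = min(values), max(values)
    delta = (hi - lo) / n_bins
    if delta == 0:                 # constant feature: everything in bin 0
        return [0] * len(values)
    return [min(int((v - lo) / delta), n_bins - 1) for v in values]
```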
Experiments
   Test corpus:
       Beijing Youth Daily
        news in November
       The NEs in each
        document were
        annotated beforehand
   Human Agreement:
       Twelve people were
        invited to mark
        focused NEs in 20
        documents from
        the corpus. Nine of
        them are NLP
        researchers.
Experiments (cont.)
    Two data sets:
    1.   The whole corpus of 1,325 articles
    2.   A subset of 726 articles with NEs in their titles
    Baseline models:
    1.   Marking NEs in titles as the foci
    2.   Marking the most frequent NEs as the foci
    3.   A combination of the above two, selecting
         those NEs either in the title or occurring
         most frequently
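The three baselines can be sketched as follows; the `entities` mapping and the function name are illustrative:

```python
# entities maps NE -> (in_title: bool, frequency: int) for one document.

def baseline_foci(entities, model):
    if model == "title":
        # Baseline 1: NEs appearing in the title.
        return {e for e, (in_title, _) in entities.items() if in_title}
    max_freq = max(freq for _, freq in entities.values())
    frequent = {e for e, (_, freq) in entities.items() if freq == max_freq}
    if model == "frequency":
        # Baseline 2: most frequent NEs.
        return frequent
    # Baseline 3: NEs either in the title or occurring most frequently.
    return frequent | baseline_foci(entities, "title")
```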
Experiments (cont.)
   The corpus-level feature (experiment F versus G) has
    different impacts on the three algorithms. It is a good feature
    for naïve Bayes, but not for the RRM and decision tree,
    for reasons that are unclear.
   RRM appears to have the best overall performance.
   The naïve Bayes method requires all features to be
    independent, which is a quite unrealistic assumption in
    practice.
   The main problem for the decision tree is that it easily fragments
    the data, so that the probability estimates at the leaf nodes
    become unreliable. This is also the reason why voted decision
    trees perform better.
   The decision tree can find rules readable by a human. For
    example, one such rule reads as:
        If a named entity appears at least twice, its left and right neighbors are normal
         words, its discrete distribution entropy is greater than 2, and the entity appears in
         the title, then the probability of it being a focused entity is 0.87.
