Implementation of an Improved
CoBoost Algorithm for NEC


                  CMPT 825
                  Project Presentation
                  Yudong Liu
                  12/9/2003
    Preface
   CoBoost was proposed by Collins and Singer (1999)
    as an unsupervised framework for Named Entity
    Classification (NEC).
   It is based on the AdaBoost algorithm.
   It classifies names with over 91% accuracy after
    10000 rounds.
   My work implements an improved CoBoost
    algorithm, motivated by a feature selection
    strategy for BoostLoss in the parsing reranking task
    (Collins, 2000), and applies it to the original NER
    task. It turns out that the efficiency can be improved
    considerably.
    Agenda
   Problem description

   Algorithm description
       AdaBoost algorithm
       CoBoost algorithm
       More Observations
       Improved CoBoost algorithm

   Experiment Result

   Conclusion
    Problem Description
   The original task that CoBoost was applied to is learning a function
    from an input string to its type, which is assumed to be one of
    three categories: Person, Organization, or Location.
   Given 89305 examples, the only supervision is in the form of 7
    seed rules (namely, that New York, California and U.S. are
    locations; that any name containing Mr. is a person; that any
    name containing Incorporated is an organization; and that I.B.M.
    and Microsoft are organizations). This gives 8932 labelled and
    80370 unlabelled examples.
   An example: …, Mr. Doernberg, a professor of …
       Both a spelling feature (that the string contains Mr.) and a
        contextual feature (that professor modifies the string) are strong
        indications that Mr. Doernberg is of type Person.
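The seven seed rules above can be sketched as a small lookup. This is only an illustration: the names `FULL_STRING_SEEDS`, `CONTAINS_SEEDS` and `seed_label` are hypothetical, and the exact string forms used in the original data files are assumed.

```python
# Hypothetical encoding of the seven seed rules; string spellings are assumptions.
LOCATION, PERSON, ORGANIZATION = "Location", "Person", "Organization"

# Rules that fire only when the whole name matches.
FULL_STRING_SEEDS = {
    "New York": LOCATION,
    "California": LOCATION,
    "U.S.": LOCATION,
    "I.B.M.": ORGANIZATION,
    "Microsoft": ORGANIZATION,
}

# Rules that fire when the name contains the given word.
CONTAINS_SEEDS = {
    "Mr.": PERSON,
    "Incorporated": ORGANIZATION,
}


def seed_label(name):
    """Return the seed label for a name string, or None if no seed rule fires."""
    if name in FULL_STRING_SEEDS:
        return FULL_STRING_SEEDS[name]
    tokens = name.split()
    for word, label in CONTAINS_SEEDS.items():
        if word in tokens:
            return label
    return None
```

Names for which `seed_label` returns `None` would form the unlabelled part of the training set.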
    Problem Description (cont'd)
   Unlabelled data can give many such
    indications that two features should predict the
    same label.
   CoBoost takes advantage of this redundancy
    in the unlabelled data to build two classifiers in
    parallel.
   Accordingly, every example in the data set is
    represented by a (spelling, context) pair, and
    either the spelling or the context alone is assumed to be
    sufficient to classify an example.
    Problem Description (cont'd)
   Features: full-string=x, contains(x), allcap1, allcap2,
    nonalpha=x, context=x.
   Example 44123:
    X0_Ronald-Berke X01_president X2_Ronald X2_Berke X3_LEFT
       Classifier 1     X0_Ronald-Berke X2_Ronald X2_Berke
       Classifier 2     X01_president X3_LEFT
   So for CoBoost, the learning task is to find two
    classifiers f1 and f2 whose predictions agree with the labels
    on the labelled data and agree with each other on the
    unlabelled data as often as possible.
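A minimal sketch of how one feature line could be split into the two views. The prefix-to-view mapping is an assumption inferred only from Example 44123 (X01_ and X3_ go to the context classifier, everything else to the spelling classifier); `split_views` is a hypothetical helper name.

```python
# Assumption: features with these prefixes belong to the context view.
CONTEXT_PREFIXES = ("X01_", "X3_")


def split_views(line):
    """Split a whitespace-separated feature line into (spelling, context) lists."""
    spelling, context = [], []
    for feat in line.split():
        if feat.startswith(CONTEXT_PREFIXES):
            context.append(feat)
        else:
            spelling.append(feat)
    return spelling, context


example = "X0_Ronald-Berke X01_president X2_Ronald X2_Berke X3_LEFT"
view1, view2 = split_views(example)
```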
    Algorithm Description

   AdaBoost
   CoBoost Algorithm
   More Observations
   Improved CoBoost Algorithm
    AdaBoost--Base of CoBoost
   Proposed by Freund and Schapire (1997)
   Aims to produce a more accurate prediction rule by
    iteratively calling a weak supervised learner
   Main idea:
       Maintain a distribution, or set of weights, over the training
        set.
       Initially the weights are equal, but on each round the
        weights of incorrectly classified examples are increased (and,
        accordingly, the weights of correctly classified examples are
        decreased). The weight thus tends to concentrate
        on “hard” examples.
    AdaBoost--Base of CoBoost (cont'd)
   The weak learner's job is to find a weak hypothesis $h_t$
    appropriate for the current distribution over instances. The
    goodness of a weak hypothesis is measured by its
    classification error.
   Once the weak hypothesis $h_t$ has been received, a
    parameter $\alpha_t$, which intuitively measures the
    importance of $h_t$, is assigned to it.
   The distribution is then updated so as
    to increase the weight of examples misclassified by $h_t$
    and to decrease the weight of correctly classified
    examples.
   Finally, output $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
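The loop described above can be sketched as follows. This is a textbook-style illustration, not the code used in the project: the function names, the fixed pool of candidate hypotheses, and the choice of $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ (the standard setting for hypotheses with outputs in {-1, +1}) are assumptions.

```python
import math


def adaboost(examples, labels, hypotheses, rounds):
    """Minimal AdaBoost sketch. hypotheses: list of functions x -> +1 or -1."""
    n = len(examples)
    weights = [1.0 / n] * n  # start from the uniform distribution
    ensemble = []  # list of (alpha, hypothesis) pairs

    def weighted_error(h):
        return sum(w for w, x, y in zip(weights, examples, labels) if h(x) != y)

    for _ in range(rounds):
        # Weak learner: pick the candidate with the smallest weighted error.
        h = min(hypotheses, key=weighted_error)
        err = min(max(weighted_error(h), 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # Misclassified examples gain weight, the others lose weight.
        weights = [w * math.exp(-alpha * y * h(x))
                   for w, x, y in zip(weights, examples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak hypotheses."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1
```

On a toy problem where no single candidate is correct everywhere, a few rounds are enough for the weighted vote to fit all examples, which is the behaviour the "hard examples" reweighting is meant to produce.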
    CoBoost Algorithm (1)
   New target function: an “error bound”

   Decompose:



   Define:
CoBoost Algorithm(2)



    Some other definitions:
        Pseudo-label :




        For a hypothesis:
CoBoost Algorithm(3)


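    According to the derivation of Schapire and Singer
     (1999), with $W_+ > W_-$, the equation above is minimized by
     setting ($\varepsilon$ is for smoothing):

     $\alpha_t = \frac{1}{2} \ln\left(\frac{W_+ + \varepsilon}{W_- + \varepsilon}\right)$

     Accordingly, $Z_{co} = W_0 + 2\sqrt{W_+ W_-} = Z - \left(\sqrt{W_+} - \sqrt{W_-}\right)^2$

     where $Z = W_0 + W_+ + W_-$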
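The relation between the minimizing step and the two closed forms can be checked numerically. The sketch below assumes the usual reading of the symbols: $W_0$, $W_+$ and $W_-$ are the total weights of instances on which the hypothesis abstains, agrees with the pseudo-label, and disagrees with it. The function names and the toy weights are hypothetical.

```python
import math


def best_alpha(w_plus, w_minus, eps):
    """Smoothed step size: 0.5 * ln((W+ + eps) / (W- + eps))."""
    return 0.5 * math.log((w_plus + eps) / (w_minus + eps))


def z_value(w_zero, w_plus, w_minus, alpha):
    """Z as a function of alpha for a hypothesis that abstains on weight w_zero."""
    return w_zero + w_plus * math.exp(-alpha) + w_minus * math.exp(alpha)


# Toy weights, not taken from the experiments.
w_zero, w_plus, w_minus = 0.5, 0.4, 0.1
alpha = best_alpha(w_plus, w_minus, eps=0.0)
z_min = z_value(w_zero, w_plus, w_minus, alpha)
closed_form = w_zero + 2 * math.sqrt(w_plus * w_minus)
gain_form = (w_zero + w_plus + w_minus) - (math.sqrt(w_plus) - math.sqrt(w_minus)) ** 2
```

With no smoothing, the three quantities `z_min`, `closed_form` and `gain_form` coincide, so picking the feature with the largest gap between the square roots of $W_+$ and $W_-$ is the same as picking the one that lowers Z the most.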
    More Observations(1)
   Instances whose pseudo-labels are zero
    do not contribute to the target function:



   So in this case the weak learner is free to pick labels
    for these instances, which could speed up the
    boosting process.
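A short sketch of how zero pseudo-labels arise, assuming the pseudo-label definition of Collins and Singer (1999): a labelled instance keeps its label, and an unlabelled instance takes the sign of the other classifier's current score, which is 0 when that classifier has abstained so far. The helper names are hypothetical.

```python
def sign(value):
    """Return -1, 0 or +1."""
    return (value > 0) - (value < 0)


def pseudo_labels(labels, other_view_scores):
    """labels[i] is +1/-1 for labelled instances and None for unlabelled ones.

    other_view_scores[i] is the other classifier's current score on instance i.
    """
    return [y if y is not None else sign(g)
            for y, g in zip(labels, other_view_scores)]
```

Instances that come back with pseudo-label 0 are the ones for which the weak learner may pick a label freely.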
    More Observations(2)
   Suppose that on round t, feature k* is selected.
   Then in round t+1:
      ---in AdaBoost for supervised learning, the weights
    of those instances on which k* abstains are not
    updated:
                $D_{t+1}(i) = D_t(i) \exp(-\alpha_t y_i h_t(x_i))$
     ---but in CoBoost, we have to check the pseudo-label of
    each instance, even when k* abstains on it:
       $D_t^j(i) = \exp(-\tilde{y}_i g_{t-2}^j - \tilde{y}_i \alpha_{t-1} h_{t-1}^j)$
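The difference between the two updates can be illustrated with unnormalized weights. This is a sketch under the assumption that an abstaining hypothesis outputs 0; the function names and the numbers are hypothetical.

```python
import math


def adaboost_weight(prev_weight, alpha, y, h):
    """Supervised update: D_{t+1}(i) = D_t(i) * exp(-alpha * y * h)."""
    return prev_weight * math.exp(-alpha * y * h)


def coboost_weight(pseudo_label, g_prev, alpha, h):
    """CoBoost weight: exp(-y~ * g_{t-2} - y~ * alpha_{t-1} * h_{t-1})."""
    return math.exp(-pseudo_label * g_prev - pseudo_label * alpha * h)


# k* abstains on this instance, so h = 0 in both cases.
unchanged = adaboost_weight(0.25, alpha=0.8, y=1, h=0)
before_flip = coboost_weight(+1, g_prev=0.4, alpha=0.8, h=0)
after_flip = coboost_weight(-1, g_prev=0.4, alpha=0.8, h=0)
```

In the supervised case the weight stays at 0.25, while in CoBoost a pseudo-label that flips from +1 to -1 changes the weight even though k* abstained.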
    More Observations(3)
    In each round W+ and W- have to be recalculated for each
    feature. This computation involves about 70000 features
    and 90000 instances, so it is expensive.
    Actually we only need to consider a small number of features:
 ---the features co-occurring with k*
 ---the features whose instances have had their pseudo-labels changed
e.g. X2_Mr. 3 56 450 ………
   Due to the sparseness of the feature space, the number
    of features that have to be taken into account is much
    smaller (about 3 to 10000) than the number of features
    in the feature space, which is about 70000.
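A minimal sketch of the first restriction, collecting only the features that co-occur with k*. It assumes two lookup tables, feature to instance ids and instance id to features; the function name and table layout are hypothetical.

```python
def cooccurring_features(k_star, feature_to_instances, instance_to_features):
    """Features that share at least one instance with k_star (k_star excluded)."""
    candidates = set()
    for i in feature_to_instances.get(k_star, ()):
        candidates.update(instance_to_features[i])
    candidates.discard(k_star)
    return candidates
```

Because each instance carries only a handful of features, the returned set is usually far smaller than the full feature space.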
    Improved CoBoost Algorithm(1)
  Define some new functions:
    1. Sign of k
      for any feature k:
       if the feature occurs in labelled data
          sign(k)=1 if more of the instances containing k are
   labelled 1
          sign(k)=-1 otherwise
       if the feature occurs only in unlabelled data
          sign(k)=-1 if instances contain k
          sign(k)=1 otherwise
   Improved CoBoost Algorithm(2)
2. $A_k^+$, $A_k^-$, $B_i^+$, $B_i^-$
    $A_k^+$ = {x: x contains k and k has the same prediction
   as the pseudo-label of x}
    $A_k^-$ = {x: x contains k and k has the opposite
   prediction to the pseudo-label of x}
    $B_i^+$ = {k: k has the same prediction on instance i as the
   pseudo-label of i}
    $B_i^-$ = {k: k has the opposite prediction on instance i
   to the pseudo-label of i}
3. wt for each instance i
   weight(xi) = exp(-wt(i))
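The four index sets can be built in one pass over the data. The sketch assumes that feature k predicts sign(k) on every instance containing it and that instances with pseudo-label 0 are left out of all four sets; both points, and the names used, are assumptions.

```python
from collections import defaultdict


def build_sets(instance_features, pseudo, sign_of):
    """Build A+ / A- (feature -> instances) and B+ / B- (instance -> features).

    instance_features: {i: iterable of features}
    pseudo:            {i: pseudo-label in -1, 0, +1}
    sign_of:           {k: prediction of feature k, -1 or +1}
    """
    a_plus, a_minus = defaultdict(set), defaultdict(set)
    b_plus, b_minus = defaultdict(set), defaultdict(set)
    for i, feats in instance_features.items():
        if pseudo[i] == 0:
            continue  # no pseudo-label to agree or disagree with
        for k in feats:
            if sign_of[k] == pseudo[i]:
                a_plus[k].add(i)
                b_plus[i].add(k)
            else:
                a_minus[k].add(i)
                b_minus[i].add(k)
    return a_plus, a_minus, b_plus, b_minus
```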
Improved CoBoost Algorithm(3)
  For t=1 and j=1,2
  --set pseudo-labels
  --find the best hypothesis, the one with maximum $\sqrt{W_+} - \sqrt{W_-}$
  --freely pick labels for instances with pseudo-label 0
  --update g
 For t=2 and j=1,2

 --set pseudo-labels
 --update the weights of those instances whose pseudo-label
   changed
 --initialize $A_k^+$, $A_k^-$, $B_i^+$, $B_i^-$
 --find the best hypothesis in the naive way
 --update g
    Improved CoBoost Algorithm(4)
   For t=3 to T and j=1,2
  --set pseudo-labels
  --update $A_k^+$, $A_k^-$, $B_i^+$, $B_i^-$
  --for instances in $A_k^+$ and $A_k^-$
    .calculate the delta of their weights
    .update W+ and W- for the corresponding features in $B_i^+$ and $B_i^-$
  --for instances whose weights have changed but which are
   not in $A_k^+$ or $A_k^-$, apply the same process as above
  --for those features whose W+ or W- has
   changed, reconsider them for minimizing $Z_{co}$
  --update g
 Final hypothesis: $f(x) = \mathrm{sign}\left(\sum_{j=1}^{2} g_T^j(x_j)\right)$
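The incremental step in the loop above (compute a weight delta for one instance, then push it into W+ and W- of the affected features only) might look like the sketch below. The function name, the dictionary layout and the returned set of touched features are assumptions made for illustration.

```python
def apply_weight_change(i, old_weight, new_weight, b_plus, b_minus, w_plus, w_minus):
    """Propagate a weight change of instance i to the features that contain it.

    b_plus[i] / b_minus[i]: features agreeing / disagreeing with i's pseudo-label.
    w_plus / w_minus:       per-feature accumulated weights, updated in place.
    Returns the set of features whose W+ or W- changed.
    """
    delta = new_weight - old_weight
    changed = set()
    for k in b_plus.get(i, ()):
        w_plus[k] = w_plus.get(k, 0.0) + delta
        changed.add(k)
    for k in b_minus.get(i, ()):
        w_minus[k] = w_minus.get(k, 0.0) + delta
        changed.add(k)
    return changed
```

Only the features in the returned set need to be reconsidered when searching for the next best hypothesis.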
Multi-Class Problem

   Three candidate labels: loc, org and per
   Run CoBoost over the 3 labels in turn, as an innermost loop
   For the final label of a particular instance, if there is
    some disagreement, e.g. the signs for
    “Person” and “Location” are both positive,
    we pick the one with the larger
    “confidence” (abs(g)) as the final label.
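The tie-breaking rule can be sketched as follows. Returning `None` when no binary classifier is positive is an assumption, since that case is not covered above; `final_label` is a hypothetical name.

```python
def final_label(scores):
    """scores: {label: g}, where g > 0 means the binary classifier says yes.

    Among the labels with a positive score, return the one with the largest
    confidence abs(g); return None if no score is positive (an assumption).
    """
    positive = {label: g for label, g in scores.items() if g > 0}
    if not positive:
        return None
    return max(positive, key=lambda label: abs(positive[label]))
```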
Experimental Evaluation
   Training data set(5.4 M): 89305 examples
                                8932 labelled examples
   Testing data set:     431 examples randomly picked from
    training data set
   Features involved: 56649 for classifier 1
                          11824 for classifier 2
   In pre-processing, I build a hash table in which each
    key is a feature and the corresponding value is
    the list of instances that contain the feature in their
    representation. One such hash table is kept for each of the two
    classifiers.
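The hash table described above is an inverted index. A minimal sketch, assuming instances are numbered from 0 in file order and each is given as a list of feature strings (`build_index` is a hypothetical name):

```python
from collections import defaultdict


def build_index(instances):
    """instances: list of feature lists; returns {feature: [instance ids]}."""
    index = defaultdict(list)
    for i, feats in enumerate(instances):
        for k in set(feats):  # count a repeated feature once per instance
            index[k].append(i)
    return index
```

One index would be built from the spelling features and another from the context features.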
Experimental Evaluation (cont'd)
   The CoBoost algorithm was run for 10000 rounds on
    the training data. I record $g_T^1$, $g_T^2$, $g_T^1 + g_T^2$ and the final
    label (1 or -1) for each instance on the 3 labels and
    write an output file at rounds 10, 100, 1000 and 10000.
   $\varepsilon = 1/(2n)$, $n = 89305$
   The computation took roughly 20 hours on an Athlon
    1.2GHz server (shared CPU)
Final Result
   On round 10,
       the accuracy, the coverage and the agreement are
        34%, 95% and 5%
       “Org” prediction: coverage 90.2%, agreement 89.6%
       “Location” prediction: coverage 91.5%, agreement 52.5%
       “Person” prediction: coverage 93.1%, agreement 42.1%
   On rounds 100, 1000 and 10000, these
    results remain unchanged.
   The result is far from the original result.
    Conclusion and future work
   An improved CoBoost algorithm is proposed, based on
    the original CoBoost algorithm and a feature
    selection strategy from the parsing reranking problem.
   The result is disappointing compared with the original one. Possible causes:
        a buggy implementation
        an improper assumption
        incomplete testing data
   By exploiting the redundancy of features, CoBoost gives an
    impressive result on the NER task. Its idea suggests
    exploring other problems with a similar property,
    especially in semi-supervised and unsupervised
    learning tasks.
   References
1. Michael Collins. Discriminative Reranking for Natural
    Language Parsing. In International Conference on Machine
    Learning, 2000.
2. Steven Abney, Robert E. Schapire & Yoram Singer. Boosting
    Applied to Tagging and PP Attachment. In Proc. of the
    Joint SIGDAT Conference on Empirical Methods in Natural
    Language Processing and Very Large Corpora.
3. Yoav Freund & Robert E. Schapire. A Short Introduction to
    Boosting. Journal of Japanese Society for Artificial
    Intelligence, 14(5):771-780, September 1999.
4. Robert E. Schapire & Yoram Singer. Improved Boosting
    Algorithms Using Confidence-rated Predictions. Machine
    Learning, 37(3):297-336, 1999.

						