Implementation of an Improved
CoBoost Algorithm for NEC
CMPT 825
Project Presentation
Yudong Liu
12/9/2003
Preface
CoBoost was proposed by Collins and Singer (1999)
as an unsupervised framework for Named Entity
Classification (NEC).
It is based on the AdaBoost algorithm.
It classifies names with over 91% accuracy after
10000 rounds.
My work implements an improved CoBoost
algorithm, motivated by the feature selection
strategy for BoostLoss in the parse reranking task
(Collins, 2000), and applies it to the original NER
task. It turns out that the efficiency can be improved
considerably.
Agenda
Problem description
Algorithm description
AdaBoost algorithm
CoBoost algorithm
More Observations
Improved CoBoost algorithm
Experiment Result
Conclusion
Problem Description
The original task that CoBoost was applied to is learning a function
from an input string to its type, which is assumed to be one of
three categories: Person, Organization, or Location.
Given 89305 examples, the only supervision is in the form of 7
seed rules: New York, California, and U.S. are
locations; any name containing Mr. is a person; any
name containing Incorporated is an organization; and I.B.M.
and Microsoft are organizations. So we get 8932 labelled and
80370 unlabeled examples.
An example: …, Mr. Doernberg, a professor of …
Both a spelling feature (the string contains Mr.) and a
contextual feature (professor modifies the string) are strong
indications that Mr. Doernberg is a Person.
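As a rough illustration, the seven seed rules could be applied with a function like the one below. This is only a sketch: the function name `seed_label`, the whitespace tokenization, the order in which the rules are checked, and the spelling "I.B.M." with a trailing period are my assumptions, not the original implementation.

```python
from typing import Optional

# Seed rules from the problem description; the exact string forms are assumed.
SEED_LOCATIONS = {"New York", "California", "U.S."}
SEED_ORGANIZATIONS = {"I.B.M.", "Microsoft"}


def seed_label(name: str) -> Optional[str]:
    """Return the label given by the seed rules, or None if no rule fires."""
    if name in SEED_LOCATIONS:
        return "Location"
    if name in SEED_ORGANIZATIONS:
        return "Organization"
    tokens = name.split()  # assumption: simple whitespace tokenization
    if "Mr." in tokens:
        return "Person"
    if "Incorporated" in tokens:
        return "Organization"
    return None  # the example stays unlabeled
```

Examples for which `seed_label` returns None form the unlabeled set.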
Problem Description (cont'd)
In fact, unlabeled data can give many such
indications that two features should predict the
same label.
CoBoost takes advantage of this redundancy
in the unlabeled data to build two classifiers in
parallel.
Accordingly, every example in the data set is
represented by a pair (spelling, context), and
either the spelling or the context alone is assumed to be sufficient to
classify an example.
Problem Description (cont'd)
Features: Full-string=x, contains(x), allcap1, allcap2,
nonalpha=x, context=x.
Example 44123:
X0_Ronald-Berke X01_president X2_Ronald X2_Berke X3_LEFT
Classifier 1 X0_Ronald-Berke X2_Ronald X2_Berke
Classifier 2 X01_president X3_LEFT
So for CoBoost, the learning task is to find two
classifiers f1 and f2 that agree with the labels on the
labelled data and agree with each other on the
unlabeled data as often as
possible.
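The split of Example 44123 into two views can be sketched as follows. The prefix-to-view mapping (X0 and X2 for classifier 1, X01 and X3 for classifier 2) is inferred from that single example; other prefixes in the real data are not covered, and the function name `split_views` is hypothetical.

```python
# Prefix-to-view assignment inferred from Example 44123 only (an assumption).
SPELLING_PREFIXES = {"X0", "X2"}   # full string, contained words
CONTEXT_PREFIXES = {"X01", "X3"}   # context word, context type


def split_views(line):
    """Split one whitespace-separated feature line into (spelling, context)."""
    spelling, context = [], []
    for feature in line.split():
        prefix = feature.split("_", 1)[0]
        if prefix in SPELLING_PREFIXES:
            spelling.append(feature)
        elif prefix in CONTEXT_PREFIXES:
            context.append(feature)
        else:
            # Prefixes not seen in the example are rejected rather than guessed.
            raise ValueError(f"unknown feature prefix: {feature}")
    return spelling, context


example = "X0_Ronald-Berke X01_president X2_Ronald X2_Berke X3_LEFT"
view1, view2 = split_views(example)
```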
Algorithm Description
AdaBoost
CoBoost Algorithm
More Observations
Improved CoBoost Algorithm
AdaBoost--Base of CoBoost
Proposed by Freund and Schapire (1997)
Aims to produce a more accurate prediction rule by
iteratively calling a weak supervised learner
Main idea:
maintain a distribution, or set of weights, over the training
set
Initially, the weights are set equally, but on each round the
weights of incorrectly classified examples are increased (and
accordingly, the weights of correctly classified examples are
decreased). This means the weight tends to concentrate
on “hard” examples.
AdaBoost--Base of CoBoost
The weak learner's job is to find a weak hypothesis $h_t$
appropriate for the current distribution over instances. The
goodness of a weak hypothesis is measured by its
classification error.
Once the weak hypothesis $h_t$ has been received, a
parameter $\alpha_t$, which intuitively measures the
importance of $h_t$, is assigned to it.
The distribution is then updated so as
to increase the weight of examples misclassified by $h_t$
and to decrease the weight of correctly classified
examples.
Finally, output $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
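The AdaBoost loop described above can be sketched in a few lines. This is a minimal textbook version, assuming binary labels in {-1, +1} and a fixed pool of candidate weak hypotheses; the names `adaboost`, `predict`, and the toy hypotheses `h_a`, `h_b`, `h_c` are illustrative only and are not part of the project code.

```python
import math


def adaboost(xs, ys, hypotheses, rounds):
    """Return a list of (alpha_t, h_t) pairs; labels ys are in {-1, +1}."""
    n = len(xs)
    dist = [1.0 / n] * n                      # initially equal weights
    ensemble = []
    for _ in range(rounds):
        def weighted_error(h):
            return sum(w for w, x, y in zip(dist, xs, ys) if h(x) != y)

        # Weak learner: pick the hypothesis with the lowest weighted error.
        h_t = min(hypotheses, key=weighted_error)
        err = weighted_error(h_t)
        if err >= 0.5:                        # no better than chance: stop
            break
        err = max(err, 1e-10)                 # avoid division by zero
        alpha_t = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha_t, h_t))
        # Misclassified examples gain weight, correct ones lose weight.
        dist = [w * math.exp(-alpha_t * y * h_t(x))
                for w, x, y in zip(dist, xs, ys)]
        total = sum(dist)
        dist = [w / total for w in dist]
    return ensemble


def predict(ensemble, x):
    """Final output: sign of the alpha-weighted vote."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1


# Toy data: no single hypothesis below is correct on all four points.
xs, ys = [0, 1, 2, 3], [1, 1, -1, 1]
h_a = lambda x: 1 if x < 1.5 else -1
h_b = lambda x: 1
h_c = lambda x: -1 if x < 2.5 else 1
ensemble = adaboost(xs, ys, [h_a, h_b, h_c], rounds=3)
```

On this toy set each hypothesis alone makes at least one mistake, while the weighted vote of the three classifies all four points correctly.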
CoBoost Algorithm (1)
New target function----
“error bound”
Decompose:
Define:
CoBoost Algorithm(2)
Some other definitions:
Pseudo-label :
For a hypothesis:
CoBoost Algorithm(3)
According to the derivation of Schapire and Singer
(1999), with W+ > W-, the equation above is minimized by
setting ($\epsilon$ is for smoothing):
$\alpha_t = \frac{1}{2} \ln\left(\frac{W_+ + \epsilon}{W_- + \epsilon}\right)$
Accordingly, $Z_{co} = W_0 + 2\sqrt{W_+ W_-} = Z - \left(\sqrt{W_+} - \sqrt{W_-}\right)^2$
where $Z = W_0 + W_+ + W_-$
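For concreteness, here is a small numeric sketch of the confidence-rated update of Schapire and Singer (1999): the smoothed weight alpha = 1/2 ln((W+ + eps)/(W- + eps)) and the resulting bound W0 + 2 sqrt(W+ W-). The function name and the sample values are made up for illustration.

```python
import math


def alpha_and_bound(w_plus, w_minus, w_zero, eps):
    """Smoothed confidence alpha and the bound W0 + 2*sqrt(W+ * W-)."""
    alpha = 0.5 * math.log((w_plus + eps) / (w_minus + eps))
    bound = w_zero + 2.0 * math.sqrt(w_plus * w_minus)
    return alpha, bound


# Hypothetical weights: correct, incorrect, and abstaining mass.
w_plus, w_minus, w_zero = 0.49, 0.09, 0.42
alpha, bound = alpha_and_bound(w_plus, w_minus, w_zero, eps=0.0)

# The bound can also be written as total - (sqrt(W+) - sqrt(W-))^2, so the
# best feature is the one with the largest |sqrt(W+) - sqrt(W-)|.
total = w_plus + w_minus + w_zero
gap = (math.sqrt(w_plus) - math.sqrt(w_minus)) ** 2
```

This is why searching for the feature with maximum |sqrt(W+) - sqrt(W-)| is equivalent to minimizing the bound when the total weight is fixed.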
More Observations(1)
Those instances with their pseudo-labels being zero
will not contribute to target function:
So in this case the weak learner is free to pick labels
for these instances, which could speed up the
boosting process.
More Observations(2)
Suppose that on round t, feature k* is selected.
Then in round t+1:
---in AdaBoost for supervised learning, the weights
of those instances on which k* abstains are not
updated:
$D_{t+1}(i) = D_t(i) \exp(-\alpha_t y_i h_t(x_i))$
---but in CoBoost, we have to check the pseudo-label of
each instance, even when k* abstains on it:
$D_t^j(i) = \exp\left(-\tilde{y}_i g_{t-2}^j(x_{j,i}) - \tilde{y}_i \alpha_{t-1} h_{t-1}^j(x_{j,i})\right)$
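The difference between the two cases can be seen in a tiny sketch. It assumes unnormalized weights and a hypothesis output of 0 for "abstain"; the function names and numbers are illustrative, not taken from the project code.

```python
import math


def adaboost_update(weights, labels, outputs, alpha):
    """Unnormalized supervised update D(i) * exp(-alpha * y_i * h(x_i)).

    outputs[i] is h(x_i) in {-1, 0, +1}; 0 means the feature abstains.
    """
    return [w * math.exp(-alpha * y * h)
            for w, y, h in zip(weights, labels, outputs)]


def coboost_weight(pseudo_label, g_value):
    """Unnormalized CoBoost weight exp(-pseudo_label * g(x_i))."""
    return math.exp(-pseudo_label * g_value)


weights = [1.0, 1.0, 1.0]
new_weights = adaboost_update(weights, labels=[1, -1, 1],
                              outputs=[1, 0, -1], alpha=0.5)

# Supervised case: on instance 1 the feature abstains, so the factor is
# exp(0) = 1 and the weight is untouched.
unchanged = new_weights[1] == weights[1]

# CoBoost case: g is the same, but the pseudo-label (set by the other
# classifier) flipped, so the weight changes although nothing fired here.
before = coboost_weight(+1, 0.5)
after = coboost_weight(-1, 0.5)
```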
More Observations(3)
In each round, W+ and W- have to be recalculated for each
feature. This computation involves about 70000 features
and 90000 instances. It is expensive!
Actually, only a small number of features need to be considered:
---the features co-occurring with k*
---the features whose instances have had their pseudo-labels changed
e.g.: X2_Mr. 3 56 450 ………
Due to the sparseness of the feature space, the number
of features that have to be taken into account (about 3 to 10000)
is much smaller than the number of features
in the feature space, which is about 70000.
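One way to collect the features that need re-examination is sketched below, assuming two lookup tables (feature to instance ids, and instance id to features) are available. The function name and all entries other than `X2_Mr. 3 56 450` are hypothetical.

```python
def features_to_revisit(k_star, feature_to_instances, instance_to_features,
                        changed_instances):
    """Features co-occurring with k_star, plus features of changed instances."""
    affected = set(feature_to_instances.get(k_star, ()))
    affected.update(changed_instances)
    revisit = set()
    for instance_id in affected:
        revisit.update(instance_to_features.get(instance_id, ()))
    return revisit


# Toy tables; only the X2_Mr. row comes from the slide.
feature_to_instances = {
    "X2_Mr.": [3, 56, 450],
    "X2_Doernberg": [3],
    "X0_Mr.-Smith": [56],
    "X2_Ronald": [7],
}
instance_to_features = {
    3: ["X2_Mr.", "X2_Doernberg"],
    56: ["X2_Mr.", "X0_Mr.-Smith"],
    450: ["X2_Mr."],
    7: ["X2_Ronald"],
}
only_cooccurring = features_to_revisit(
    "X2_Mr.", feature_to_instances, instance_to_features, changed_instances=[])
with_changed = features_to_revisit(
    "X2_Mr.", feature_to_instances, instance_to_features, changed_instances=[7])
```

The cost per round then depends on how many instances contain k* or changed pseudo-label, not on the size of the whole feature space.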
Improved CoBoost Algorithm(1)
Define some new functions:
1. Sign of k
for any feature k:
if the feature occurs in labelled data
sign(k)=1 if more of the instances containing k are
labelled 1
sign(k)=-1 otherwise
if the feature occurs only in unlabeled data
sign(k)=-1 if instances contain k
sign(k)=1 otherwise
Improved CoBoost Algorithm(2)
2. $A_k^+$, $A_k^-$, $B_i^+$, $B_i^-$
$A_k^+$ = {x: x contains k and k has the same prediction
as the pseudo-label of x}
$A_k^-$ = {x: x contains k and k has the opposite
prediction to the pseudo-label of x}
$B_i^+$ = {k: k has the same prediction on instance i as the
pseudo-label of i}
$B_i^-$ = {k: k has the opposite prediction on instance i
to the pseudo-label of i}
3. wt for each instance i
weight(xi)=exp(-1 *wt(i))
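The four index sets can be built in one pass over the instances. The sketch below assumes that each feature k predicts sign(k) on every instance containing it and that instances with pseudo-label 0 belong to none of the sets; both points, and all names, are my reading of the definitions rather than the project code.

```python
from collections import defaultdict


def build_sets(instance_features, feature_sign, pseudo_label):
    """Return (A_plus, A_minus, B_plus, B_minus) as dicts of sets.

    A_plus[k]  : instances containing k whose pseudo-label equals sign(k)
    A_minus[k] : instances containing k whose pseudo-label is the opposite
    B_plus[i]  : features on instance i that agree with its pseudo-label
    B_minus[i] : features on instance i that disagree with it
    """
    a_plus, a_minus = defaultdict(set), defaultdict(set)
    b_plus, b_minus = defaultdict(set), defaultdict(set)
    for i, features in instance_features.items():
        y = pseudo_label.get(i, 0)
        if y == 0:
            continue  # assumption: pseudo-label 0 contributes to no set
        for k in features:
            if feature_sign[k] == y:
                a_plus[k].add(i)
                b_plus[i].add(k)
            else:
                a_minus[k].add(i)
                b_minus[i].add(k)
    return a_plus, a_minus, b_plus, b_minus


# Toy data with hypothetical feature names "a" and "b".
instance_features = {0: ["a", "b"], 1: ["a"], 2: ["b"]}
feature_sign = {"a": 1, "b": -1}
pseudo_label = {0: 1, 1: -1, 2: 0}
a_plus, a_minus, b_plus, b_minus = build_sets(
    instance_features, feature_sign, pseudo_label)
```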
Improved CoBoost Algorithm(3)
For t=1 and j=1,2
--set pseudo-labels
--find the best hypothesis, the one with maximum $|\sqrt{W_+} - \sqrt{W_-}|$
--freely pick labels for instances with pseudo-label 0
--update g
For t=2 and j=1,2
--set pseudo-labels
--update the weights of those instances whose pseudo-
label has changed
--initialize $A_k^+$, $A_k^-$, $B_i^+$, $B_i^-$
--find the best hypothesis in the naïve way
--update g
Improved CoBoost Algorithm(4)
For t=3 to T and j=1,2
--set pseudo-labels
--update $A_k^+$, $A_k^-$, $B_i^+$, $B_i^-$
--for instances in $A_k^+$ and $A_k^-$:
.calculate the delta of their weights
.update W+, W- for the corresponding features in $B_i^+$ and $B_i^-$
--for instances whose weights have changed but which are
not in $A_k^+$ or $A_k^-$, follow the same process as above
--for those features whose W+ and W- have
changed, re-evaluate them when minimizing $Z_{co}$
--update g
Final hypothesis: $f(x) = \mathrm{sign}\left(\sum_{j=1}^{2} g_j^T(x_j)\right)$
Multi-Class Problem
Three candidate labels: loc, org, and per
Run CoBoost over the 3 labels in turn, as an innermost loop
For the final label of a particular instance, if there is
a disagreement, e.g. the signs of
“Person” and “Location” are both positive,
we pick the one with the larger
“confidence” (abs(g)) as the final label.
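The tie-breaking rule can be written down directly. In this sketch `scores` maps each label to the g value of its binary run; returning None when no label is positive is my assumption, since that case is not covered above.

```python
def final_label(scores):
    """Pick the positive label with the largest confidence abs(g)."""
    positive = {label: g for label, g in scores.items() if g > 0}
    if not positive:
        return None  # assumption: no binary run claimed this instance
    return max(positive, key=lambda label: abs(positive[label]))


# "per" and "loc" are both positive; "per" has the larger confidence.
# "org" has the largest abs(g) overall but is negative, so it is ignored.
label = final_label({"per": 0.8, "loc": 0.3, "org": -1.2})
```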
Experimental Evaluation
Training data set (5.4 M): 89305 examples
8932 labelled examples
Testing data set: 431 examples randomly picked from
the training data set
Features involved: 56649 for classifier 1
11824 for classifier 2
In pre-processing, I build a hash table in which each
key is a feature and the corresponding value is
the list of instances that contain the feature in their
representation. One such hash table is kept for each of the two
classifiers.
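A minimal sketch of this pre-processing step, assuming each instance has already been split into its two feature lists. The function name `build_feature_tables` and the second toy instance are hypothetical.

```python
from collections import defaultdict


def build_feature_tables(instances):
    """Build one feature -> [instance ids] table per classifier.

    instances is a list of (classifier_1_features, classifier_2_features)
    pairs; the position in the list is used as the instance id.
    """
    tables = (defaultdict(list), defaultdict(list))
    for instance_id, views in enumerate(instances):
        for table, features in zip(tables, views):
            # dict.fromkeys removes duplicates while keeping the order.
            for feature in dict.fromkeys(features):
                table[feature].append(instance_id)
    return dict(tables[0]), dict(tables[1])


instances = [
    (["X0_Ronald-Berke", "X2_Ronald", "X2_Berke"], ["X01_president", "X3_LEFT"]),
    (["X2_Ronald"], ["X3_LEFT"]),  # made-up second instance
]
table1, table2 = build_feature_tables(instances)
```

With these tables, the instances touched by a selected feature can be looked up directly instead of scanning the whole training set.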
Experimental Evaluation (cont'd)
The CoBoost algorithm was run for 10000 rounds on
the training data. I recorded $g_1^T$, $g_2^T$, $g_1^T + g_2^T$, and the final
label (1 or -1) for each instance on the 3 labels, and
wrote an output file at rounds 10, 100, 1000, and 10000.
$\epsilon = 1/(2n)$, n=89305
The computation took roughly 20 hours on an Athlon
1.2GHz server (shared CPU)
Final Result
On round 10,
the accuracy, coverage, and agreement are
34%, 95%, and 5%
'Org' prediction: coverage 90.2%, agreement 89.6%
'Location' prediction: coverage 91.5%, agreement 52.5%
'Person' prediction: coverage 93.1%, agreement 42.1%
On rounds 100, 1000, and 10000, these
results remain unchanged.
The result is very different from the original result.
Conclusion and future work
An improved CoBoost algorithm is proposed, based
on the original CoBoost algorithm and a feature
selection strategy from the parse reranking problem.
The result is disappointing compared with the original one. Possible causes:
buggy implementation
improper assumptions
incomplete testing data
Based on the redundancy of features, CoBoost gives an
impressive result on the NER task. Its idea suggests that we
explore other problems with a similar property,
especially in semi-supervised and unsupervised
learning tasks.
References
1. Michael Collins. 2000. Discriminative Reranking for Natural
Language Parsing. In Proc. of the International Conference on Machine
Learning.
2. Steven Abney, Robert E. Schapire, and Yoram Singer. 1999. Boosting
Applied to Tagging and PP Attachment. In Proc. of the
Joint SIGDAT Conference on Empirical Methods in Natural
Language Processing and Very Large Corpora.
3. Yoav Freund and Robert E. Schapire. 1999. A Short Introduction to
Boosting. Journal of Japanese Society for Artificial
Intelligence, 14(5):771-780, September.
4. Robert E. Schapire and Yoram Singer. 1999. Improved Boosting
Algorithms Using Confidence-rated Predictions. Machine
Learning, 37(3):297-336.