Implementation of an Improved CoBoost Algorithm for NEC





CMPT 825

Project Presentation

Yudong Liu

12/9/2003

Preface

 CoBoost was proposed by Collins and Singer (1999) as an unsupervised framework for Named Entity Classification (NEC).

 It is based on the AdaBoost algorithm.

 It classifies names with over 91% accuracy after 10000 rounds.

 My work implements an improved CoBoost algorithm, motivated by a feature selection strategy for BoostLoss in the parse reranking task (Collins, 2000), and applies it to the original NER task. It turns out that efficiency can be improved considerably.

Agenda

 Problem description



 Algorithm description

 AdaBoost algorithm

 CoBoost algorithm

 More Observations

 Improved CoBoost algorithm



 Experiment Result



 Conclusion

Problem Description

 The task CoBoost was originally applied to is learning a function from an input string to its type, which is assumed to be one of three categories: Person, Organization, or Location.

 Given 89305 examples, the only supervision is in the form of 7 seed rules: New York, California and U.S. are locations; any name containing Mr. is a person; any name containing Incorporated is an organization; and I.B.M. and Microsoft are organizations. This yields 8932 labelled and 80370 unlabeled examples.

 An example: …, Mr. Doernberg, a professor of …

 Both a spelling feature (the string contains Mr.) and a contextual feature (professor modifies the string) are strong indications that Mr. Doernberg is of type Person.
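The seven seed rules can be sketched as a small lookup. This is only an illustration: the function name `seed_label`, the two dictionaries, and the whitespace tokenization used for the "contains" rules are assumptions, not part of the original implementation.

```python
from typing import Optional

# Full-string seed rules listed on the slide.
FULL_STRING_SEEDS = {
    "New York": "Location",
    "California": "Location",
    "U.S.": "Location",
    "I.B.M.": "Organization",
    "Microsoft": "Organization",
}

# "Contains" seed rules listed on the slide.
CONTAINS_SEEDS = {
    "Mr.": "Person",
    "Incorporated": "Organization",
}


def seed_label(name: str) -> Optional[str]:
    """Return the seed label for a name, or None if no seed rule fires."""
    if name in FULL_STRING_SEEDS:
        return FULL_STRING_SEEDS[name]
    # Assumption: a name "contains" a word if it appears as a whitespace token.
    for token in name.split():
        if token in CONTAINS_SEEDS:
            return CONTAINS_SEEDS[token]
    return None
```

Names for which `seed_label` returns `None` would form the unlabeled portion of the data.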

Problem Description (cont'd)

 Unlabeled data provides many such indications that two features should predict the same label.

 CoBoost takes advantage of this redundancy in the unlabeled data to build two classifiers in parallel.

 Accordingly, every example in the data set is represented as a (spelling, context) pair, and either the spelling or the context alone is assumed to be sufficient to classify an example.

Problem Description (cont'd)

 Features: full-string=x, contains(x), allcap1, allcap2, nonalpha=x, context=x.

 Example 44123:

X0_Ronald-Berke X01_president X2_Ronald X2_Berke X3_LEFT

 Classifier 1: X0_Ronald-Berke X2_Ronald X2_Berke

 Classifier 2: X01_president X3_LEFT

 So for CoBoost, the learning task is to find two classifiers f1 and f2 whose predictions agree on the labelled data and agree as often as possible on the unlabeled data.
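Splitting Example 44123 into the two views can be sketched as follows. The prefix convention (`X01_` and `X3_` mark context features, everything else is a spelling feature) is inferred from this single example and is an assumption; `split_views` is a hypothetical helper name.

```python
from typing import List, Tuple

# Assumption: context features carry these prefixes, as in Example 44123.
CONTEXT_PREFIXES = ("X01_", "X3_")


def split_views(features: List[str]) -> Tuple[List[str], List[str]]:
    """Split one instance's features into (spelling view, context view)."""
    spelling = [f for f in features if not f.startswith(CONTEXT_PREFIXES)]
    context = [f for f in features if f.startswith(CONTEXT_PREFIXES)]
    return spelling, context


spelling, context = split_views(
    "X0_Ronald-Berke X01_president X2_Ronald X2_Berke X3_LEFT".split()
)
```

Classifier 1 would then see only `spelling` and classifier 2 only `context`.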

Algorithm Description



 AdaBoost

 CoBoost Algorithm

 More Observations

 Improved CoBoost Algorithm

AdaBoost--Base of CoBoost

 Proposed by Freund and Schapire (1997)

 Aims to produce a more accurate prediction rule by iteratively calling a weak supervised learner

 Main idea:

 Maintain a distribution, or set of weights, over the training set.

 Initially the weights are equal. On each round, the weights of incorrectly classified examples are increased (and, accordingly, the weights of correctly classified examples are decreased), so the weight tends to concentrate on "hard" examples.

AdaBoost--Base of CoBoost

 The weak learner's job is to find a weak hypothesis $h_t$ appropriate for the current distribution over instances. The quality of a weak hypothesis is measured by its classification error.

 Once the weak hypothesis $h_t$ has been received, a parameter $\alpha_t$, which intuitively measures its importance, is assigned to $h_t$.

 The distribution is then updated to increase the weight of examples misclassified by $h_t$ and to decrease the weight of correctly classified examples.

 Finally, output the combined hypothesis $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
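One round of the reweighting described above can be sketched as below, for the standard binary case with labels and predictions in {-1, +1} and no abstention. The function name `adaboost_round` and the toy data are illustrative only.

```python
import math
from typing import List, Tuple


def adaboost_round(
    weights: List[float], labels: List[int], preds: List[int]
) -> Tuple[float, List[float]]:
    """Compute alpha_t and the renormalized distribution for one round."""
    # Weighted error of the weak hypothesis under the current distribution.
    err = sum(w for w, y, h in zip(weights, labels, preds) if y != h)
    # Importance of the weak hypothesis (assumes 0 < err < 1).
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified examples (y * h = -1) are scaled up, correct ones down.
    new = [w * math.exp(-alpha * y * h) for w, y, h in zip(weights, labels, preds)]
    z = sum(new)
    return alpha, [w / z for w in new]


# Four equally weighted examples; the weak hypothesis gets the last one wrong.
alpha, weights = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
```

After the update, the single misclassified example carries half of the total weight, which is the "concentrate on hard examples" effect.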

CoBoost Algorithm (1)

 New target function----

“error bound”
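As a sketch of this objective in the notation of Collins and Singer (1999): with $m$ labelled examples out of $n$, unthresholded outputs $g^1, g^2$ on the spelling and context views, and $f^j = \mathrm{sign}(g^j)$, the bound combines an exponential loss on the labelled data with a disagreement term on the unlabeled data. The index conventions used here are assumptions and may differ from the notation on the slide.

```latex
\begin{align*}
Z_{CO} &= \sum_{i=1}^{m} \exp\bigl(-y_i\, g^1(x_{1,i})\bigr)
        + \sum_{i=1}^{m} \exp\bigl(-y_i\, g^2(x_{2,i})\bigr) \\
       &\quad + \sum_{i=m+1}^{n} \exp\bigl(-f^2(x_{2,i})\, g^1(x_{1,i})\bigr)
        + \sum_{i=m+1}^{n} \exp\bigl(-f^1(x_{1,i})\, g^2(x_{2,i})\bigr) \\
Z_{CO} &= Z_{CO}^1 + Z_{CO}^2
\end{align*}
```

Here $Z_{CO}^j$ collects the two sums in which $g^j$ appears in the exponent, so each classifier can be updated while the other is held fixed.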



 Decompose:







 Define:

CoBoost Algorithm(2)







 Some other definitions:

 Pseudo-label :









 For a hypothesis:

CoBoost Algorithm(3)





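 According to the derivation of Schapire and Singer (1999), with $W_+ > W_-$ the equation above is minimized by setting ($\epsilon$ is for smoothing):

$\alpha_t = \frac{1}{2} \ln\left(\frac{W_+ + \epsilon}{W_- + \epsilon}\right)$

Accordingly, $Z_{co} = W_0 + 2\sqrt{W_+ W_-}$, that is, $Z = 1 - (\sqrt{W_+} - \sqrt{W_-})^2$ when the weights are normalized so that $W_0 + W_+ + W_- = 1$,

where $W_+$, $W_-$ and $W_0$ are the total weights of the instances on which the hypothesis agrees with the pseudo-label, disagrees with it, and abstains, respectively.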
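As a sketch of the quantities used in this step, the following computes $W_+$, $W_-$, $W_0$, the smoothed $\alpha$ and the resulting $Z$ for one candidate feature. The function name `score_feature` and its data layout (a dict from instance index to the feature's prediction, with absent indices meaning "abstain") are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple


def score_feature(
    weights: List[float],  # normalized instance weights
    pseudo: List[int],     # pseudo-labels in {-1, 0, +1}
    pred: Dict[int, int],  # instance index -> prediction in {-1, +1}
    eps: float,            # smoothing constant
) -> Tuple[float, float]:
    """Return (alpha, Z) for a weak hypothesis that abstains outside `pred`."""
    # Weight on which the prediction agrees / disagrees with the pseudo-label.
    w_plus = sum(weights[i] for i, h in pred.items() if h == pseudo[i])
    w_minus = sum(weights[i] for i, h in pred.items() if h == -pseudo[i])
    # Everything else (abstentions and pseudo-label 0) counts as W0.
    w_zero = sum(weights) - w_plus - w_minus
    alpha = 0.5 * math.log((w_plus + eps) / (w_minus + eps))
    z = w_zero + 2.0 * math.sqrt(w_plus * w_minus)
    return alpha, z


alpha, z = score_feature([0.25] * 4, [1, 1, -1, 0], {0: 1, 1: -1, 2: -1}, 0.0)
```

The best feature on a round is the one with the smallest `z`, equivalently the largest gap between $\sqrt{W_+}$ and $\sqrt{W_-}$.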

More Observations(1)

 Instances whose pseudo-label is zero do not contribute to the target function: when $\tilde{y}_i = 0$, the term $\exp(-\tilde{y}_i g(x_i)) = 1$ is a constant.

 So in this case the weak learner is free to pick labels for these instances, which could speed up the boosting process.

More Observations(2)

 Suppose that on round t, feature k* is selected.

 Then on round t+1:

---in AdaBoost for supervised learning, the weights of the instances on which k* abstains are not updated:

$D_{t+1}(i) \propto D_t(i) \exp(-\alpha_t y_i h_t(x_i))$

---but in CoBoost, we have to check the pseudo-label of every instance, even those on which k* abstains:

$D_t^j(i) \propto \exp(-\tilde{y}_i g_{t-2}^j - \tilde{y}_i \alpha_{t-1} h_{t-1}^j)$

More Observations(3)

 In each round, W+ and W- have to be recalculated for each feature. This computation involves about 70000 features and 90000 instances, which is expensive.

 Actually, only a small number of features need to be considered:

---the features co-occurring with k*

---the features whose instances have had their pseudo-labels changed

e.g.: X2_Mr. 3 56 450 ………

 Due to the sparseness of the feature space, the number of features taken into account (about 3 to 10000) is much smaller than the total number of features, which is about 70000.
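The set of features that need rescoring can be sketched with two lookup tables: feature to instances, and instance to features. The names `affected_features`, `feat_to_inst` and `inst_to_feat`, and the toy data, are hypothetical.

```python
from typing import Dict, List, Set


def affected_features(
    k_star: str,
    changed: Set[int],
    feat_to_inst: Dict[str, List[int]],
    inst_to_feat: Dict[int, List[str]],
) -> Set[str]:
    """Features co-occurring with k*, plus features of re-labelled instances."""
    # Instances containing k*, and instances whose pseudo-label changed.
    touched = set(feat_to_inst.get(k_star, [])) | changed
    # Every feature appearing on any touched instance must be rescored.
    return {k for i in touched for k in inst_to_feat.get(i, [])}


feat_to_inst = {
    "X2_Mr.": [3, 56],
    "X0_Mr.-Smith": [3],
    "X2_Smith": [3, 7],
    "X2_Inc": [9],
}
inst_to_feat = {
    3: ["X0_Mr.-Smith", "X2_Mr.", "X2_Smith"],
    56: ["X2_Mr."],
    7: ["X2_Smith"],
    9: ["X2_Inc"],
}
todo = affected_features("X2_Mr.", {7}, feat_to_inst, inst_to_feat)
```

Features such as `X2_Inc`, which share no instance with k* or with a re-labelled instance, keep their previous W+ and W-.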

Improved CoBoost Algorithm(1)

 Define some new functions:

1. Sign of k

For any feature k:

if the feature occurs in labelled data:

sign(k)=1 if more of the instances containing k are labelled 1

sign(k)=-1 otherwise

if the feature occurs only in unlabeled data:

sign(k)=-1 if the instance contains k

sign(k)=1 otherwise

Improved CoBoost Algorithm(2)

2. $A_k^+$, $A_k^-$, $B_i^+$, $B_i^-$

 $A_k^+$ = {x: x contains k and k's prediction agrees with the pseudo-label of x}

 $A_k^-$ = {x: x contains k and k's prediction is opposite to the pseudo-label of x}

 $B_i^+$ = {k: k's prediction on instance i agrees with the pseudo-label of i}

 $B_i^-$ = {k: k's prediction on instance i is opposite to the pseudo-label of i}

3. wt for each instance i

 weight(x_i) = exp(-wt(i))

Improved CoBoost Algorithm(3)

 For t=1 and j=1,2

--set pseudo-labels

--find the best hypothesis, i.e. the one with maximum $|\sqrt{W_+} - \sqrt{W_-}|$

--freely pick labels for instances with pseudo-label 0

--update g

 For t=2 and j=1,2

--set pseudo-labels

--update the weights of the instances whose pseudo-label changed

--initialize $A_k^+$, $A_k^-$, $B_i^+$, $B_i^-$

--find the best hypothesis the naive way

--update g

Improved CoBoost Algorithm(4)

 For t=3 to T and j=1,2

--set pseudo-labels

--update $A_k^+$, $A_k^-$, $B_i^+$, $B_i^-$

--for instances in $A_k^+$ and $A_k^-$:

.calculate the delta of their weights

.update W+ and W- for the corresponding features in $B_i^+$ and $B_i^-$

--for instances whose weights changed but which are not in $A_k^+$ or $A_k^-$, apply the same process as above

--for the features whose W+ and W- changed, reconsider them for minimizing $Z_{co}$

--update g

 Final hypothesis: $f(x) = \mathrm{sign}\left(\sum_{j=1}^{2} g_T^j(x_j)\right)$
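The "calculate delta, then update W+ and W-" step can be sketched as follows: when one instance's weight changes by `delta`, only the features in that instance's two B sets are touched. The function name `apply_weight_delta` and the dict-based storage are assumptions for illustration.

```python
from typing import Dict, Set


def apply_weight_delta(
    delta: float,
    b_plus: Set[str],            # features agreeing with this instance's pseudo-label
    b_minus: Set[str],           # features disagreeing with it
    w_plus: Dict[str, float],    # per-feature W+
    w_minus: Dict[str, float],   # per-feature W-
) -> Set[str]:
    """Propagate one instance's weight change; return the features to rescore."""
    for k in b_plus:
        w_plus[k] = w_plus.get(k, 0.0) + delta
    for k in b_minus:
        w_minus[k] = w_minus.get(k, 0.0) + delta
    return b_plus | b_minus


w_plus = {"a": 0.5}
w_minus = {"b": 0.25}
dirty = apply_weight_delta(0.25, {"a"}, {"b", "c"}, w_plus, w_minus)
```

Only the features returned in `dirty` need to be reconsidered when searching for the next best hypothesis.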

Multi-Class Problem



 Three candidate labels: loc, org and per

 Run CoBoost over the 3 labels in turn (an innermost loop)

 For the final label of a particular instance, if there is a disagreement, e.g. the signs for "Person" and "Location" are both positive, we pick the one with the larger "confidence" (abs(g)) as the final label.
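The tie-breaking rule can be sketched as below. Here `scores` is assumed to map each label to the combined output g for that label's binary run; returning `None` when no label is positive is an assumption, since the slides do not say what happens in that case.

```python
from typing import Dict, Optional


def final_label(scores: Dict[str, float]) -> Optional[str]:
    """Pick the positively predicted label with the largest abs(g)."""
    positive = {lab: g for lab, g in scores.items() if g > 0}
    if not positive:
        # Assumption: no label is assigned when every binary run says "no".
        return None
    return max(positive, key=lambda lab: abs(positive[lab]))


label = final_label({"per": 1.2, "loc": 0.4, "org": -2.0})
```

In this toy case both "per" and "loc" are positive, and "per" wins on confidence.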

Experimental Evaluation

 Training data set (5.4 M): 89305 examples, of which 8932 are labelled

 Testing data set: 431 examples randomly picked from the training data set

 Features involved: 56649 for classifier 1, 11824 for classifier 2

 In pre-processing, I build a hash table in which each key is a feature and the corresponding value is the list of instances that contain that feature in their representation. One such hash table is kept for each of the two classifiers.
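Such a feature-to-instances table can be sketched in a few lines. The function name `build_index` and the toy instances are illustrative; in practice one table would be built from the spelling features and another from the context features.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(instances: List[List[str]]) -> Dict[str, List[int]]:
    """Map each feature to the sorted list of instance ids containing it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i, feats in enumerate(instances):
        # set() guards against a feature listed twice on the same instance.
        for f in set(feats):
            index[f].append(i)
    return dict(index)


index = build_index([["X2_Mr.", "X2_Smith"], ["X2_Mr."]])
```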

Experimental Evaluation (cont'd)

 The CoBoost algorithm was run for 10000 rounds on the training data. For each instance and each of the 3 labels, I record $g_T^1$, $g_T^2$, $g_T^1 + g_T^2$ and the final label (1 or –1), and write an output file at rounds 10, 100, 1000 and 10000.

 $\epsilon$ = 1/2n, n=89305

 The computation took roughly 20 hours on an Athlon 1.2GHz server (shared CPU)

Final Result

 On round 10,

 the accuracy, the coverage and the agreement are 34%, 95% and 5%

 'Org' prediction: coverage 90.2%, agreement 89.6%

 'Location' prediction: coverage 91.5%, agreement 52.5%

 'Person' prediction: coverage 93.1%, agreement 42.1%

 On rounds 100, 1000 and 10000, these results remain unchanged.

 The result is far from the original result.

Conclusion and future work

 An improved CoBoost algorithm is proposed, based on the original CoBoost algorithm and a feature selection strategy from the parse reranking problem.

 The result is disappointing compared with the original one. Possible causes:

 buggy implementation

 improper assumptions

 incomplete testing data

 By exploiting the redundancy of features, CoBoost gives an impressive result on the NER task. Its idea suggests exploring other problems with a similar property, especially in semi-supervised and unsupervised learning tasks.

References

1. Michael Collins. 2000. Discriminative Reranking for Natural Language Parsing. In International Conference on Machine Learning.

2. Steven Abney, Robert E. Schapire and Yoram Singer. Boosting Applied to Tagging and PP Attachment. In Proc. of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.

3. Yoav Freund and Robert E. Schapire. 1999. A Short Introduction to Boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September.

4. Robert E. Schapire and Yoram Singer. 1999. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37(3):297-336.

