
Adaboost for building robust classifiers

KH Wong

Overview

   Objective of AdaBoost
   2-class problems
   Training
   Detection
   Examples

Objective

   Automatically classify inputs into different categories of similar features
   Examples
   Face detection: find the faces in the input image
   Vision-based gesture recognition [Chen 2007]

Different detection problems

   Two-class problems (will be discussed here)
   E.g. face detection: in a picture, are there any faces or no faces?
   Multi-class problems (not discussed here)
   AdaBoost can be extended to handle multi-class problems
   E.g. in a picture, are there any faces of men, women, children, etc.? (Still an unsolved problem)

Define a 2-class classifier: its method and procedures

   Supervised training
   Show many positive samples (faces) to the system
   Show many negative samples (non-faces) to the system
   Learn the parameters and construct the final strong classifier
   Detection
   Given an unknown input image, the system can tell whether positive samples (faces) are present or not

We will learn

   Training procedures
   Give +ve and -ve examples to the system; the system will then learn to classify an unknown input.
   E.g. give pictures of faces (+ve examples) and non-faces (-ve examples) to train the system.
   Detection procedures
   Input an unknown sample (e.g. an image); the system will tell you whether it is a face or not.

First, let us learn what a weak classifier h( ) is

Case 1: If a point x = (u, v) is in the "gray" area (below the line v = mu + c), then h(x) = 1; otherwise h(x) = -1. It can be written as:

h(x) = \begin{cases} 1 & \text{if } mu - v > -c \\ -1 & \text{otherwise} \end{cases}

where m, c are given constants.

Case 2: If a point x = (u, v) is in the "white" area (above the line), then h(x) = 1; otherwise h(x) = -1. It can be written as:

h(x) = \begin{cases} 1 & \text{if } -(mu - v) > c \\ -1 & \text{otherwise} \end{cases}

At stage t, combine cases 1 and 2 together into equation (i), and use the polarity p_t to control which case you want to use:

h_t(x) = \begin{cases} 1 & \text{if } p_t f_t(x) > p_t \theta_t \\ -1 & \text{otherwise} \end{cases} \quad (i)

where p_t = polarity ∈ {+1, -1}, f is the function f(x = [u, v]) = mu - v, and \theta_t = -c; m, c are constants and u, v are variables. Since p_t f_t(x) > p_t \theta_t is p_t(mu - v) > -p_t c, equation (i) becomes:

h_t(x) = \begin{cases} 1 & \text{if } p_t(mu - v) > -p_t c \\ -1 & \text{otherwise} \end{cases} \quad (ib)

[Figure: the line v = mu + c in the (u, v) plane, with gradient m and intercept c. m, c define the line; any point in the gray area satisfies v < mu + c; any point in the white area satisfies v > mu + c.]
The weak classifier (a summary)

   By definition a weak classifier should be slightly better than a random choice (probability = 0.5); otherwise you might as well toss a coin!
   In (u, v) space, the function f is a straight line defined by m and c.

[Figure: the line v = mu + c with gradient m and intercept c; points with v > mu + c lie above the line, points with v < mu + c below.]

Learn what h( ), a weak classifier, is: the decision stump

   Decision stump definition
   A decision stump is a machine learning model consisting of a one-level decision tree. That is, it is a decision tree with one internal node (the root) which is immediately connected to the terminal nodes. A decision stump makes a prediction based on the value of just a single input feature. Sometimes they are also called 1-rules.
   From http://en.wikipedia.org/wiki/Decision_stump

   Example: temperature T
   T <= 10°C → cold; 10°C < T < 28°C → mild; T >= 28°C → hot
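As a concrete sketch, the temperature rule above can be coded as a one-level tree in a few lines of MATLAB (the function name and layout are ours, for illustration only):

    function label = temperature_stump(T)
    % Decision stump: predicts from the single input feature T (degrees C),
    % using the two thresholds from the example above.
    if T <= 10
        label = 'cold';
    elseif T < 28
        label = 'mild';
    else                 % T >= 28
        label = 'hot';
    end
    end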

A weak learner (classifier) is a decision stump

   Define weak learners based on rectangle features
   The function of a decision line in space:

h_t(x) = \begin{cases} 1 & \text{if } p_t f_t(x) > p_t \theta_t \\ -1 & \text{otherwise} \end{cases}

where \theta_t is the threshold and p_t = polarity ∈ {+1, -1} selects which side separated by the decision line you prefer.

[Figure: a decision line splitting the window; the polarity picks the positive side.]

Weak classifier we use here: the axis-parallel weak classifier

   We will use a special type: the axis-parallel weak classifier (see the sketch after this slide).
   The decision line is parallel to either the horizontal or the vertical axis.

For a horizontal decision line at height v_0, use v_0 = \theta_t = threshold and f_t(x = (u, v)) = v:

h_t(x = (u, v)) = \begin{cases} 1 & \text{if } p_t v > p_t v_0 \\ -1 & \text{otherwise} \end{cases}

If polarity p_t = 1, the region v > v_0 is +1 and the region v < v_0 is -1; if p_t = -1, the labels are reversed.
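In code, an axis-parallel weak classifier is a single comparison. A minimal MATLAB sketch (our own helper, not from any toolbox): dim picks the coordinate being thresholded (1 for u, a vertical decision line; 2 for v, a horizontal one), p is the polarity and theta is the threshold (u0 or v0):

    function h = axis_parallel_h(x, dim, p, theta)
    % Axis-parallel weak classifier h_t(x) for a sample x = [u v].
    if p * x(dim) > p * theta
        h = +1;
    else
        h = -1;
    end
    end

For example, axis_parallel_h([0.5 2.0], 2, 1, 1.0) returns +1, because v = 2.0 lies above the horizontal line v0 = 1.0 and the polarity is +1.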

An example to show how AdaBoost works

   Training:
   Present ten samples to the system: [x_i = {u_i, v_i}, y_i = {'+' or '-'}]
   5 +ve (blue, diamond) samples
   5 -ve (red, circle) samples
   Train up the system
   Detection:
   Give an input x_j = (1.5, 3.4); the system will tell you whether it is '+' or '-', e.g. face or non-face
   Example:
   u = weight, v = height
   Classification: suitability to play in the basketball team

[Figure: the ten training samples in the (u, v) plane; e.g. the points x_i = {-0.48, 0} and x_i = {-0.2, -0.5} are labelled y_i = '+'.]

Adaboost concept

   Training data: 6 squares, 5 circles.
   Objective: train a classifier to classify an unknown input, i.e. to tell whether it is a circle or a square.
   Using this training data, how do we make a classifier?
   One axis-parallel weak classifier cannot achieve 100% classification: h1( ), h2( ), h3( ) all fail. You may try it yourself!
   The solution is a complex strong classifier H_complex( ). Such a strong classifier should work, but how can we find it?
   ANSWER: combine many weak classifiers to achieve it.

How?

   Pick weak classifiers h1( ), h2( ), ..., h7( ), one per stage, and give each one a classification-result weight \alpha_i, i = 1, 2, ..., 7.
   Combine them to form the final strong classifier:

H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)

The AdaBoost algorithm

Given: (x_1, y_1), ..., (x_n, y_n), where x_i ∈ X and y_i ∈ Y = {-1, +1}.

Initialization: set the distribution D_{t=1}(i) = 1/n, with n = M + L, where M = number of positive (+1) examples and L = number of negative (-1) examples.

Main training loop: For t = 1, ..., T {

Step 1a: Find the classifier h_t : X → {-1, +1} that minimizes the error with respect to D_t, that means: h_t = \arg\min_q \varepsilon_q.

Step 1b: error \varepsilon_t = \sum_{i=1}^{n} D_t(i)\, I(h_t(x_i) \ne y_i), where I(h_t(x_i) \ne y_i) = \begin{cases} 1 & \text{if } h_t(x_i) \ne y_i \\ 0 & \text{otherwise} \end{cases}
Checking step, prerequisite: \varepsilon_t < 0.5, otherwise stop.

Step 2: \alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}; \alpha_t is the weight (or confidence value) of h_t.

Step 3: D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t} (see a later slide for an explanation of Z_t).

Step 4: Compute the current total cascaded classifier error CE_t = \frac{1}{n}\sum_{i=1}^{n} I(t, \alpha, h(x_i)), where I( ) is defined as follows:
If x_i is correctly classified by the current cascaded classifier, i.e. y_i = \mathrm{sign}\left(\sum_{\tau=1}^{t} \alpha_\tau h_\tau(x_i)\right), then I(t, \alpha, h(x_i)) = 0.
If x_i is incorrectly classified by the current cascaded classifier, i.e. y_i \ne \mathrm{sign}\left(\sum_{\tau=1}^{t} \alpha_\tau h_\tau(x_i)\right), then I(t, \alpha, h(x_i)) = 1.
If CE_t = 0 then set T = t and break;
}

The output of the cascade at stage t is o_t(x_i) = \sum_{\tau=1}^{t} \alpha_\tau h_\tau(x_i), and S(t, \alpha, h(x_i)) = \begin{cases} 1 & \text{if } y_i \ne \mathrm{sign}(o_t(x_i)) \\ 0 & \text{otherwise} \end{cases}

The final strong classifier is H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right).
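The loop above can be condensed into a short MATLAB sketch built on axis-parallel decision stumps. This is our own minimal illustration (names such as adaboost_train are not from any toolbox); the step 1a search simply tries every sample coordinate as a candidate threshold:

    function [alpha, stump] = adaboost_train(X, y, Tmax)
    % X: n-by-2 samples [u v]; y: n-by-1 labels in {-1,+1}; Tmax: max stages.
    % Returns stage weights alpha and stumps [dimension polarity threshold].
    n = size(X, 1);
    D = ones(n, 1) / n;                        % initialization: D_1(i) = 1/n
    alpha = []; stump = [];
    for t = 1:Tmax
        best = struct('err', inf);             % Step 1a: scan all stumps
        for dim = 1:2
            for p = [-1 1]
                for theta = unique(X(:, dim))'
                    h = -ones(n, 1);
                    h(p * X(:, dim) > p * theta) = 1;
                    err = sum(D(h ~= y));      % weighted error w.r.t. D_t
                    if err < best.err
                        best = struct('err', err, 'dim', dim, 'p', p, ...
                                      'theta', theta, 'h', h);
                    end
                end
            end
        end
        if best.err >= 0.5, break; end         % Step 1b: prerequisite check
        a = 0.5 * log((1 - best.err) / max(best.err, eps));   % Step 2
        D = D .* exp(-a * y .* best.h);        % Step 3: re-weight samples
        D = D / sum(D);                        % divide by Z_t (normalize)
        alpha(end+1) = a;
        stump(end+1, :) = [best.dim, best.p, best.theta];
        o = zeros(n, 1);                       % Step 4: cascade output o_t
        for j = 1:t
            hj = -ones(n, 1);
            hj(stump(j,2) * X(:, stump(j,1)) > stump(j,2) * stump(j,3)) = 1;
            o = o + alpha(j) * hj;
        end
        if all(sign(o) == y), break; end       % CE_t = 0: stop, T = t
    end
    end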
Note: the normalization factor Z_t in step 3

AdaBoost chooses this weight update function deliberately:

D_{t+1}(i) \propto D_t(i)\exp(-\alpha_t y_i h_t(x_i))

Because:
   when a training sample is correctly classified, its weight decreases;
   when a training sample is incorrectly classified, its weight increases.

Recall step 3: D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t}, where Z_t is the normalization factor that makes D_{t+1} a probability distribution:

Z_t = \sum_{\text{correctly classified}} D_t(i)\,e^{-\alpha_t y_i h_t(x_i)} + \sum_{\text{incorrectly classified}} D_t(i)\,e^{-\alpha_t y_i h_t(x_i)} = \sum_{y_i = h_t(x_i)} D_t(i)\,e^{-\alpha_t} + \sum_{y_i \ne h_t(x_i)} D_t(i)\,e^{+\alpha_t}

(using y_i h_t(x_i) = +1 for correct and -1 for incorrect samples).

Note: stopping criterion of the main loop

   The main loop stops when all training data are correctly classified by the cascaded classifier up to stage t.

Step 4: CE_t = \frac{1}{n}\sum_{i=1}^{n} I(t, \alpha, h(x_i)), where I( ) is defined as follows: I(t, \alpha, h(x_i)) = 0 if x_i is correctly classified by the current cascaded classifier, i.e. y_i = \mathrm{sign}\left(\sum_{\tau=1}^{t} \alpha_\tau h_\tau(x_i)\right), and I(t, \alpha, h(x_i)) = 1 otherwise. If CE_t = 0, then T = t and the loop breaks.

D_t(i) = weight

   D_t(i) = probability distribution of the i-th training sample at time t, i = 1, 2, ..., n.
   It shows how much you trust this sample.
   At t = 1, all samples carry the same, equal weight: D_{t=1}(i) is the same for all i.
   At t > 1, D_t(i) will be modified, as we will see later.

Initialization

Given: (x_1, y_1), ..., (x_n, y_n), where x_i ∈ X, y_i ∈ Y = {-1, +1}.
Initialize D_{t=1}(i) = 1/n, with n = M + L; M = number of positive examples, L = number of negative examples.

   M = 5 +ve (blue, diamond) samples
   L = 5 -ve (red, circle) samples
   n = M + L = 10
   Initialize weight D_{t=1}(i) = 1/10 for all i = 1, 2, ..., 10
   So D_1(1) = 0.1, D_1(2) = 0.1, ..., D_1(10) = 0.1

Main training loop

Step 1a, 1b

Select h( ): for simplicity of implementation we use the axis-parallel weak classifier

Recall:

h_t(x) = \begin{cases} 1 & \text{if } p_t f_t(x) > p_t \theta_t \\ -1 & \text{otherwise} \end{cases} \quad (i)

where p_t = polarity ∈ {+1, -1} and \theta_t is the threshold; f is a line in the (u, v) plane (m, c are constants; u, v are variables).

Axis-parallel weak classifier:
   either f is a line of gradient m = 0 (a horizontal line, h_b in the figure), whose position is controlled by v_0 (\theta_t = v_0);
   or f is a line of gradient m = ∞ (a vertical line, h_a in the figure), whose position is controlled by u_0.
Step 1a, 1b

Step 1a: Find the classifier h_t : X → {-1, +1} that minimizes the error with respect to D_t; that means h_t = \arg\min_q \varepsilon_q.
Step 1b: Checking step, prerequisite: \varepsilon_t < 0.5, otherwise stop.

Incorrectly classified by h_q( )

   Assume h( ) can only be a horizontal or a vertical separator (axis-parallel weak classifier).
   There are still many ways to set h( ); here, if this h_q( ) is selected, there will be 3 incorrectly classified training samples.
   See the 3 circled training samples in the figure.
   We can go through all h( )s and select the best one, i.e. the one with the least misclassification (see the following 2 slides).

Example: training example slides from [Smyth 2007]; classify the ten red (circle) / blue (diamond) dots

Step 1a: There are 9x2 choices of vertical separator here: h_{i=1,2,...,9} (polarity +1) and h'_{i=1,2,...,9} (polarity -1). You may choose one of the axis-parallel (vertical-line) classifiers h_{i=1}(x), ..., h_{i=9}(x) at positions u_1, u_2, ..., u_9 (the vertical dotted lines in the figure are the possible choices):

h_i(x) = \begin{cases} 1 & \text{if } pu > pu_i \\ -1 & \text{otherwise} \end{cases}

where x = (u, v); v is not used because h_i(x) is parallel to the vertical axis; polarity p ∈ {+1, -1}.

Initialize: D(t=1) = 1/10 for each sample.
Example: training example slides from [Smyth 2007]; classify the ten red (circle) / blue (diamond) dots

Step 1a: Similarly, there are 9x2 choices of horizontal separator here: h_{j=1,2,...,9} (polarity +1) and h'_{j=1,2,...,9} (polarity -1). Altogether, including the last slide, there are 36 choices. You may choose one of the axis-parallel (horizontal-line) classifiers h_{j=1}(x), ..., h_{j=9}(x) at positions v_1, v_2, ..., v_9 (the horizontal dotted lines in the figure are the possible choices):

h_j(x) = \begin{cases} 1 & \text{if } pv > pv_j \\ -1 & \text{otherwise} \end{cases}

where x = (u, v); u is not used because h_j(x) is parallel to the horizontal axis; polarity p ∈ {+1, -1}.

Initialize: D(t=1) = 1/10 for each sample.
Step 1b: find and check the error of the weak classifier h( )

   To evaluate how successful your selected weak classifier h( ) is, we evaluate its weighted error rate:

\varepsilon_t = \sum_{i=1}^{n} D_t(i)\, I(h_t(x_i) \ne y_i), \quad\text{where } I(h_t(x_i) \ne y_i) = \begin{cases} 1 & \text{if } h_t(x_i) \ne y_i \\ 0 & \text{otherwise} \end{cases}

Step 1b checking step, prerequisite: \varepsilon_t < 0.5, otherwise stop.

   \varepsilon_t = misclassification probability of h( ).
   Checking: if \varepsilon_t >= 0.5 (something is wrong), stop the training.
   Because, by definition, a weak classifier should be slightly better than a random choice (probability = 0.5).
   So if \varepsilon_t >= 0.5, your h( ) is a bad choice; design another h( ) and redo the training based on the new h( ).
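In code, this check takes two lines. A sketch, assuming column vectors h (the ±1 predictions of the candidate weak classifier), y (the true labels) and D (the current sample weights):

    eps_t = sum(D(h ~= y));    % weighted misclassification rate of h( )
    if eps_t >= 0.5, error('h( ) is no better than chance; redesign h( )'); end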

Exercise 0 for steps 1a, 1b

Step 1a: Find the classifier h_t : X → {-1, +1} that minimizes the error with respect to D_t.
Step 1b: Checking step, prerequisite: \varepsilon_t < 0.5, otherwise stop.

   Assume h( ) can only be a horizontal or a vertical separator.
   How many different classifiers are available?
   If the h_j( ) shown in the figure is selected, circle the misclassified training samples. Find \varepsilon( ) to see the misclassification probability if the probability distribution D is the same for each sample.
   Find the h( ) with minimum error.

Result of step 2 at t = 1

[Figure: the selected weak classifier h_{t=1}(x); the 3 training samples incorrectly classified by h_{t=1}(x) are circled.]
Step 2 at t = 1 (refer to the previous slide)

Using \varepsilon_{t=1} = 0.3, because 3 samples are incorrectly classified:

\varepsilon_t = \sum_{i=1}^{n} D_t(i)\, I(h_t(x_i) \ne y_i), \quad \varepsilon_{t=1} = 0.1 + 0.1 + 0.1 = 0.3

Step 2: \alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}, where \varepsilon_t is the weighted error rate of classifier h_t, so

\alpha_{t=1} = \frac{1}{2}\ln\frac{1-0.3}{0.3} \approx 0.424

The proof can be found at http://vision.ucsd.edu/~bbabenko/data/boosting_note.pdf

Step 3 at t = 1: update D_t to D_{t+1}

Step 3: D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t}, where Z_t is the normalization factor that makes D_{t+1} a probability distribution.

The proof can be found at http://vision.ucsd.edu/~bbabenko/data/boosting_note.pdf

Step 3: first find Z (the normalization factor). Note that D_{t=1}(i) = 0.1 and \alpha_{t=1} = 0.424.

There are 7 correct and 3 incorrect samples, so

Z_t = \sum_{y_i = h_t(x_i)} D_t(i)\,e^{-\alpha_t y_i h_t(x_i)} + \sum_{y_i \ne h_t(x_i)} D_t(i)\,e^{-\alpha_t y_i h_t(x_i)} \quad (i)

Correctly classified: y_i = h_t(x_i), so y_i h_t(x_i) = +1; incorrectly classified: y_i \ne h_t(x_i), so y_i h_t(x_i) = -1. Put these in (i):

Z_t = \sum_{y_i = h_t(x_i)} D_t(i)\,e^{-\alpha_t} + \sum_{y_i \ne h_t(x_i)} D_t(i)\,e^{+\alpha_t}
= (total correct weight) e^{-\alpha_t} + (total incorrect weight) e^{+\alpha_t}
= 0.1 × 7 × 0.655 + 0.1 × 3 × 1.528
≈ 0.458 + 0.458

Z_{t=1} ≈ 0.917
Step 3 example: update D_t to D_{t+1}

For a correctly classified sample:

D_{t+1}(i)_{\text{correct}} = \frac{D_t(i)}{Z_t}e^{-\alpha_t} = \frac{0.1}{0.917}e^{-0.424} = \frac{0.1 \times 0.655}{0.917} \approx 0.0714 \quad (\text{weight decreases})

For an incorrectly classified sample:

D_{t+1}(i)_{\text{incorrect}} = \frac{D_t(i)}{Z_t}e^{+\alpha_t} = \frac{0.1 \times 1.528}{0.917} \approx 0.167 \quad (\text{weight increases})
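These step 2 and step 3 numbers can be verified in a few lines of MATLAB (small differences from the slide values are rounding):

    eps1  = 0.3;                             % 3 of 10 samples wrong, D = 0.1 each
    a1    = 0.5 * log((1 - eps1) / eps1)     % = 0.4236, rounded to 0.424
    Z1    = 7*0.1*exp(-a1) + 3*0.1*exp(a1)   % = 0.9165
    Dcorr = 0.1 * exp(-a1) / Z1              % = 0.0714, weight decreased
    Dincr = 0.1 * exp( a1) / Z1              % = 0.1667, weight increased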

Now run the main training loop a second time (t = 2), with the updated weights:

D_{t+1}(i)_{\text{correct}} \approx 0.0714, \quad D_{t+1}(i)_{\text{incorrect}} \approx 0.167

Final classifier

[Figure: the three selected weak classifiers h1( ), h2( ), h3( ) with weights \alpha_1, \alpha_2, \alpha_3.]

Combine them to form the final classifier:

H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) = \mathrm{sign}\big(0.424\,h_1(x) + \alpha_{t=2}h_2(x) + \alpha_{t=3}h_3(x)\big)

Exercise: work out \alpha_{t=2} and \alpha_{t=3}.

Now run the main training loop second time t=2



Final classifier by
combining three weak
classifiers

Exercise 1

    if example == 1
        blue = [ -26  38;        % blue '*' samples (class -1)
                   3  34;
                  32   3;
                  42  10];
        red  = [  23  38;        % red 'O' samples (class +1)
                  -4 -33;
                 -22 -25;
                 -37 -31];
        datafeatures = [blue; red];
        dataclass    = [-1 -1 -1 -1  1  1  1  1];
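With the adaboost_train sketch given after the algorithm slide, this data can be fed in directly. Note that a scan-based learner may report thresholds and polarity conventions differently from the [Kroon 2010] toolbox output shown on the following slides, even when it picks an equivalent split:

    datafeatures = [blue; red];
    dataclass    = [-1 -1 -1 -1  1  1  1  1]';   % blue '*' = -1, red 'O' = +1
    [alphas, stumps] = adaboost_train(datafeatures, dataclass, 3);
    % Stage 1 should be a horizontal split with alphas(1) = 0.973 (next slides).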

Exercise 1, initialized, t = 1



Exercise 1, t = 1

Step 1: weak classifier h1 (upper half = '*', lower = 'O'). We see that feature (5) is wrongly classified; 1 sample is wrong.

Recall step 1b: \varepsilon_t = \sum_{i=1}^{n} D_t(i)\, I(h_t(x_i) \ne y_i); prerequisite \varepsilon_t < 0.5, otherwise stop. Step 2: \alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}.

   err = \varepsilon(t) = D(t) × 1 = 0.125
   \alpha = 0.5 × ln[(1 - \varepsilon(t))/\varepsilon(t)] = 0.973
   Find the next D(t+1) = D(t) × exp(∓\alpha):
   Incorrect: D_{t+1}(i) = D_t(i) × exp(+\alpha), so D(5) = 0.125 × exp(0.973) = 0.3307 (not normalized yet)
   Correct: D_{t+1}(i) = D_t(i) × exp(-\alpha), so D(1) = 0.125 × exp(-0.973) = 0.0472 (not normalized yet)
   ------------
   Z = 7 × 0.0472 + 0.3307 = 0.6611
   After normalization, D at t+1:
   D(5) = 0.3307 / Z = 0.5002
   D(1) = D(2) = ... = 0.0472 / Z = 0.0714

[Figure: the chosen h1( ), a horizontal split with '*' above and 'O' below.]
Example 1, result at t = 1

Step 4: current total cascaded classifier error CE_t = \frac{1}{n}\sum_{i=1}^{n} S(t, \alpha, h(x_i)), where the cascade output is o_t(x_i) = \sum_{\tau=1}^{t} \alpha_\tau h_\tau(x_i) and S(t, \alpha, h(x_i)) = 1 if y_i \ne \mathrm{sign}(o_t(x_i)), 0 otherwise.

    ##display result t_step=1 ## O_=cascaded_sum, S_=sign(O_), Y=train_class, CE=classification error##
    >i=1, a1*h1(xi)=-0.973, O_=-0.973, S_=-1.000, Y_=-1, CE_=0
    >i=2, a1*h1(xi)=-0.973, O_=-0.973, S_=-1.000, Y_=-1, CE_=0
    >i=3, a1*h1(xi)=-0.973, O_=-0.973, S_=-1.000, Y_=-1, CE_=0
    >i=4, a1*h1(xi)=-0.973, O_=-0.973, S_=-1.000, Y_=-1, CE_=0
    >i=5, a1*h1(xi)=-0.973, O_=-0.973, S_=-1.000, Y_=1, CE_=1
    >i=6, a1*h1(xi)=0.973, O_=0.973, S_=1.000, Y_=1, CE_=0
    >i=7, a1*h1(xi)=0.973, O_=0.973, S_=1.000, Y_=1, CE_=0
    >i=8, a1*h1(xi)=0.973, O_=0.973, S_=1.000, Y_=1, CE_=0
    >weak classifier specifications:
    -dimension: 1=vertical :direction:1=(left="blue_*", right="red_O"); -1=(reverse direction of 1)
    -dimension: 2=horizontal:direction:1=(up="red_O", down="blue_*"); -1=(reverse direction of 1)
    >#-new weak classifier at stage(1):dimension=2,threshold=-25.00;direction=-1
    >Cascaded classifier error up to stage(t=1)for(N=8 training samples) =[sum(CE_)/N]= 0.125

Exercise 1, t = 2

Step 1: weak classifier h2 (left = 'O', right = '*'). Features (1) and (2) are wrongly classified; 2 samples are wrong.

Recall step 1b: \varepsilon_t = \sum_{i=1}^{n} D_t(i)\, I(h_t(x_i) \ne y_i); prerequisite \varepsilon_t < 0.5, otherwise stop. Step 2: \alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}.

   err = \varepsilon(t) = D_t(1) + D_t(2) = 0.0714 + 0.0714 = 0.1428
   \alpha = 0.5 × ln[(1 - 0.1428)/0.1428] = 0.8961
   Find the next D(t+1) = D(t) × exp(∓\alpha):
   Incorrect: D(1) = D(2) = 0.0714 × exp(0.8961) = 0.1749 (not normalized yet)
   Correct: D(3) = D(4) = D(6) = D(7) = D(8) = 0.0714 × exp(-0.8961) = 0.029,
   but D(5) = 0.5002 × exp(-0.8961) = 0.2041
   Z = 2 × 0.1749 + 5 × 0.029 + 0.2041 = 0.6989
   After normalization, D at t+1:
   D(1) = D(2) = 0.1749 / 0.6989 = 0.2503
   D(5) = 0.2041 / 0.6989 = 0.292
   D(3) = D(4) = D(6) = D(7) = D(8) = 0.029 / 0.6989 = 0.0415
Example 1, result at t = 2

Step 4: current total cascaded classifier error CE_t = \frac{1}{n}\sum_{i=1}^{n} S(t, \alpha, h(x_i)), where o_t(x_i) = \sum_{\tau=1}^{t} \alpha_\tau h_\tau(x_i) and S(t, \alpha, h(x_i)) = 1 if y_i \ne \mathrm{sign}(o_t(x_i)), 0 otherwise.

    ##display result t_step=2 ## O_=cascaded_sum, S_=sign(O_), Y=train_class, CE=classification error##
    >i=1, a1*h1(xi)=-0.973, a2*h2(xi)=0.896, O_=-0.077, S_=-1.000, Y_=-1, CE_=0
    >i=2, a1*h1(xi)=-0.973, a2*h2(xi)=0.896, O_=-0.077, S_=-1.000, Y_=-1, CE_=0
    >i=3, a1*h1(xi)=-0.973, a2*h2(xi)=-0.896, O_=-1.869, S_=-1.000, Y_=-1, CE_=0
    >i=4, a1*h1(xi)=-0.973, a2*h2(xi)=-0.896, O_=-1.869, S_=-1.000, Y_=-1, CE_=0
    >i=5, a1*h1(xi)=-0.973, a2*h2(xi)=0.896, O_=-0.077, S_=-1.000, Y_=1, CE_=1
    >i=6, a1*h1(xi)=0.973, a2*h2(xi)=0.896, O_=1.869, S_=1.000, Y_=1, CE_=0
    >i=7, a1*h1(xi)=0.973, a2*h2(xi)=0.896, O_=1.869, S_=1.000, Y_=1, CE_=0
    >i=8, a1*h1(xi)=0.973, a2*h2(xi)=0.896, O_=1.869, S_=1.000, Y_=1, CE_=0
    >weak classifier specifications:
    -dimension: 1=vertical :direction:1=(left="blue_*", right="red_O"); -1=(reverse direction of 1)
    -dimension: 2=horizontal:direction:1=(up="red_O", down="blue_*"); -1=(reverse direction of 1)
    >#-new weak classifier at stage(2):dimension=1,threshold=23.00;direction=-1
    >Cascaded classifier error up to stage(t=2)for(N=8 training samples) =[sum(CE_)/N]= 0.125

Exercise 1, t = 3



Example 1, result at t = 3

Step 4: current total cascaded classifier error CE_t = \frac{1}{n}\sum_{i=1}^{n} S(t, \alpha, h(x_i)), where o_t(x_i) = \sum_{\tau=1}^{t} \alpha_\tau h_\tau(x_i) and S(t, \alpha, h(x_i)) = 1 if y_i \ne \mathrm{sign}(o_t(x_i)), 0 otherwise.

     ##display result t_step=3 ## O_=cascaded_sum, S_=sign(O_),Y=train_class,CE=classification error##
    >i=1, a1*h1(xi)=-0.973, a2*h2(xi)=0.896, a3*h3(xi)=-0.668, O_=-0.745, S_=-1.000, Y_=-1, CE_=0
    >i=2, a1*h1(xi)=-0.973, a2*h2(xi)=0.896, a3*h3(xi)=-0.668, O_=-0.745, S_=-1.000, Y_=-1, CE_=0
    >i=3, a1*h1(xi)=-0.973, a2*h2(xi)=-0.896, a3*h3(xi)=0.668, O_=-1.201, S_=-1.000, Y_=-1, CE_=0
    >i=4, a1*h1(xi)=-0.973, a2*h2(xi)=-0.896, a3*h3(xi)=0.668, O_=-1.201, S_=-1.000, Y_=-1, CE_=0
    >i=5, a1*h1(xi)=-0.973, a2*h2(xi)=0.896, a3*h3(xi)=0.668, O_=0.590, S_=1.000, Y_=1, CE_=0
    >i=6, a1*h1(xi)=0.973, a2*h2(xi)=0.896, a3*h3(xi)=-0.668, O_=1.201, S_=1.000, Y_=1, CE_=0
    >i=7, a1*h1(xi)=0.973, a2*h2(xi)=0.896, a3*h3(xi)=-0.668, O_=1.201, S_=1.000, Y_=1, CE_=0
    >i=8, a1*h1(xi)=0.973, a2*h2(xi)=0.896, a3*h3(xi)=-0.668, O_=1.201, S_=1.000, Y_=1, CE_=0
    >weak classifier specifications:
    -dimension: 1=vertical :direction:1=(left="blue_*", right="red_O"); -1=(reverse direction of 1)
    -dimension: 2=horizontal:direction:1=(up="red_O", down="blue_*"); -1=(reverse direction of 1)
    >#-new weak classifier at stage(3):dimension=1,threshold=3.00;direction=1
    >Cascaded classifier error up to stage(t=3)for(N=8 training samples) =[sum(CE_)/N]= 0.000

Exercise 1, strong classifier



The strong
classifier

Test result, example 1

The final strong classifier:

H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)

CE_t:

Step 4: current total cascaded classifier error CE_t = \frac{1}{n}\sum_{i=1}^{n} S(t, \alpha, h(x_i)), \quad S(t, \alpha, h(x_i)) = \begin{cases} 1 & \text{if } y_i \ne \mathrm{sign}\left(\sum_{\tau=1}^{t} \alpha_\tau h_\tau(x_i)\right) \\ 0 & \text{otherwise} \end{cases}
Appendix

Theory
   We first define a measurement function called the "exponential loss function" to measure the strength of a strong classifier.
   Exponential loss function: L(H) = a measurement of the misclassification rate of a strong classifier H.
   y_i H(x_i) = +1 (correctly classified)
   y_i H(x_i) = -1 (incorrectly classified)
   A good strong classifier should have a low L(H).

For a strong classifier H(x) = \mathrm{sign}\left(\sum_{k=1}^{T} \alpha_k h_k(x)\right), the exponential loss function L(H) is defined as

L(H) = \sum_{i=1}^{n} \frac{1}{e^{y_i H(x_i)}} = \sum_{i=1}^{n} e^{-y_i H(x_i)}

Theory: by definition, the weight update rule is chosen to achieve adaptive boosting

   AdaBoost chooses this weight update function deliberately:

D_{t+1}(i) \propto D_t(i)\exp(-\alpha_t y_i h_t(x_i))

   Because:
   when a training sample is correctly classified, its weight decreases;
   when a training sample is incorrectly classified, its weight increases.
   Some other systems may use different weight update formulas but with the same spirit (correctly classified samples get decreased weight, and vice versa).

Theory: part 1a

Given: \varepsilon_t = probability of incorrect classification by the weak classifier h_t(x) selected at stage t. We want to prove

\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}

Define: the loss function of H at stage t is L(H_t); the loss function at stage t + 1 is L(H_t + \alpha h). For simplification write \alpha = \alpha_t, h = h_t.

Proof: the objective is to find the \alpha that minimizes L(H_t + \alpha h).

L(H_t + \alpha h) = \sum_{i=1}^{n} e^{-y_i (H_t(x_i) + \alpha h(x_i))} = \sum_{i=1}^{n} e^{-y_i H_t(x_i)}\, e^{-\alpha y_i h(x_i)} \quad (i)

(compare AdaBoost step 3: D_{t+1}(i) = D_t(i)\exp(-\alpha_t y_i h_t(x_i))/Z_t)

Split the sum over all n samples into the correct cases {y_i = h(x_i), hence y_i h(x_i) = 1} and the incorrect cases {y_i \ne h(x_i), hence y_i h(x_i) = -1}.

Theory: part 2a (for simplification \alpha = \alpha_t, h = h_t)

Putting the correct and incorrect cases into (i):

L(H_t + \alpha h) = \sum_{y_i = h(x_i)} e^{-y_i H_t(x_i)}\, e^{-\alpha} + \sum_{y_i \ne h(x_i)} e^{-y_i H_t(x_i)}\, e^{+\alpha} \quad (ii)

In (ii), \sum_{y_i = h(x_i)} e^{-y_i H_t(x_i)} and \sum_{y_i \ne h(x_i)} e^{-y_i H_t(x_i)} are independent of \alpha, so they are constants.

To minimize L(H_t + \alpha h) in (ii), set \frac{dL(H_t + \alpha h)}{d\alpha} = 0:

\frac{dL(H_t + \alpha h)}{d\alpha} = -e^{-\alpha}\sum_{y_i = h(x_i)} e^{-y_i H_t(x_i)} + e^{\alpha}\sum_{y_i \ne h(x_i)} e^{-y_i H_t(x_i)} = 0

e^{-\alpha}\sum_{y_i = h(x_i)} e^{-y_i H_t(x_i)} = e^{\alpha}\sum_{y_i \ne h(x_i)} e^{-y_i H_t(x_i)} \quad (iii)

Theory: part 3a (for simplification \alpha = \alpha_t, h = h_t)

Take the log of both sides of (iii):

\log(e^{2\alpha}) = \log\frac{\sum_{y_i = h(x_i)} e^{-y_i H_t(x_i)}}{\sum_{y_i \ne h(x_i)} e^{-y_i H_t(x_i)}}

\alpha = \frac{1}{2}\log\frac{\sum_{y_i = h(x_i)} e^{-y_i H_t(x_i)}}{\sum_{y_i \ne h(x_i)} e^{-y_i H_t(x_i)}} \quad (iv)

Because of step 3 of the previous stage (t-1) and step 1b of the current stage t:

1-\varepsilon = \sum_{y_i = h(x_i)} e^{-y_i H_t(x_i)}/Z_t, \quad \varepsilon = \sum_{y_i \ne h(x_i)} e^{-y_i H_t(x_i)}/Z_t

where Z_t is the normalization factor, Z_t = \sum_{y_i = h(x_i)} e^{-y_i H_t(x_i)} + \sum_{y_i \ne h(x_i)} e^{-y_i H_t(x_i)}.

Recall: for the weak classifier h_t(x), \varepsilon = incorrect classification probability, hence 1 - \varepsilon = correct classification probability. Put the above in (iv):

\alpha = \frac{1}{2}\log\frac{1-\varepsilon}{\varepsilon}

Since we set \alpha = \alpha_t, h = h_t earlier:

\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t} \quad\text{Proved!}

Advanced topic: Viola-Jones' implementation, compared with the original AdaBoost

Also, in Viola-Jones the y range is {1, 0} rather than {1, -1}.

Viola-Jones AdaBoost — update the weights:

w_{t+1,i} = w_{t,i}\,\beta_t^{1-e_i}

where e_i = 0 if example x_i is classified correctly and e_i = 1 otherwise, and \beta_t = \frac{\varepsilon_t}{1-\varepsilon_t}.

Note (assume w_t is already normalized):
   correct (e_i = 0): w_{t+1,i} = w_{t,i}\beta_t^{1} = w_{t,i}\frac{\varepsilon_t}{1-\varepsilon_t} → decreased
   incorrect (e_i = 1): w_{t+1,i} = w_{t,i}\beta_t^{0} = w_{t,i} → no change

Original AdaBoost — recall step 3:

D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t}

where Z_t is the normalization factor that makes D_{t+1} a probability distribution:

Z_t = \sum_{\text{correct}} D_t(i)\,e^{-\alpha_t} + \sum_{\text{incorrect}} D_t(i)\,e^{+\alpha_t}

Note:
   correct (y_i h_t(x_i) = 1): D_{t+1}(i) = D_t(i)e^{-\alpha_t}/Z_t → decrease
   incorrect (y_i h_t(x_i) = -1): D_{t+1}(i) = D_t(i)e^{+\alpha_t}/Z_t → increase
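Although the two update rules look different, they produce the same distribution once normalized, because only the incorrect-to-correct weight ratio matters: 1/\beta_t = (1-\varepsilon_t)/\varepsilon_t = e^{2\alpha_t}. A quick numeric sketch:

    eps_t = 0.2;
    a = 0.5 * log((1 - eps_t) / eps_t);   % original AdaBoost alpha_t
    b = eps_t / (1 - eps_t);              % Viola-Jones beta_t
    w_vj  = [b, 1];                       % [correct, incorrect] sample weights
    w_ada = [exp(-a), exp(a)];
    disp(w_vj  / sum(w_vj));              % 0.2  0.8
    disp(w_ada / sum(w_ada));             % 0.2  0.8  (identical)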

Exercise 2

    if example == 2
        blue = [ -46  18;
                 -30 -30;
                 -31 -19;
                  -8  15;
                   8 -45;
                 -22   2];
        red  = [  33  38;
                  30  10;
                  21  35;
                   1  19;
                  14  23;
                  37 -41];
        datafeatures = [blue; red];
        dataclass    = [-1 -1 -1 -1 -1 -1  1  1  1  1  1  1];

Exercise 2, t = 0



Face detection idea

   1) In AdaBoost, use the axis-parallel (decision-stump) weak classifier.
   2) In Viola-Jones, the weak classifier is the specially designed classifier described in the paper.

Useful Features Learned by Boosting

A Cascade of Classifiers
will be discussed in the next chapter

Reference
   [Chen 2007] Qing Chen, Nicolas D. Georganas and Emil M. Petriu, "Real-Time Vision-Based Gesture Recognition Using Haar-like Features", IMTC 2007, Warsaw, Poland, May 1-3, 2007.
   [Smyth 2007] slides: Smyth, "Face Detection using the Viola-Jones Method", http://www.ics.uci.edu/~smyth/courses/cs175/slides12_viola_jones_face_detection.ppt
   [Deng 2007] slides: Hongbo Deng, "A Brief Introduction to Adaboost", 6 Feb 2007.
   [Freund] slides: "A Tutorial on Boosting", www.cs.toronto.edu/~hinton/csc321/notes/boosting.pdf
   [Hoiem 2004] slides: Derek Hoiem, "Adaboost", March 31, 2004.
   [Jensen 2008] Jensen, "Implementing the Viola-Jones Face Detection Algorithm", http://orbit.dtu.dk/getResource?recordId=223656&objectId=1&versionId=1
   [Babenko] Boris Babenko, "Note: A Derivation of Discrete AdaBoost", Department of Computer Science and Engineering, University of California, San Diego, http://vision.ucsd.edu/~bbabenko/data/boosting_note.pdf
   [Kroon 2010] http://www.mathworks.com/matlabcentral/fileexchange/27813-classic-

Matlab demo

   [Kroon 2010] http://www.mathworks.com/matlabcentral/file
   http://people.csail.mit.edu/torralba/shortCourseRLOC/boosting/boosting.html

Answer 0: exercise for steps 1a, 1b

Step 1a: Find the classifier h_t : X → {-1, +1} that minimizes the error with respect to D_t.
Step 1b: Checking step, prerequisite: \varepsilon_t < 0.5, otherwise stop.

   Assume h( ) can only be a horizontal or a vertical separator.
   How many different classifiers are available?
   Answer: because there are 12 training samples, we have 11x2 vertical + 11x2 horizontal classifiers, total = 44.
   If the h_j( ) shown in the figure is selected, circle the misclassified training samples. Find \varepsilon( ) to see the misclassification probability if the probability distribution D is the same for each sample.
   Answer: each sample has probability 1/12; there are 4 misclassified samples, so \varepsilon = 4 × (1/12) = 1/3.
   Find the h( ) with minimum error. Answer: repeat the above for every h( ), compare, and find the result.

Answer, Exercise 2, t = 1



Answer, Exercise 2, t=2


Answer, Exercise 2, t=3



Answer, Exercise 2, strong classifier



Testing example 2

The final strong classifier: H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)



Assignment
Find the strong classifier from this training data set. Write down clearly the type of h( ) at each stage t (e.g. left = blue, right = red; threshold on u or v, etc.) and the values of \varepsilon and \alpha of each stage.

The final strong classifier: H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)

    if example == 3
        blue = [ -26 -18;
                 -30 -30;
                  31 -19;
                  -8 -15;
                 -22   2];
        red  = [  33  38;
                  30  10;
                  21  35;
                   1  19;
                  37 -41];
        datafeatures = [blue; red];
        dataclass    = [-1 -1 -1 -1 -1  1  1  1  1  1];

AdaBoost Assignment, t = 0

