Introduction to Bayesian Learning

BAYESIAN LEARNING

Jianping Fan
Dept of Computer Science
UNC-Charlotte
OVERVIEW

 Bayesian classification: one example
    E.g. how to decide if a patient is sick or healthy, based on:
       a probabilistic model of the observed data
       prior knowledge
CLASSIFICATION PROBLEM

 Training data: examples of the form (d, h(d)),
   where d are the data objects to classify (inputs)
   and h(d) ∈ {1, …, K} is the correct class label for d
 Goal: given d_new, provide h(d_new)
WHY BAYESIAN?
 Provides practical learning algorithms
    E.g. Naïve Bayes
 Prior knowledge and observed data can be combined

 It is a generative (model based) approach, which
  offers a useful conceptual framework
     E.g. sequences could also be classified, based on a
      probabilistic model specification
     Any kind of objects can be classified, based on a
      probabilistic model specification
BAYES’ RULE

    P(h | d) = P(d | h) P(h) / P(d)
             = P(d | h) P(h) / Σ_h P(d | h) P(h)

Understanding Bayes’ rule:
    d = data
    h = hypothesis (model)
    Rearranging:  P(h | d) P(d) = P(d | h) P(h),
    i.e.  P(d, h) = P(d, h): the same joint probability on both sides.
Who is who in Bayes’ rule

    P(h):      prior belief (probability of hypothesis h before seeing any data)
    P(d | h):  likelihood (probability of the data if the hypothesis h is true)
    P(d) = Σ_h P(d | h) P(h):  data evidence (marginal probability of the data)
    P(h | d):  posterior (probability of hypothesis h after having seen the data d)
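A minimal Python sketch of the rule above; the `posterior` helper, the sick/healthy hypothesis space, and all the numbers are invented for illustration:

```python
# Bayes' rule: P(h | d) = P(d | h) P(h) / P(d), with P(d) = sum_h P(d | h) P(h).

def posterior(prior, likelihood, data):
    """prior[h] = P(h); likelihood[h][d] = P(d | h)."""
    evidence = sum(prior[h] * likelihood[h][data] for h in prior)  # P(d)
    return {h: prior[h] * likelihood[h][data] / evidence for h in prior}

prior = {"sick": 0.1, "healthy": 0.9}
likelihood = {"sick": {"fever": 0.8, "no fever": 0.2},
              "healthy": {"fever": 0.1, "no fever": 0.9}}

print(posterior(prior, likelihood, "fever"))  # P(h | fever) for each hypothesis h
```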
Gaussian Mixture Model (GMM)
[three figure slides illustrating Gaussian mixture models]
PROBABILITIES – AUXILIARY SLIDE FOR MEMORY REFRESHING
     Have two dice h1 and h2
     The probability of rolling an i given die h1 is denoted
      P(i|h1). This is a conditional probability
     Pick a die at random with probability P(hj), j=1 or 2.
      The probability for picking die hj and rolling an i with
      it is called joint probability and is P(i, hj)=P(hj)P(i| hj).
     For any events X and Y, P(X,Y)=P(X|Y)P(Y)
     If we know P(X,Y), then the so-called marginal
      probability P(X) can be computed as  P(X) = Σ_Y P(X,Y)
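A small sketch of these identities, assuming an illustrative loaded die h2 (all probabilities are invented):

```python
from fractions import Fraction

# Two dice picked with equal probability; h1 is fair, h2 is loaded toward 6.
P_h = {"h1": Fraction(1, 2), "h2": Fraction(1, 2)}
P_i_given_h = {"h1": {i: Fraction(1, 6) for i in range(1, 7)},
               "h2": {i: Fraction(1, 2) if i == 6 else Fraction(1, 10)
                      for i in range(1, 7)}}

# Joint: P(i, h) = P(h) P(i | h); marginal: P(i) = sum_h P(i, h).
P_joint = {(i, h): P_h[h] * P_i_given_h[h][i] for h in P_h for i in range(1, 7)}
P_6 = sum(P_joint[(6, h)] for h in P_h)
print(P_6)  # 1/2 * 1/6 + 1/2 * 1/2 = 1/3
```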
DOES PATIENT HAVE CANCER OR NOT?
   A patient takes a lab test and the result comes back
    positive. It is known that the test returns a correct positive
    result in only 98% of the cases and a correct negative
    result in only 97% of the cases. Furthermore, only 0.008 of
    the entire population has this disease.

    1. What is the probability that this patient has cancer?
    2. What is the probability that he does not have cancer?
    3. What is the diagnosis?
    hypothesis1: 'cancer'    }
    hypothesis2: '¬cancer'   }  hypothesis space H
    data: '+'

    1. P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
                     = 0.98 × 0.008 / 0.0376 ≈ 0.21

       P(+ | cancer) = 0.98
       P(cancer) = 0.008
       P(+) = P(+ | cancer) P(cancer) + P(+ | ¬cancer) P(¬cancer)
            = 0.98 × 0.008 + 0.03 × 0.992 = 0.0376
       P(+ | ¬cancer) = 0.03
       P(¬cancer) = 0.992

    2. P(¬cancer | +) = 1 - P(cancer | +) ≈ 0.79

    3. Diagnosis ??
CHOOSING HYPOTHESES

 Maximum Likelihood hypothesis:

       h_ML = argmax_{h ∈ H} P(d | h)

 Generally we want the most probable hypothesis given
   the training data. This is the maximum a posteriori
   hypothesis:

       h_MAP = argmax_{h ∈ H} P(h | d)

     Useful observation: it does not depend on the denominator P(d)
NOW WE COMPUTE THE DIAGNOSIS

 To find the Maximum Likelihood hypothesis, we evaluate P(d | h)
  for the data d, which is the positive lab test, and choose the
  hypothesis (diagnosis) that maximises it:

       P(+ | cancer) = 0.98
       P(+ | ¬cancer) = 0.03
       Diagnosis:  h_ML = cancer

 To find the Maximum A Posteriori hypothesis, we evaluate
  P(d | h) P(h) for the data d, which is the positive lab test, and
  choose the hypothesis (diagnosis) that maximises it. This is the
  same as choosing the hypothesis that gives the higher posterior
  probability:

       P(+ | cancer) P(cancer) = 0.98 × 0.008 ≈ 0.0078
       P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 ≈ 0.0298
       Diagnosis:  h_MAP = ¬cancer
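A quick check of the two computations, as a sketch (the variable names are ours; the numbers are the ones given above):

```python
# ML picks argmax_h P(d | h); MAP picks argmax_h P(d | h) P(h).
likelihood = {"cancer": 0.98, "no cancer": 0.03}          # P(+ | h)
prior = {"cancer": 0.008, "no cancer": 0.992}             # P(h)
map_score = {h: likelihood[h] * prior[h] for h in prior}  # P(+ | h) P(h)

h_ml = max(likelihood, key=likelihood.get)   # 'cancer'    (0.98 > 0.03)
h_map = max(map_score, key=map_score.get)    # 'no cancer' (0.0298 > 0.0078)
print(h_ml, h_map)
```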
NAÏVE BAYES CLASSIFIER
   What can we do if our data d has several attributes?
   Naïve Bayes assumption: Attributes that describe data
    instances are conditionally independent given the
    classification hypothesis
               P(d | h) = P(a₁, …, a_T | h) = ∏_t P(a_t | h)

       it is a simplifying assumption; obviously it may be violated in reality
       in spite of that, it works well in practice
   The Bayesian classifier that uses the Naïve Bayes assumption
    and computes the MAP hypothesis is called Naïve Bayes
    classifier
   One of the most practical learning methods
   Successful applications:
     Medical Diagnosis
     Text classification
EXAMPLE. ‘PLAY TENNIS’ DATA

 Day    Outlook   Temperature  Humidity  Wind    PlayTennis
 Day1   Sunny     Hot          High      Weak    No
 Day2   Sunny     Hot          High      Strong  No
 Day3   Overcast  Hot          High      Weak    Yes
 Day4   Rain      Mild         High      Weak    Yes
 Day5   Rain      Cool         Normal    Weak    Yes
 Day6   Rain      Cool         Normal    Strong  No
 Day7   Overcast  Cool         Normal    Strong  Yes
 Day8   Sunny     Mild         High      Weak    No
 Day9   Sunny     Cool         Normal    Weak    Yes
 Day10  Rain      Mild         Normal    Weak    Yes
 Day11  Sunny     Mild         Normal    Strong  Yes
 Day12  Overcast  Mild         High      Strong  Yes
 Day13  Overcast  Hot          Normal    Weak    Yes
 Day14  Rain      Mild         High      Strong  No
NAÏVE BAYES SOLUTION

Classify any new datum instance x = (a₁, …, a_T) as:

    h_NaiveBayes = argmax_h P(h) P(x | h) = argmax_h P(h) ∏_t P(a_t | h)

 To do this based on training examples, we need to estimate the
  parameters from the training examples:

     For each target value (hypothesis) h:   P̂(h) := estimate P(h)
     For each attribute value a_t of each datum instance:   P̂(a_t | h) := estimate P(a_t | h)

Based on the examples in the table, classify the following datum x:

    x = (Outl = Sunny, Temp = Cool, Hum = High, Wind = Strong)

That means: play tennis or not?

    h_NB = argmax_{h ∈ [yes, no]} P(h) P(x | h) = argmax_{h ∈ [yes, no]} P(h) ∏_t P(a_t | h)

         = argmax_{h ∈ [yes, no]} P(h) P(Outlook = sunny | h) P(Temp = cool | h)
                                       P(Humidity = high | h) P(Wind = strong | h)

 Working:

       P(PlayTennis = yes) = 9/14 = 0.64
       P(PlayTennis = no) = 5/14 = 0.36
       P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
       P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
       etc.

       P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 0.0053
       P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = 0.0206

       answer: PlayTennis(x) = no
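A minimal Python sketch of this computation on the table above; `nb_classify` is a hypothetical helper that estimates each probability by counting, as in the working shown:

```python
from collections import Counter

# (Outlook, Temperature, Humidity, Wind) -> PlayTennis, from the table above.
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),     ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

def nb_classify(x):
    classes = Counter(row[-1] for row in data)           # counts for each h
    scores = {}
    for h, n_h in classes.items():
        score = n_h / len(data)                          # P(h)
        for t, a in enumerate(x):                        # product of P(a_t | h)
            n_match = sum(1 for row in data if row[-1] == h and row[t] == a)
            score *= n_match / n_h
        scores[h] = score
    return max(scores, key=scores.get), scores

print(nb_classify(("Sunny", "Cool", "High", "Strong")))
# -> ('No', {'No': 0.0206, 'Yes': 0.0053})  (up to rounding)
```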
LEARNING TO CLASSIFY TEXT
  Learn  from examples which articles are of
   interest
  The attributes are the words
  Note that the Naïve Bayes assumption just means that
   we have a random sequence model within each class!
  NB classifiers are one of the most effective
   for this task
  Resources for those interested:
      Tom Mitchell: Machine Learning (book) Chapter
       6.
RESULTS ON A BENCHMARK TEXT CORPUS
[figure slide: benchmark results; no text content recovered]
REMEMBER
   Bayes’ rule can be turned into a classifier
   Maximum A Posteriori (MAP) hypothesis estimation
    incorporates prior knowledge; Max Likelihood doesn’t
   Naive Bayes Classifier is a simple but effective
    Bayesian classifier for vector data (i.e. data with several
    attributes) that assumes that attributes are
    independent given the class.
   Bayesian classification is a generative approach to
    classification
RESOURCES
   Textbook reading (contains details about using
    Naïve Bayes for text classification):
    Tom Mitchell, Machine Learning (book), Chapter 6.
   Software: NB for classifying text:
    http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
   Useful reading for those interested to learn more
    about NB classification, beyond the scope of this
    module:
    http://www-2.cs.cmu.edu/~tom/NewChapters.html
UNIVARIATE NORMAL SAMPLE

    X ~ N(μ, σ²)

    Sampling:  x = (x₁, x₂, …, x_n)ᵀ

    Estimate:  μ̂ = ?,   σ̂² = ?

    f(x | μ, σ²) = (1 / (√(2π) σ)) exp(-(x - μ)² / (2σ²))
MAXIMUM LIKELIHOOD

    f(x | μ, σ²) = (1 / (√(2π) σ)) exp(-(x - μ)² / (2σ²))

    Sampling:  x = (x₁, x₂, …, x_n)ᵀ

    L(μ, σ² | x) = f(x | μ, σ²) = f(x₁ | μ, σ²) ⋯ f(x_n | μ, σ²)

                 = (1 / (2πσ²))^{n/2} exp(-Σ_{i=1}^n (x_i - μ)² / (2σ²))

Given x, this is a function of μ and σ². We want to maximize it.
LOG-LIKELIHOOD FUNCTION

    L(μ, σ² | x) = (1 / (2πσ²))^{n/2} exp(-Σ_{i=1}^n (x_i - μ)² / (2σ²))

Maximize this instead:

    l(μ, σ² | x) = log L(μ, σ² | x)

                 = -(n/2) log σ² - (n/2) log 2π - Σ_{i=1}^n (x_i - μ)² / (2σ²)

                 = -(n/2) log σ² - (n/2) log 2π
                   - (1/(2σ²)) Σ_{i=1}^n x_i² + (μ/σ²) Σ_{i=1}^n x_i - nμ²/(2σ²)

We maximize it by setting

    ∂l(μ, σ² | x)/∂μ = 0    and    ∂l(μ, σ² | x)/∂σ² = 0
MAX. THE LOG-LIKELIHOOD FUNCTION

    l(μ, σ² | x) = -(n/2) log σ² - (n/2) log 2π
                   - (1/(2σ²)) Σ_{i=1}^n x_i² + (μ/σ²) Σ_{i=1}^n x_i - nμ²/(2σ²)

    ∂l(μ, σ² | x)/∂μ = (1/σ²) Σ_{i=1}^n x_i - nμ/σ² = 0

    ⇒  μ̂ = (1/n) Σ_{i=1}^n x_i
MAX. THE LOG-LIKELIHOOD FUNCTION

    μ̂ = (1/n) Σ_{i=1}^n x_i

    ∂l(μ, σ² | x)/∂σ² = -n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n x_i²
                        - (μ/σ⁴) Σ_{i=1}^n x_i + nμ²/(2σ⁴) = 0

    ⇒  nσ² = Σ_{i=1}^n x_i² - 2μ Σ_{i=1}^n x_i + nμ²

    ⇒  σ̂² = (1/n) Σ_{i=1}^n x_i² - ((1/n) Σ_{i=1}^n x_i)²  =  (1/n) Σ_{i=1}^n x_i² - μ̂²
MISSING DATA

    μ̂ = (1/n) Σ_{i=1}^n x_i        σ̂² = (1/n) Σ_{i=1}^n x_i² - μ̂²

    Sampling:  x = (x₁, …, x_m, x_{m+1}, …, x_n)ᵀ,
    where x_{m+1}, …, x_n are missing.

    μ̂ = (1/n) (Σ_{i=1}^m x_i + Σ_{j=m+1}^n x_j)

    σ̂² = (1/n) (Σ_{i=1}^m x_i² + Σ_{j=m+1}^n x_j²) - μ̂²
E-STEP

    μ̂ = (1/n) (Σ_{i=1}^m x_i + Σ_{j=m+1}^n x_j)
    σ̂² = (1/n) (Σ_{i=1}^m x_i² + Σ_{j=m+1}^n x_j²) - μ̂²

Let μ^(t), σ²^(t) be the estimated parameters at the start of the t-th iteration.

    E_{μ^(t), σ²^(t)} [ Σ_{j=m+1}^n x_j | x_obs ] = (n - m) μ^(t)

    E_{μ^(t), σ²^(t)} [ Σ_{j=m+1}^n x_j² | x_obs ] = (n - m) ((μ^(t))² + σ²^(t))
E-STEP

Let μ^(t), σ²^(t) be the estimated parameters at the start of the t-th iteration.
Combining the observed sums with the expectations of the missing parts:

    s₁^(t) = Σ_{i=1}^m x_i + (n - m) μ^(t)

    s₂^(t) = Σ_{i=1}^m x_i² + (n - m) ((μ^(t))² + σ²^(t))
M-STEP

    μ^(t+1) = s₁^(t) / n

    σ²^(t+1) = s₂^(t) / n - (μ^(t+1))²

    where   s₁^(t) = Σ_{i=1}^m x_i + (n - m) μ^(t)
            s₂^(t) = Σ_{i=1}^m x_i² + (n - m) ((μ^(t))² + σ²^(t))
EXERCISE

    X ~ N(μ, σ²)

n = 40 (10 data missing). Estimate μ, σ² using different initial conditions.
The 30 observed values:
    375.081556     243.548664     454.981077
    362.275902     382.789939     479.685107
    332.612068     374.419161     336.634962
    351.383048     337.289831     407.030453
    304.823174     418.928822     297.821512
    386.438672     364.086502     311.267105
    430.079689     343.854855     528.267783
    395.317406     371.279406     419.841982
    369.029845     439.241736     392.684770
    365.343938     338.281616     301.910093
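A minimal sketch of the E-step/M-step above, applied to the 30 observed values; the iteration count and starting values are arbitrary choices for experimenting with initial conditions:

```python
import numpy as np

# 30 observed values from the exercise; 10 of the n = 40 samples are missing.
x_obs = np.array([
    375.081556, 243.548664, 454.981077, 362.275902, 382.789939, 479.685107,
    332.612068, 374.419161, 336.634962, 351.383048, 337.289831, 407.030453,
    304.823174, 418.928822, 297.821512, 386.438672, 364.086502, 311.267105,
    430.079689, 343.854855, 528.267783, 395.317406, 371.279406, 419.841982,
    369.029845, 439.241736, 392.684770, 365.343938, 338.281616, 301.910093,
])
n, m = 40, len(x_obs)

mu, sigma2 = 0.0, 1.0        # try different initial conditions here
for _ in range(100):
    # E-step: sufficient statistics, filling in the missing values in expectation
    s1 = x_obs.sum() + (n - m) * mu
    s2 = (x_obs ** 2).sum() + (n - m) * (mu ** 2 + sigma2)
    # M-step: re-estimate the parameters
    mu = s1 / n
    sigma2 = s2 / n - mu ** 2
print(mu, sigma2)
```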
MULTINOMIAL POPULATION

    Sampling N samples from classes C₁, …, C_n with probabilities
    (p₁, p₂, …, p_n),   Σ_i p_i = 1

    x = (x₁, x₂, …, x_n)ᵀ,    x_i : # samples in C_i,    x₁ + x₂ + ⋯ + x_n = N

    p(x | p₁, …, p_n) = (N! / (x₁! ⋯ x_n!)) p₁^{x₁} ⋯ p_n^{x_n}
MAXIMUM LIKELIHOOD

    p(x | p₁, …, p_n) = (N! / (x₁! ⋯ x_n!)) p₁^{x₁} ⋯ p_n^{x_n}

    Sampling N samples with class probabilities ((1-θ)/2, θ/4, θ/4, 1/2):

    x = (x₁, x₂, x₃, x₄)ᵀ,    x_i : # samples in C_i,    x₁ + x₂ + x₃ + x₄ = N

We want to maximize:

    L(θ | x) = p(x | θ) = (N! / (x₁! ⋯ x₄!)) ((1-θ)/2)^{x₁} (θ/4)^{x₂} (θ/4)^{x₃} (1/2)^{x₄}
LOG-LIKELIHOOD

    L(θ | x) = (N! / (x₁! ⋯ x₄!)) ((1-θ)/2)^{x₁} (θ/4)^{x₂} (θ/4)^{x₃} (1/2)^{x₄}

    l(θ | x) = log L(θ | x)
             = x₁ log((1-θ)/2) + x₂ log(θ/4) + x₃ log(θ/4) + const

    ∂l(θ | x)/∂θ = -x₁/(1-θ) + x₂/θ + x₃/θ = 0

    ⇒  -x₁ θ + x₂ (1-θ) + x₃ (1-θ) = 0

    ⇒  θ̂ = (x₂ + x₃) / (x₁ + x₂ + x₃)
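A small numerical sanity check of the closed form; the counts are invented for illustration:

```python
import numpy as np

# Grid-search the log-likelihood and compare with theta_hat = (x2+x3)/(x1+x2+x3).
x1, x2, x3 = 20, 15, 13                      # made-up counts
theta = np.linspace(0.001, 0.999, 9999)
loglik = x1 * np.log((1 - theta) / 2) + (x2 + x3) * np.log(theta / 4)
print(theta[np.argmax(loglik)])              # ~ 0.5833
print((x2 + x3) / (x1 + x2 + x3))            # 28/48 = 0.5833...
```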
MIXED ATTRIBUTES

    θ̂ = (x₂ + x₃) / (x₁ + x₂ + x₃)

    Sampling N samples with class probabilities ((1-θ)/2, θ/4, θ/4, 1/2):

    x = (x₁, x₂, x₃ + x₄)ᵀ

    Only the sum x₃ + x₄ is observed; x₃ itself is not available, so the
    MLE  θ̂ = (x₂ + x₃) / (x₁ + x₂ + x₃)  cannot be evaluated directly.
E-STEP

    Sampling N samples,  x = (x₁, x₂, x₃ + x₄)ᵀ,  where x₃ is not available.

Given θ^(t), what can you say about x₃?

    E_{θ^(t)} [ x₃ | x ] = (x₃ + x₄) · (θ^(t)/4) / (1/2 + θ^(t)/4)  =:  x̂₃^(t)
M-STEP

    θ^(t+1) = (x₂ + x̂₃^(t)) / (x₁ + x₂ + x̂₃^(t))

    where   x̂₃^(t) = E_{θ^(t)} [ x₃ | x ] = (x₃ + x₄) · (θ^(t)/4) / (1/2 + θ^(t)/4)
EXERCISE

    x_obs = (x₁, x₂, x₃ + x₄)ᵀ = (38, 34, 125)ᵀ

Estimate θ using different initial conditions.
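A minimal sketch of the E/M iteration above on the given counts; the initial values and iteration count are arbitrary:

```python
# EM for the exercise: x_obs = (x1, x2, x3 + x4) = (38, 34, 125).
x1, x2, x34 = 38, 34, 125

for theta0 in (0.1, 0.5, 0.9):               # different initial conditions
    theta = theta0
    for _ in range(100):
        x3_hat = x34 * (theta / 4) / (theta / 4 + 1 / 2)    # E-step
        theta = (x2 + x3_hat) / (x1 + x2 + x3_hat)          # M-step
    print(theta0, theta)    # all starting points should reach the same fixed point
```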
BINOMIAL/POISSON MIXTURE

    M : married obasong
    X : # children

    # Children    0    1    2    3    4    5    6
    # Obasongs    n₀   n₁   n₂   n₃   n₄   n₅   n₆

Married Obasongs:       P(M) = 1 - π,    X | M ~ P(λ),    P(x | M) = λ^x e^{-λ} / x!

Unmarried Obasongs
(no children):          P(Mᶜ) = π,    P(X = 0 | Mᶜ) = 1
BINOMIAL/POISSON MIXTURE

    M : married obasong
    X : # children

    # Children    0    1    2    3    4    5    6
    # Obasongs    n₀   n₁   n₂   n₃   n₄   n₅   n₆

    n₀ = n_A + n_B

Unobserved data:
    n_A : # unmarried Ob's
    n_B : # married Ob's with no children
BINOMIAL/POISSON MIXTURE

    # Children      0           1    2    3    4    5    6
    Complete data   n_A, n_B    n₁   n₂   n₃   n₄   n₅   n₆
    Probability     p_A, p_B    p₁   p₂   p₃   p₄   p₅   p₆

    p_A = π          p_B = e^{-λ} (1 - π)

    p_x = (λ^x e^{-λ} / x!) (1 - π),    x = 1, 2, …
COMPLETE DATA LIKELIHOOD

    p_A = π,   p_B = e^{-λ} (1 - π),   p_x = (λ^x e^{-λ} / x!) (1 - π),  x = 1, 2, …

    n = (n_A, n_B, n₁, …, n₆)ᵀ                      (complete data)
    n_obs = (n₀, n₁, …, n₆)ᵀ,   n₀ = n_A + n_B      (observed data)

    L(π, λ | n) = p(n | π, λ)
                = [(n_A + n_B + n₁ + ⋯ + n₆)! / (n_A! n_B! n₁! ⋯ n₆!)] p_A^{n_A} p_B^{n_B} p₁^{n₁} ⋯ p₆^{n₆}
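The deck stops at the complete-data likelihood. As a hedged sketch, the EM updates below are our completion, obtained by maximizing that likelihood (not shown in the slides): the E-step splits n₀ into n_A = n₀ p_A / (p_A + p_B) and n_B = n₀ - n_A, and the M-step gives π = n_A / N and λ = (total children) / (n_B + Σ_{x≥1} n_x). The counts are invented:

```python
import math

# Zero-inflated Poisson EM sketch. n[x] = # obasongs with x children (made-up data).
n = [3062, 587, 284, 103, 33, 4, 2]          # n0..n6, illustrative numbers
N = sum(n)
kids = sum(x * nx for x, nx in enumerate(n)) # total number of children
n_pos = sum(n[1:])                           # obasongs observed with >= 1 child

pi, lam = 0.5, 1.0                           # initial guesses (assumed)
for _ in range(200):
    # E-step: split n0 into unmarried (n_a) and married-with-no-children (n_b)
    p_a, p_b = pi, math.exp(-lam) * (1 - pi)
    n_a = n[0] * p_a / (p_a + p_b)
    n_b = n[0] - n_a
    # M-step: re-estimate (pi, lambda) from the completed counts
    pi = n_a / N
    lam = kids / (n_b + n_pos)
print(pi, lam)
```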
MAXIMUM LIKELIHOOD

    X = {x₁, x₂, …, x_N}

    L(Θ | X) = p(X | Θ) = ∏_{i=1}^N p(x_i | Θ)

    Θ* = argmax_Θ L(Θ | X)
LATENT VARIABLES

    Incomplete data:    X = {x₁, x₂, …, x_N}
    Latent variables:   Y = {y₁, y₂, …, y_N}

    Complete data:      Z = (X, Y)
COMPLETE DATA LIKELIHOOD

    Complete data:  Z = (X, Y)

    X = {x₁, x₂, …, x_N},    Y = {y₁, y₂, …, y_N}

    L(Θ | Z) = p(Z | Θ) = p(X, Y | Θ) = p(Y | X, Θ) p(X | Θ)
COMPLETE DATA LIKELIHOOD

    L(Θ | Z) = p(Y | X, Θ) p(X | Θ)

    L(Θ | Z) is a function of the random variable Y.
    p(Y | X, Θ) is a function of the latent variable Y and the parameter Θ;
     if we are given Θ, the result is in terms of the random variable Y.
    p(X | Θ) is a function of the parameter Θ, and is computable.
EXPECTATION STEP

    L(Θ | Z) = p(X, Y | Θ)

Let Θ^(i-1) be the parameter vector obtained at the (i-1)-th step.

Define

    Q(Θ, Θ^(i-1)) = E[ log L(Θ | Z) | X, Θ^(i-1) ]

                  = ∫_{y ∈ Y} log p(X, y | Θ) · p(y | X, Θ^(i-1)) dy     (Y continuous)

                  = Σ_{y ∈ Y} log p(X, y | Θ) · p(y | X, Θ^(i-1))        (Y discrete)
MAXIMIZATION STEP

Let Θ^(i-1) be the parameter vector obtained at the (i-1)-th step, and let
Q(Θ, Θ^(i-1)) = E[ log L(Θ | Z) | X, Θ^(i-1) ] be as defined above. Then set

    Θ^(i) = argmax_Θ Q(Θ, Θ^(i-1))
MIXTURE MODELS

 If there is reason to believe that a data set comprises several
  distinct populations, a mixture model can be used.
 It has the following form:

    p(x | Θ) = Σ_{j=1}^M α_j p_j(x | θ_j)     with    Σ_{j=1}^M α_j = 1

    Θ = (α₁, …, α_M, θ₁, …, θ_M)
MIXTURE MODELS

    p(x | Θ) = Σ_{j=1}^M α_j p_j(x | θ_j)

    X = {x₁, x₂, …, x_N},    Y = {y₁, y₂, …, y_N}

Let y_i ∈ {1, …, M} represent the source that generates the data point x_i.
MIXTURE MODELS

    p(x | Θ) = Σ_{j=1}^M α_j p_j(x | θ_j)

    p(x | y = j, Θ) = p_j(x | θ_j)

    p(y = j | Θ) = α_j

Let y_i ∈ {1, …, M} represent the source that generates the data point x_i.
MIXTURE MODELS

    p(x | Θ) = Σ_{j=1}^M α_j p_j(x | θ_j)

    z_i = (x_i, y_i)

    p(z_i | Θ) = p(x_i, y_i | Θ) = p(y_i | x_i, Θ) p(x_i | Θ)

    p(y = j | Θ) = α_j,     p(x | y = j, Θ) = p_j(x | θ_j)
MIXTURE MODELS

    p(x | Θ) = Σ_{j=1}^M α_j p_j(x | θ_j)

    p(y_i | x_i, Θ) = p(x_i, y_i, Θ) / p(x_i, Θ)
                    = p(x_i | y_i, Θ) p(y_i, Θ) / p(x_i, Θ)
                    = p(x_i | y_i, Θ) p(y_i | Θ) p(Θ) / (p(x_i | Θ) p(Θ))
                    = p(x_i | y_i, Θ) p(y_i | Θ) / p(x_i | Θ)
                    = α_{y_i} p_{y_i}(x_i | θ_{y_i}) / Σ_{j=1}^M α_j p_j(x_i | θ_j)
EXPECTATION

Using p(y | X, Θ^g) = ∏_{j=1}^N p(y_j | x_j, Θ^g) and the Kronecker delta δ_{y_i, l}:

    Q(Θ, Θ^g) = Σ_{y ∈ Y} Σ_{l=1}^M Σ_{i=1}^N δ_{y_i, l} log(α_l p_l(x_i | θ_l)) ∏_{j=1}^N p(y_j | x_j, Θ^g)

              = Σ_{l=1}^M Σ_{i=1}^N log(α_l p_l(x_i | θ_l)) Σ_{y₁=1}^M ⋯ Σ_{y_N=1}^M δ_{y_i, l} ∏_{j=1}^N p(y_j | x_j, Θ^g)

The term δ_{y_i, l} is zero when y_i ≠ l, so the inner sums collapse to

    Q(Θ, Θ^g) = Σ_{l=1}^M Σ_{i=1}^N log(α_l p_l(x_i | θ_l))
                · [ Σ_{y₁=1}^M ⋯ Σ_{y_{i-1}=1}^M Σ_{y_{i+1}=1}^M ⋯ Σ_{y_N=1}^M ∏_{j=1, j≠i}^N p(y_j | x_j, Θ^g) ] · p(l | x_i, Θ^g)

The bracketed factor is a product of sums that each equal 1, so

    Q(Θ, Θ^g) = Σ_{l=1}^M Σ_{i=1}^N log(α_l p_l(x_i | θ_l)) p(l | x_i, Θ^g)

              = Σ_{l=1}^M Σ_{i=1}^N log(α_l) p(l | x_i, Θ^g)
                + Σ_{l=1}^M Σ_{i=1}^N log[ p_l(x_i | θ_l) ] p(l | x_i, Θ^g)
MAXIMIZATION

    Θ = (α₁, …, α_M, θ₁, …, θ_M)

Given the initial guess Θ^g,

    Q(Θ, Θ^g) = Σ_{l=1}^M Σ_{i=1}^N log(α_l) p(l | x_i, Θ^g)
                + Σ_{l=1}^M Σ_{i=1}^N log[ p_l(x_i | θ_l) ] p(l | x_i, Θ^g)

We want to find the α's and θ's that maximize the above expectation.
In practice, this is done iteratively.
THE GMM (GAUSSIAN MIXTURE MODEL)

 Gaussian model of a d-dimensional source, say j:

    p_j(x | μ_j, Σ_j) = (1 / ((2π)^{d/2} |Σ_j|^{1/2})) exp( -(1/2) (x - μ_j)ᵀ Σ_j⁻¹ (x - μ_j) )

    θ_j = (μ_j, Σ_j)

 GMM with M sources:

    p(x | μ₁, Σ₁, …, μ_M, Σ_M) = Σ_{j=1}^M α_j p_j(x | μ_j, Σ_j),    α_j ≥ 0,    Σ_{j=1}^M α_j = 1
GOAL

Mixture model:

    p(x | Θ) = Σ_{l=1}^M α_l p_l(x | θ_l)

    Θ = (α₁, …, α_M, θ₁, …, θ_M)     subject to    Σ_{l=1}^M α_l = 1

To maximize:

    Q(Θ, Θ^g) = Σ_{l=1}^M Σ_{i=1}^N log(α_l) p(l | x_i, Θ^g)
                + Σ_{l=1}^M Σ_{i=1}^N log[ p_l(x_i | θ_l) ] p(l | x_i, Θ^g)
FINDING α_l

    Q(Θ, Θ^g) = Σ_{l=1}^M Σ_{i=1}^N log(α_l) p(l | x_i, Θ^g)
                + Σ_{l=1}^M Σ_{i=1}^N log[ p_l(x_i | θ_l) ] p(l | x_i, Θ^g)

Due to the constraint on the α_l's, we introduce a Lagrange multiplier λ and
solve the following equation:

    ∂/∂α_l [ Σ_{l=1}^M Σ_{i=1}^N log(α_l) p(l | x_i, Θ^g) + λ (Σ_{l=1}^M α_l - 1) ] = 0,    l = 1, …, M

    ⇒  Σ_{i=1}^N (1/α_l) p(l | x_i, Θ^g) + λ = 0,    l = 1, …, M

    ⇒  Σ_{i=1}^N p(l | x_i, Θ^g) + λ α_l = 0,    l = 1, …, M

Summing the last equation over l, and using Σ_l α_l = 1 and
Σ_l p(l | x_i, Θ^g) = 1, gives λ = -N, hence

    α_l = (1/N) Σ_{i=1}^N p(l | x_i, Θ^g)
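A sketch of the resulting update: compute the responsibilities p(l | x_i, Θ^g) and average them over the data. The data, parameters, and helper names are illustrative assumptions:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density, as in the GMM sketch earlier."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

def update_alphas(X, alphas, mus, Sigmas):
    # Responsibilities: rows i, columns l; normalize over l.
    R = np.array([[a * gaussian_pdf(x, mu, S)
                   for a, mu, S in zip(alphas, mus, Sigmas)] for x in X])
    R /= R.sum(axis=1, keepdims=True)
    return R.mean(axis=0)          # alpha_l = (1/N) sum_i p(l | x_i, Theta_g)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (70, 2))])
alphas = [0.5, 0.5]
mus = [np.zeros(2), np.full(2, 3.0)]
Sigmas = [np.eye(2), np.eye(2)]
print(update_alphas(X, alphas, mus, Sigmas))
```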