# Introduction to Bayesian Learning


## BAYESIAN LEARNING

Jianping Fan
Dept of Computer Science
UNC-Charlotte
## OVERVIEW

- Bayesian classification: one example
- E.g., how to decide if a patient is sick or healthy, based on:
  - a probabilistic model of the observed data
  - prior knowledge
## CLASSIFICATION PROBLEM

- Training data: examples of the form $(d, h(d))$, where $d$ are the data objects to classify (inputs) and $h(d)$ is the correct class label for $d$, with $h(d) \in \{1, \dots, K\}$
- Goal: given $d_{new}$, provide $h(d_{new})$
## WHY BAYESIAN?

- Provides practical learning algorithms, e.g., Naïve Bayes
- Prior knowledge and observed data can be combined
- It is a generative (model-based) approach, which offers a useful conceptual framework
  - E.g., sequences could also be classified, based on a probabilistic model specification
  - Any kind of object can be classified, based on a probabilistic model specification
## BAYES' RULE

Understanding Bayes' rule, where $d$ is the data and $h$ a hypothesis (model):

$$p(h \mid d) = \frac{P(d \mid h)\, P(h)}{P(d)} = \frac{P(d \mid h)\, P(h)}{\sum_h P(d \mid h)\, P(h)}$$

Rearranging,

$$p(h \mid d)\, P(d) = P(d \mid h)\, P(h), \qquad \text{i.e.}\qquad P(d, h) = P(d, h),$$

the same joint probability on both sides.

Who is who in Bayes' rule:

- $P(h)$: prior belief (probability of hypothesis $h$ before seeing any data)
- $P(d \mid h)$: likelihood (probability of the data if the hypothesis $h$ is true)
- $P(d) = \sum_h P(d \mid h)\, P(h)$: data evidence (marginal probability of the data)
- $P(h \mid d)$: posterior (probability of hypothesis $h$ after having seen the data $d$)
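To make the rule concrete, here is a minimal Python sketch; the hypothesis names and numbers are invented purely for illustration:

```python
# Minimal sketch of Bayes' rule: posterior is proportional to likelihood x prior.
# The hypotheses and probabilities below are illustrative, not from the slides.

def posterior(prior, likelihood):
    """prior: dict h -> P(h); likelihood: dict h -> P(d|h).
    Returns dict h -> P(h|d) via Bayes' rule."""
    evidence = sum(likelihood[h] * prior[h] for h in prior)  # P(d) = sum_h P(d|h)P(h)
    return {h: likelihood[h] * prior[h] / evidence for h in prior}

prior = {"h1": 0.7, "h2": 0.3}          # P(h): prior beliefs
likelihood = {"h1": 0.1, "h2": 0.8}     # P(d|h): how well each h explains d
print(posterior(prior, likelihood))     # {'h1': 0.2258..., 'h2': 0.7741...}
```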
## PROBABILITIES – AUXILIARY SLIDE FOR MEMORY REFRESHING

- Have two dice $h_1$ and $h_2$.
- The probability of rolling an $i$ given die $h_1$ is denoted $P(i \mid h_1)$. This is a conditional probability.
- Pick a die at random with probability $P(h_j)$, $j = 1$ or $2$. The probability of picking die $h_j$ and rolling an $i$ with it is called a joint probability and is $P(i, h_j) = P(h_j)\, P(i \mid h_j)$.
- For any events $X$ and $Y$, $P(X, Y) = P(X \mid Y)\, P(Y)$.
- If we know $P(X, Y)$, then the so-called marginal probability $P(X)$ can be computed as $P(X) = \sum_Y P(X, Y)$ (see the sketch below).
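A small script, with assumed and purely illustrative dice probabilities, that checks the marginalization identity numerically:

```python
# Two dice: a fair one (h1) and a loaded one (h2). All numbers are illustrative.
P_h = {"h1": 0.5, "h2": 0.5}                       # P(h_j): which die we pick
P_i_given_h = {
    "h1": {i: 1/6 for i in range(1, 7)},           # fair die
    "h2": {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1},  # loaded die
}

# Joint: P(i, h) = P(h) P(i|h); marginal: P(i) = sum_h P(i, h)
P_i = {i: sum(P_h[h] * P_i_given_h[h][i] for h in P_h) for i in range(1, 7)}
print(P_i[1])               # 0.5*(1/6) + 0.5*0.5 = 0.3333...
print(sum(P_i.values()))    # sanity check: the marginal sums to 1.0
```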
## DOES PATIENT HAVE CANCER OR NOT?

A patient takes a lab test and the result comes back positive. It is known that the test returns a correct positive result in only 98% of the cases and a correct negative result in only 97% of the cases. Furthermore, only 0.008 of the entire population has this disease.

1. What is the probability that this patient has cancer?
2. What is the probability that he does not have cancer?
3. What is the diagnosis?

The hypothesis space $H$ is: hypothesis 1, "cancer"; hypothesis 2, "$\neg$cancer". The data is "$+$" (the positive test result).

1. Applying Bayes' rule,

$$P(\text{cancer} \mid +) = \frac{P(+ \mid \text{cancer})\, P(\text{cancer})}{P(+)}$$

with $P(+ \mid \text{cancer}) = 0.98$, $P(\text{cancer}) = 0.008$, $P(+ \mid \neg\text{cancer}) = 0.03$, $P(\neg\text{cancer}) = 0.992$, and

$$P(+) = P(+ \mid \text{cancer})\, P(\text{cancer}) + P(+ \mid \neg\text{cancer})\, P(\neg\text{cancer}) = 0.98 \times 0.008 + 0.03 \times 0.992 = 0.0376$$

so $P(\text{cancer} \mid +) = 0.0078 / 0.0376 \approx 0.21$.

2. $P(\neg\text{cancer} \mid +) = 1 - P(\text{cancer} \mid +) \approx 0.79$.

3. Diagnosis?? (worked out below)
## CHOOSING HYPOTHESES

- Maximum Likelihood hypothesis:

$$h_{ML} = \arg\max_{h \in H} P(d \mid h)$$

- Generally we want the most probable hypothesis given the training data; this is the Maximum A Posteriori hypothesis:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid d)$$

- Useful observation: $h_{MAP}$ does not depend on the denominator $P(d)$, so $h_{MAP} = \arg\max_{h \in H} P(d \mid h)\, P(h)$.
## NOW WE COMPUTE THE DIAGNOSIS

- To find the Maximum Likelihood hypothesis, we evaluate $P(d \mid h)$ for the data $d$, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it:

$$P(+ \mid \text{cancer}) = 0.98 \qquad P(+ \mid \neg\text{cancer}) = 0.03$$

$$\Rightarrow \text{Diagnosis: } h_{ML} = \text{cancer}$$

- To find the Maximum A Posteriori hypothesis, we evaluate $P(d \mid h)\, P(h)$ for the data $d$, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it. This is the same as choosing the hypothesis that gives the higher posterior probability:

$$P(+ \mid \text{cancer})\, P(\text{cancer}) = 0.98 \times 0.008 = 0.0078$$

$$P(+ \mid \neg\text{cancer})\, P(\neg\text{cancer}) = 0.03 \times 0.992 = 0.0298$$

$$\Rightarrow \text{Diagnosis: } h_{MAP} = \neg\text{cancer}$$
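A minimal sketch of the two decision rules on this example, using the numbers from the slides:

```python
# ML vs. MAP diagnosis for the cancer test example from the slides.
prior = {"cancer": 0.008, "not_cancer": 0.992}          # P(h)
likelihood_pos = {"cancer": 0.98, "not_cancer": 0.03}   # P(+|h)

h_ml = max(likelihood_pos, key=lambda h: likelihood_pos[h])
h_map = max(prior, key=lambda h: likelihood_pos[h] * prior[h])

print("h_ML  =", h_ml)    # cancer      (0.98 > 0.03)
print("h_MAP =", h_map)   # not_cancer  (0.0298 > 0.0078)
```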
## NAÏVE BAYES CLASSIFIER

- What can we do if our data $d$ has several attributes?
- Naïve Bayes assumption: attributes that describe data instances are conditionally independent given the classification hypothesis:

$$P(d \mid h) = P(a_1, \dots, a_T \mid h) = \prod_t P(a_t \mid h)$$

- It is a simplifying assumption; obviously it may be violated in reality
- In spite of that, it works well in practice
- The Bayesian classifier that uses the Naïve Bayes assumption and computes the MAP hypothesis is called the Naïve Bayes classifier
- One of the most practical learning methods
- Successful applications:
  - Medical diagnosis
  - Text classification
## EXAMPLE: 'PLAY TENNIS' DATA

| Day   | Outlook  | Temperature | Humidity | Wind   | PlayTennis |
|-------|----------|-------------|----------|--------|------------|
| Day1  | Sunny    | Hot         | High     | Weak   | No         |
| Day2  | Sunny    | Hot         | High     | Strong | No         |
| Day3  | Overcast | Hot         | High     | Weak   | Yes        |
| Day4  | Rain     | Mild        | High     | Weak   | Yes        |
| Day5  | Rain     | Cool        | Normal   | Weak   | Yes        |
| Day6  | Rain     | Cool        | Normal   | Strong | No         |
| Day7  | Overcast | Cool        | Normal   | Strong | Yes        |
| Day8  | Sunny    | Mild        | High     | Weak   | No         |
| Day9  | Sunny    | Cool        | Normal   | Weak   | Yes        |
| Day10 | Rain     | Mild        | Normal   | Weak   | Yes        |
| Day11 | Sunny    | Mild        | Normal   | Strong | Yes        |
| Day12 | Overcast | Mild        | High     | Strong | Yes        |
| Day13 | Overcast | Hot         | Normal   | Weak   | Yes        |
| Day14 | Rain     | Mild        | High     | Strong | No         |
## NAÏVE BAYES SOLUTION

Classify any new datum instance $x = (a_1, \dots, a_T)$ as:

$$h_{NB} = \arg\max_h P(h)\, P(x \mid h) = \arg\max_h P(h) \prod_t P(a_t \mid h)$$

To do this based on training examples, we need to estimate the parameters from the training examples:

- For each target value (hypothesis) $h$: $\hat{P}(h) :=$ estimate of $P(h)$
- For each attribute value $a_t$ of each datum instance: $\hat{P}(a_t \mid h) :=$ estimate of $P(a_t \mid h)$

Based on the examples in the table, classify the following datum $x$:

$$x = (\text{Outlook}=\text{Sunny},\ \text{Temp}=\text{Cool},\ \text{Humidity}=\text{High},\ \text{Wind}=\text{Strong})$$

That means: play tennis or not?

$$h_{NB} = \arg\max_{h \in \{yes,\, no\}} P(h) \prod_t P(a_t \mid h) = \arg\max_{h \in \{yes,\, no\}} P(h)\, P(\text{sunny} \mid h)\, P(\text{cool} \mid h)\, P(\text{high} \mid h)\, P(\text{strong} \mid h)$$

Working:

$$P(\text{PlayTennis} = \text{yes}) = 9/14 = 0.64 \qquad P(\text{PlayTennis} = \text{no}) = 5/14 = 0.36$$

$$P(\text{strong} \mid \text{yes}) = 3/9 = 0.33 \qquad P(\text{strong} \mid \text{no}) = 3/5 = 0.60 \qquad \text{etc.}$$

$$P(yes)\, P(sunny \mid yes)\, P(cool \mid yes)\, P(high \mid yes)\, P(strong \mid yes) = 0.0053$$

$$P(no)\, P(sunny \mid no)\, P(cool \mid no)\, P(high \mid no)\, P(strong \mid no) = 0.0206$$

$$\Rightarrow \text{answer: } \text{PlayTennis}(x) = \text{no}$$
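The working above can be reproduced with a short script; this is a bare-bones sketch using relative-frequency estimates and no smoothing, matching the slides:

```python
from collections import Counter, defaultdict

# 'Play Tennis' data from the table: (Outlook, Temperature, Humidity, Wind) -> label
data = [
    ("Sunny","Hot","High","Weak","No"),    ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No"),
]

label_counts = Counter(row[-1] for row in data)   # for estimating P(h)
attr_counts = defaultdict(Counter)                # for estimating P(a_t | h)
for *attrs, label in data:
    for t, a in enumerate(attrs):
        attr_counts[(t, label)][a] += 1

def score(x, label):
    """P(h) * prod_t P(a_t|h), estimated by relative frequencies."""
    p = label_counts[label] / len(data)
    for t, a in enumerate(x):
        p *= attr_counts[(t, label)][a] / label_counts[label]
    return p

x = ("Sunny", "Cool", "High", "Strong")
for h in ("Yes", "No"):
    print(h, round(score(x, h), 4))                           # Yes 0.0053, No 0.0206
print("h_NB =", max(("Yes", "No"), key=lambda h: score(x, h)))  # -> No
```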
## LEARNING TO CLASSIFY TEXT

- Learn from examples which articles are of interest
- The attributes are the words
- Observe that the Naïve Bayes assumption just means that we have a random sequence model within each class!
- NB classifiers are among the most effective for this task
- Resources for those interested: Tom Mitchell, Machine Learning (book), Chapter 6
## RESULTS ON A BENCHMARK TEXT CORPUS

(results figure not reproduced here)
## REMEMBER

- Bayes' rule can be turned into a classifier
- Maximum A Posteriori (MAP) hypothesis estimation incorporates prior knowledge; Maximum Likelihood (ML) doesn't
- The Naive Bayes classifier is a simple but effective Bayesian classifier for vector data (i.e., data with several attributes) that assumes the attributes are independent given the class
- Bayesian classification is a generative approach to classification
## RESOURCES

- Naïve Bayes for text classification: Tom Mitchell, Machine Learning (book), Chapter 6
- Software: NB for classifying text: http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
- More about NB classification, beyond the scope of this module: http://www-2.cs.cmu.edu/~tom/NewChapters.html
## UNIVARIATE NORMAL SAMPLE

$$X \sim N(\mu, \sigma^2)$$

Sampling: $\mathbf{x} = (x_1, x_2, \dots, x_n)^T$

$$\hat{\mu} = ? \qquad \hat{\sigma} = ?$$

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
## MAXIMUM LIKELIHOOD

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Sampling: $\mathbf{x} = (x_1, x_2, \dots, x_n)^T$

$$L(\mu, \sigma^2 \mid \mathbf{x}) = f(\mathbf{x} \mid \mu, \sigma^2) = f(x_1 \mid \mu, \sigma^2) \cdots f(x_n \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left(-\sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Given $\mathbf{x}$, it is a function of $\mu$ and $\sigma^2$. We want to maximize it.
## LOG-LIKELIHOOD FUNCTION

$$L(\mu, \sigma^2 \mid \mathbf{x}) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left(-\sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

$$\begin{aligned}
l(\mu, \sigma^2 \mid \mathbf{x}) &= \log L(\mu, \sigma^2 \mid \mathbf{x}) \\
&= -\frac{n}{2}\log \sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2 \\
&= -\frac{n}{2}\log \sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\left(\sum_{i=1}^n x_i^2 - 2\mu \sum_{i=1}^n x_i + n\mu^2\right)
\end{aligned}$$

Maximize by setting

$$\frac{\partial}{\partial \mu}\, l(\mu, \sigma^2 \mid \mathbf{x}) = 0 \qquad\text{and}\qquad \frac{\partial}{\partial \sigma^2}\, l(\mu, \sigma^2 \mid \mathbf{x}) = 0$$
## MAX. THE LOG-LIKELIHOOD FUNCTION

$$l(\mu, \sigma^2 \mid \mathbf{x}) = -\frac{n}{2}\log\sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\left(\sum_{i=1}^n x_i^2 - 2\mu\sum_{i=1}^n x_i + n\mu^2\right)$$

$$\frac{\partial}{\partial \mu}\, l(\mu, \sigma^2 \mid \mathbf{x}) = \frac{1}{\sigma^2}\sum_{i=1}^n x_i - \frac{n\mu}{\sigma^2} = 0 \qquad\Longrightarrow\qquad \hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i$$
## MAX. THE LOG-LIKELIHOOD FUNCTION

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})^2$$

$$l(\mu, \sigma^2 \mid \mathbf{x}) = -\frac{n}{2}\log\sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\left(\sum_{i=1}^n x_i^2 - 2\mu\sum_{i=1}^n x_i + n\mu^2\right)$$

$$\frac{\partial}{\partial \sigma^2}\, l(\mu, \sigma^2 \mid \mathbf{x}) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n x_i^2 - \frac{\mu}{\sigma^4}\sum_{i=1}^n x_i + \frac{n\mu^2}{2\sigma^4} = 0$$

$$\Longrightarrow\qquad n\sigma^2 = \sum_{i=1}^n x_i^2 - 2\mu\sum_{i=1}^n x_i + n\mu^2$$

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 - \left(\frac{1}{n}\sum_{i=1}^n x_i\right)^2$$
## MISSING DATA

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 - \hat{\mu}^2$$

Sampling: $\mathbf{x} = (x_1, \dots, x_m, x_{m+1}, \dots, x_n)^T$, where $x_{m+1}, \dots, x_n$ are missing data.

$$\hat{\mu} = \frac{1}{n}\left(\sum_{i=1}^m x_i + \sum_{j=m+1}^n x_j\right)$$

$$\hat{\sigma}^2 = \frac{1}{n}\left(\sum_{i=1}^m x_i^2 + \sum_{j=m+1}^n x_j^2\right) - \hat{\mu}^2$$
## E-STEP

Let $\hat{\mu}^{(t)}, \hat{\sigma}^{2(t)}$ be the estimated parameters at the start of the $t$-th iteration. Replace the missing parts of the sufficient statistics by their conditional expectations:

$$E_{\hat{\mu}^{(t)}, \hat{\sigma}^{2(t)}}\!\left[\sum_{j=m+1}^n x_j \,\Big|\, \mathbf{x}\right] = (n-m)\,\hat{\mu}^{(t)} \qquad\Longrightarrow\qquad s_1^{(t)} = \sum_{i=1}^m x_i + (n-m)\,\hat{\mu}^{(t)}$$

$$E_{\hat{\mu}^{(t)}, \hat{\sigma}^{2(t)}}\!\left[\sum_{j=m+1}^n x_j^2 \,\Big|\, \mathbf{x}\right] = (n-m)\left[\big(\hat{\mu}^{(t)}\big)^2 + \hat{\sigma}^{2(t)}\right] \qquad\Longrightarrow\qquad s_2^{(t)} = \sum_{i=1}^m x_i^2 + (n-m)\left[\big(\hat{\mu}^{(t)}\big)^2 + \hat{\sigma}^{2(t)}\right]$$
## M-STEP

Plug the expected statistics $s_1^{(t)}, s_2^{(t)}$ into the maximum likelihood formulas:

$$\hat{\mu}^{(t+1)} = \frac{s_1^{(t)}}{n} \qquad \hat{\sigma}^{2(t+1)} = \frac{s_2^{(t)}}{n} - \big(\hat{\mu}^{(t+1)}\big)^2$$
## EXERCISE

$$X \sim N(\mu, \sigma^2)$$

$n = 40$ (10 data points missing). Estimate $\mu, \sigma^2$ using different initial conditions. Observed data:

|            |            |            |
|------------|------------|------------|
| 375.081556 | 243.548664 | 454.981077 |
| 362.275902 | 382.789939 | 479.685107 |
| 332.612068 | 374.419161 | 336.634962 |
| 351.383048 | 337.289831 | 407.030453 |
| 304.823174 | 418.928822 | 297.821512 |
| 386.438672 | 364.086502 | 311.267105 |
| 430.079689 | 343.854855 | 528.267783 |
| 395.317406 | 371.279406 | 419.841982 |
| 369.029845 | 439.241736 | 392.684770 |
| 365.343938 | 338.281616 | 301.910093 |
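A sketch of the E- and M-steps above applied to this exercise, treating the 30 listed values as observed; the initial values and stopping tolerance are arbitrary choices:

```python
import numpy as np

# 30 observed values from the exercise; n = 40, so 10 values are missing.
x_obs = np.array([
    375.081556, 243.548664, 454.981077, 362.275902, 382.789939, 479.685107,
    332.612068, 374.419161, 336.634962, 351.383048, 337.289831, 407.030453,
    304.823174, 418.928822, 297.821512, 386.438672, 364.086502, 311.267105,
    430.079689, 343.854855, 528.267783, 395.317406, 371.279406, 419.841982,
    369.029845, 439.241736, 392.684770, 365.343938, 338.281616, 301.910093,
])
n, m = 40, len(x_obs)

mu, var = 100.0, 1.0     # deliberately poor initial conditions
for _ in range(200):
    # E-step: expected sufficient statistics, with missing values filled in
    s1 = x_obs.sum() + (n - m) * mu
    s2 = (x_obs**2).sum() + (n - m) * (mu**2 + var)
    # M-step: plug into the maximum likelihood formulas
    mu_new = s1 / n
    var_new = s2 / n - mu_new**2
    if abs(mu_new - mu) < 1e-9 and abs(var_new - var) < 1e-9:
        break
    mu, var = mu_new, var_new

print(mu, var)   # different initializations converge to the same fixed point
```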
## MULTINOMIAL POPULATION

Sampling $N$ samples from a population with categories $C_1, \dots, C_n$ and probabilities $(p_1, p_2, \dots, p_n)$, $\sum_i p_i = 1$:

$$\mathbf{x} = (x_1, x_2, \dots, x_n)^T, \qquad x_i = \#\text{samples in } C_i, \qquad x_1 + x_2 + \cdots + x_n = N$$

$$p(\mathbf{x} \mid p_1, \dots, p_n) = \frac{N!}{x_1! \cdots x_n!}\, p_1^{x_1} \cdots p_n^{x_n}$$
## MAXIMUM LIKELIHOOD

$$p(\mathbf{x} \mid p_1, \dots, p_n) = \frac{N!}{x_1! \cdots x_n!}\, p_1^{x_1} \cdots p_n^{x_n}$$

Sampling $N$ samples from a population with category probabilities $\left(\frac{1-\theta}{2},\ \frac{\theta}{4},\ \frac{\theta}{4},\ \frac{1}{2}\right)$:

$$\mathbf{x} = (x_1, x_2, x_3, x_4)^T, \qquad x_i = \#\text{samples in } C_i, \qquad x_1 + x_2 + x_3 + x_4 = N$$

$$L(\theta \mid \mathbf{x}) = p(\mathbf{x} \mid \theta) = \frac{N!}{x_1! \cdots x_4!} \left(\frac{1-\theta}{2}\right)^{x_1} \left(\frac{\theta}{4}\right)^{x_2} \left(\frac{\theta}{4}\right)^{x_3} \left(\frac{1}{2}\right)^{x_4}$$

We want to maximize it.
## LOG-LIKELIHOOD

$$L(\theta \mid \mathbf{x}) = \frac{N!}{x_1! \cdots x_4!} \left(\frac{1-\theta}{2}\right)^{x_1} \left(\frac{\theta}{4}\right)^{x_2} \left(\frac{\theta}{4}\right)^{x_3} \left(\frac{1}{2}\right)^{x_4}$$

$$l(\theta \mid \mathbf{x}) = \log L(\theta \mid \mathbf{x}) = x_1 \log\frac{1-\theta}{2} + x_2 \log\frac{\theta}{4} + x_3 \log\frac{\theta}{4} + \text{const}$$

$$\frac{\partial}{\partial\theta}\, l(\theta \mid \mathbf{x}) = -\frac{x_1}{1-\theta} + \frac{x_2}{\theta} + \frac{x_3}{\theta} = 0$$

$$\Longrightarrow\ -x_1\theta + x_2(1-\theta) + x_3(1-\theta) = 0 \qquad\Longrightarrow\qquad \hat{\theta} = \frac{x_2 + x_3}{x_1 + x_2 + x_3}$$
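A quick check of the closed form against a brute-force search over $\theta$ (the counts are invented for illustration):

```python
import numpy as np

x1, x2, x3 = 20, 15, 17                      # illustrative category counts
theta_hat = (x2 + x3) / (x1 + x2 + x3)       # closed-form MLE from above

# Brute force: maximize the log-likelihood (dropping constants) over a grid
grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
loglik = x1 * np.log(1 - grid) + (x2 + x3) * np.log(grid)
print(theta_hat, grid[np.argmax(loglik)])    # both ~0.615
```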
## MIXED ATTRIBUTES

$$\hat{\theta} = \frac{x_2 + x_3}{x_1 + x_2 + x_3}$$

Sampling $N$ samples from the same population, but now only

$$\mathbf{x} = (x_1, x_2, x_3 + x_4)^T$$

is observed: $x_3$ is not available separately. The MLE above can no longer be evaluated directly, since it needs $x_3$.
## E-STEP

Sampling $N$ samples, $\mathbf{x} = (x_1, x_2, x_3 + x_4)^T$; $x_3$ is not available.

Given $\theta^{(t)}$, what can you say about $x_3$?

$$E_{\theta^{(t)}}[x_3 \mid \mathbf{x}] = (x_3 + x_4)\, \frac{\theta^{(t)}/4}{\theta^{(t)}/4 + 1/2} = \hat{x}_3^{(t)}$$
## M-STEP

$$\theta^{(t+1)} = \frac{x_2 + \hat{x}_3^{(t)}}{x_1 + x_2 + \hat{x}_3^{(t)}} \qquad\text{where}\qquad \hat{x}_3^{(t)} = (x_3 + x_4)\, \frac{\theta^{(t)}/4}{\theta^{(t)}/4 + 1/2}$$
## EXERCISE

$$\mathbf{x}_{obs} = (x_1,\ x_2,\ x_3 + x_4)^T = (38,\ 34,\ 125)^T$$

Estimate $\theta$ using different initial conditions.
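A sketch of this E/M iteration on the exercise data; the initial values and stopping rule are arbitrary choices:

```python
# EM for the mixed-attribute multinomial exercise: x_obs = (38, 34, 125).
x1, x2, x34 = 38, 34, 125

for theta0 in (0.1, 0.5, 0.9):       # different initial conditions
    theta = theta0
    for _ in range(100):
        # E-step: expected x3 given the current theta
        x3_hat = x34 * (theta / 4) / (theta / 4 + 1 / 2)
        # M-step: closed-form MLE with x3 replaced by its expectation
        theta_new = (x2 + x3_hat) / (x1 + x2 + x3_hat)
        if abs(theta_new - theta) < 1e-12:
            break
        theta = theta_new
    print(theta0, "->", round(theta, 6))   # all initializations agree
```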
## BINOMIAL/POISSON MIXTURE

$M$: married obasong; $X$: # children.

| # Children | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| # Obasongs | $n_0$ | $n_1$ | $n_2$ | $n_3$ | $n_4$ | $n_5$ | $n_6$ |

Married obasongs: $P(M) = 1 - \xi$ and $X \mid M \sim P(\lambda)$, i.e.

$$P(x \mid M) = \frac{\lambda^x e^{-\lambda}}{x!}$$

Unmarried obasongs (no children): $P(M^c) = \xi$, $P(X = 0 \mid M^c) = 1$.
## BINOMIAL/POISSON MIXTURE

$n_0 = n_A + n_B$. Unobserved data: $n_A$ = # married obasongs (with no children), $n_B$ = # unmarried obasongs.

| # Children | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| # Obasongs | $n_0$ | $n_1$ | $n_2$ | $n_3$ | $n_4$ | $n_5$ | $n_6$ |
| Complete data | $n_A,\ n_B$ | $n_1$ | $n_2$ | $n_3$ | $n_4$ | $n_5$ | $n_6$ |
| Probability | $p_A,\ p_B$ | $p_1$ | $p_2$ | $p_3$ | $p_4$ | $p_5$ | $p_6$ |

with

$$p_A = (1-\xi)\, e^{-\lambda}, \qquad p_B = \xi, \qquad p_x = (1-\xi)\, \frac{\lambda^x e^{-\lambda}}{x!} \quad (x = 1, 2, \dots)$$
## COMPLETE DATA LIKELIHOOD

$$\mathbf{n} = (n_A, n_B, n_1, \dots, n_6)^T, \qquad \mathbf{n}_{obs} = (n_0, n_1, \dots, n_6)^T, \qquad n_0 = n_A + n_B$$

$$L(\xi, \lambda \mid \mathbf{n}) = p(\mathbf{n} \mid \xi, \lambda) = \frac{(n_A + n_B + n_1 + \cdots + n_6)!}{n_A!\, n_B!\, n_1! \cdots n_6!}\; p_A^{n_A}\, p_B^{n_B}\, p_1^{n_1} \cdots p_6^{n_6}$$

with $p_A = (1-\xi)e^{-\lambda}$, $p_B = \xi$, and $p_x = (1-\xi)\,\dfrac{\lambda^x e^{-\lambda}}{x!}$ for $x = 1, 2, \dots$
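A small sketch of these cell probabilities; note that reading $\xi$ as the unmarried proportion is a reconstruction from the garbled slides, and the parameter values below are arbitrary:

```python
from math import exp, factorial

# Cell probabilities for the Binomial/Poisson mixture, as reconstructed above.
# xi = P(unmarried), lam = Poisson mean for married obasongs (values arbitrary).
xi, lam = 0.3, 2.0

p_A = (1 - xi) * exp(-lam)                      # married, 0 children
p_B = xi                                        # unmarried (always 0 children)
p = {x: (1 - xi) * lam**x * exp(-lam) / factorial(x) for x in range(1, 7)}

total = p_A + p_B + sum(p.values())
print(round(total, 4))   # ~1.0 (the x >= 7 tail is negligible for small lam)
```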
## MAXIMUM LIKELIHOOD

$$\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$$

$$L(\Theta \mid \mathbf{X}) = p(\mathbf{X} \mid \Theta) = \prod_{i=1}^N p(\mathbf{x}_i \mid \Theta)$$

$$\Theta^* = \arg\max_\Theta L(\Theta \mid \mathbf{X})$$
## LATENT VARIABLES

Incomplete data: $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$

Latent variables: $\mathbf{Y} = \{\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_N\}$

Complete data: $\mathbf{Z} = (\mathbf{X}, \mathbf{Y})$
## COMPLETE DATA LIKELIHOOD

Complete data: $\mathbf{Z} = (\mathbf{X}, \mathbf{Y})$, with $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and $\mathbf{Y} = \{\mathbf{y}_1, \dots, \mathbf{y}_N\}$.

$$L(\Theta \mid \mathbf{Z}) = p(\mathbf{Z} \mid \Theta) = p(\mathbf{X}, \mathbf{Y} \mid \Theta) = p(\mathbf{Y} \mid \mathbf{X}, \Theta)\, p(\mathbf{X} \mid \Theta)$$
## COMPLETE DATA LIKELIHOOD

Complete data: $\mathbf{Z} = (\mathbf{X}, \mathbf{Y})$

$$L(\Theta \mid \mathbf{Z}) = p(\mathbf{Y} \mid \mathbf{X}, \Theta)\, p(\mathbf{X} \mid \Theta)$$

- $L(\Theta \mid \mathbf{Z})$ is a function of the random variable $\mathbf{Y}$ (and of $\Theta$)
- $p(\mathbf{Y} \mid \mathbf{X}, \Theta)$ is a function of the latent variable $\mathbf{Y}$ and the parameter $\Theta$: if we are given $\Theta$, the result is in terms of the random variable $\mathbf{Y}$
- $p(\mathbf{X} \mid \Theta)$ is a function of the parameter $\Theta$: computable
## EXPECTATION STEP

$$L(\Theta \mid \mathbf{Z}) = p(\mathbf{X}, \mathbf{Y} \mid \Theta)$$

Let $\Theta^{(i-1)}$ be the parameter vector obtained at the $(i-1)$-th step. Define

$$Q(\Theta, \Theta^{(i-1)}) = E\!\left[\log L(\Theta \mid \mathbf{Z}) \,\middle|\, \mathbf{X}, \Theta^{(i-1)}\right] =
\begin{cases}
\displaystyle\int_{\mathbf{y} \in \Upsilon} \log p(\mathbf{X}, \mathbf{y} \mid \Theta)\; p(\mathbf{y} \mid \mathbf{X}, \Theta^{(i-1)})\, d\mathbf{y} & \text{continuous} \\[2ex]
\displaystyle\sum_{\mathbf{y} \in \Upsilon} \log p(\mathbf{X}, \mathbf{y} \mid \Theta)\; p(\mathbf{y} \mid \mathbf{X}, \Theta^{(i-1)}) & \text{discrete}
\end{cases}$$
## MAXIMIZATION STEP

$$L(\Theta \mid \mathbf{Z}) = p(\mathbf{X}, \mathbf{Y} \mid \Theta)$$

Let $\Theta^{(i-1)}$ be the parameter vector obtained at the $(i-1)$-th step. With $Q(\Theta, \Theta^{(i-1)}) = E[\log L(\Theta \mid \mathbf{Z}) \mid \mathbf{X}, \Theta^{(i-1)}]$ defined as in the E-step, the M-step chooses

$$\Theta^{(i)} = \arg\max_\Theta Q(\Theta, \Theta^{(i-1)})$$
## MIXTURE MODELS

- If there is reason to believe that a data set is comprised of several distinct populations, a mixture model can be used.
- It has the following form:

$$p(\mathbf{x} \mid \Theta) = \sum_{j=1}^M \alpha_j\, p_j(\mathbf{x} \mid \theta_j) \qquad\text{with}\qquad \sum_{j=1}^M \alpha_j = 1$$

$$\Theta = (\alpha_1, \dots, \alpha_M, \theta_1, \dots, \theta_M)$$
## MIXTURE MODELS

$$p(\mathbf{x} \mid \Theta) = \sum_{j=1}^M \alpha_j\, p_j(\mathbf{x} \mid \theta_j)$$

$$\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\} \qquad \mathbf{Y} = \{y_1, \dots, y_N\}$$

Let $y_i \in \{1, \dots, M\}$ represent the source that generates the data. Then:

$$p(\mathbf{x} \mid y = j, \Theta) = p_j(\mathbf{x} \mid \theta_j) \qquad p(y = j \mid \Theta) = \alpha_j$$
## MIXTURE MODELS

For the complete datum $\mathbf{z}_i = (\mathbf{x}_i, y_i)$:

$$p(\mathbf{z}_i \mid \Theta) = p(\mathbf{x}_i, y_i \mid \Theta) = p(y_i \mid \mathbf{x}_i, \Theta)\, p(\mathbf{x}_i \mid \Theta)$$

with $p(y = j \mid \Theta) = \alpha_j$ and $p(\mathbf{x} \mid y = j, \Theta) = p_j(\mathbf{x} \mid \theta_j)$.
## MIXTURE MODELS

$$p(\mathbf{x} \mid \Theta) = \sum_{j=1}^M \alpha_j\, p_j(\mathbf{x} \mid \theta_j)$$

$$p(y_i \mid \mathbf{x}_i, \Theta) = \frac{p(\mathbf{x}_i, y_i, \Theta)}{p(\mathbf{x}_i, \Theta)} = \frac{p(\mathbf{x}_i \mid y_i, \Theta)\, p(y_i, \Theta)}{p(\mathbf{x}_i, \Theta)} = \frac{p(\mathbf{x}_i \mid y_i, \Theta)\, p(y_i \mid \Theta)\, p(\Theta)}{p(\mathbf{x}_i \mid \Theta)\, p(\Theta)} = \frac{p(\mathbf{x}_i \mid y_i, \Theta)\, p(y_i \mid \Theta)}{p(\mathbf{x}_i \mid \Theta)}$$

$$= \frac{p_{y_i}(\mathbf{x}_i \mid \theta_{y_i})\, \alpha_{y_i}}{\sum_{j=1}^M \alpha_j\, p_j(\mathbf{x}_i \mid \theta_j)}$$
## EXPECTATION

For the mixture model, the complete-data log-likelihood of $(\mathbf{X}, \mathbf{y})$ is $\sum_{i=1}^N \log\big(\alpha_{y_i}\, p_{y_i}(\mathbf{x}_i \mid \theta_{y_i})\big)$, so

$$Q(\Theta, \Theta^g) = \sum_{\mathbf{y} \in \Upsilon} \sum_{l=1}^M \sum_{i=1}^N \delta_{y_i, l}\, \log\big(\alpha_l\, p_l(\mathbf{x}_i \mid \theta_l)\big) \prod_{j=1}^N p(y_j \mid \mathbf{x}_j, \Theta^g)$$

where $\delta_{y_i, l}$ is zero when $y_i \neq l$. Exchanging the order of summation:

$$Q(\Theta, \Theta^g) = \sum_{l=1}^M \sum_{i=1}^N \log\big(\alpha_l\, p_l(\mathbf{x}_i \mid \theta_l)\big) \sum_{y_1=1}^M \cdots \sum_{y_N=1}^M \delta_{y_i, l} \prod_{j=1}^N p(y_j \mid \mathbf{x}_j, \Theta^g)$$

The inner sum factorizes: the term with $y_i = l$ contributes $p(l \mid \mathbf{x}_i, \Theta^g)$, and each remaining sum equals 1,

$$\sum_{y_1=1}^M \cdots \sum_{y_N=1}^M \delta_{y_i, l} \prod_{j=1}^N p(y_j \mid \mathbf{x}_j, \Theta^g) = \left(\prod_{j \neq i}\, \underbrace{\sum_{y_j=1}^M p(y_j \mid \mathbf{x}_j, \Theta^g)}_{=\,1}\right) p(l \mid \mathbf{x}_i, \Theta^g) = p(l \mid \mathbf{x}_i, \Theta^g)$$

so that

$$Q(\Theta, \Theta^g) = \sum_{l=1}^M \sum_{i=1}^N \log\big(\alpha_l\, p_l(\mathbf{x}_i \mid \theta_l)\big)\, p(l \mid \mathbf{x}_i, \Theta^g)$$

$$= \sum_{l=1}^M \sum_{i=1}^N \log(\alpha_l)\, p(l \mid \mathbf{x}_i, \Theta^g) + \sum_{l=1}^M \sum_{i=1}^N \log\big(p_l(\mathbf{x}_i \mid \theta_l)\big)\, p(l \mid \mathbf{x}_i, \Theta^g)$$
## MAXIMIZATION

$$\Theta = (\alpha_1, \dots, \alpha_M, \theta_1, \dots, \theta_M)$$

Given the initial guess $\Theta^g$,

$$Q(\Theta, \Theta^g) = \sum_{l=1}^M \sum_{i=1}^N \log(\alpha_l)\, p(l \mid \mathbf{x}_i, \Theta^g) + \sum_{l=1}^M \sum_{i=1}^N \log\big(p_l(\mathbf{x}_i \mid \theta_l)\big)\, p(l \mid \mathbf{x}_i, \Theta^g)$$

We want to find the $\alpha$'s and $\theta$'s that maximize the above expectation; in fact, iteratively.
## THE GMM (GAUSSIAN MIXTURE MODEL)

Gaussian model of a $d$-dimensional source, say $j$:

$$p_j(\mathbf{x} \mid \boldsymbol{\mu}_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_j|^{1/2}}\, \exp\!\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_j)^T \Sigma_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j)\right), \qquad \theta_j = (\boldsymbol{\mu}_j, \Sigma_j)$$

GMM with $M$ sources:

$$p(\mathbf{x} \mid \boldsymbol{\mu}_1, \Sigma_1, \dots, \boldsymbol{\mu}_M, \Sigma_M) = \sum_{j=1}^M \alpha_j\, p_j(\mathbf{x} \mid \boldsymbol{\mu}_j, \Sigma_j), \qquad \alpha_j \ge 0,\quad \sum_{j=1}^M \alpha_j = 1$$
## GOAL

Mixture model:

$$p(\mathbf{x} \mid \Theta) = \sum_{l=1}^M \alpha_l\, p_l(\mathbf{x} \mid \theta_l), \qquad \Theta = (\alpha_1, \dots, \alpha_M, \theta_1, \dots, \theta_M), \qquad \text{subject to } \sum_{l=1}^M \alpha_l = 1$$

To maximize:

$$Q(\Theta, \Theta^g) = \sum_{l=1}^M \sum_{i=1}^N \log(\alpha_l)\, p(l \mid \mathbf{x}_i, \Theta^g) + \sum_{l=1}^M \sum_{i=1}^N \log\big(p_l(\mathbf{x}_i \mid \theta_l)\big)\, p(l \mid \mathbf{x}_i, \Theta^g)$$
## FINDING α_l

$$Q(\Theta, \Theta^g) = \sum_{l=1}^M \sum_{i=1}^N \log(\alpha_l)\, p(l \mid \mathbf{x}_i, \Theta^g) + \sum_{l=1}^M \sum_{i=1}^N \log\big(p_l(\mathbf{x}_i \mid \theta_l)\big)\, p(l \mid \mathbf{x}_i, \Theta^g)$$

Due to the constraint on the $\alpha_l$'s, we introduce the Lagrange multiplier $\lambda$ and solve the following equation:

$$\frac{\partial}{\partial \alpha_l}\left[\sum_{l=1}^M \sum_{i=1}^N \log(\alpha_l)\, p(l \mid \mathbf{x}_i, \Theta^g) + \lambda\left(\sum_{l=1}^M \alpha_l - 1\right)\right] = 0, \qquad l = 1, \dots, M$$

$$\sum_{i=1}^N \frac{1}{\alpha_l}\, p(l \mid \mathbf{x}_i, \Theta^g) + \lambda = 0, \qquad l = 1, \dots, M$$

$$\sum_{i=1}^N p(l \mid \mathbf{x}_i, \Theta^g) + \lambda\, \alpha_l = 0, \qquad l = 1, \dots, M$$

Summing this last equation over $l$, and using $\sum_l \alpha_l = 1$ and $\sum_l p(l \mid \mathbf{x}_i, \Theta^g) = 1$, gives $\lambda = -N$, hence

$$\alpha_l = \frac{1}{N} \sum_{i=1}^N p(l \mid \mathbf{x}_i, \Theta^g)$$