					Information Bottleneck
        versus
 Maximum Likelihood


    Felix Polyakov
        Overview of the talk
Brief review of the Information Bottleneck

• Maximum Likelihood

• Information Bottleneck and Maximum
  Likelihood

• Example from Image Segmentation
                   A Simple Example...

        Israel  Health  www  Drug  Jewish  Dos  Doctor  ...
Doc1    12      0       0    0     8       0    0       ...
Doc2    0       9       2    11    1       0    6       ...
Doc3    0       10      1    6     0       0    20      ...
Doc4    9       1       0    0     7       0    1       ...
Doc5    0       3       9    0     1       10   0       ...
Doc6    1       11      0    6     0       1    7       ...
Doc7    0       0       8    0     2       12   2       ...
Doc8    15      0       1    1     10      0    0       ...
Doc9    0       12      1    16    0       1    12      ...
Doc10   1       0       9    0     1       11   2       ...
...     ...     ...     ...  ...   ...     ...  ...     ...

                                                   N. Tishby
                            Simple Example

        Israel    Jewish     Health     Drug      Doctor     www Dos           ....
Doc1    12        8          0          0         0          0   0              ...
Doc4    9         7          1          0         1          0   0              ...
Doc8    15        10         0          1         0          1   0              ...

Doc2    0         1          9          11        6          2         0        ...
Doc3    0         0          10         6         20         1         0        ...
Doc6    1         0          11         6         7          0         1        ...
Doc9    0         0          12         16        12         1         1        ...

Doc5    0         1          3          0         0          9         10       ...
Doc7    0         2          0          0         2          8         12       ...
Doc10   1         1          0          0         2          9         11       ...
  ...       ...       ...         ...       ...        ...       ...    ...     ...
                                                                           N. Tishby
                    A new compact representation
           Israel    Jewish   Health Drug     Doctor   www Dos       ...

Cluster1    36        25        1     1         1       1      0     ...

Cluster2     1         1       42    39        45       4      2     ...

Cluster3     1         4        3     0         4      26     33     ...

   ...       ...       ...     ...    ...       ...    ...     ...   ...




The document clusters preserve the relevant
information that the documents carry about the words

                                                             N. Tishby
                               Feature Selection?
• NO ASSUMPTIONS about the source of the data

• Extracting relevant structure from data
   – functions of the data (statistics) that preserve information

• Information about what?

• Need a principle that is both general and precise.



                                                                    N. Tishby
Documents                    Words

[Diagram: documents X = {x1, ..., xn} are mapped to clusters Cx = {c1, ..., ck};
the clusters retain information I(Cx; Y) about the words Y = {y1, ..., ym}.]

                                      N. Tishby
     The information bottleneck or
      relevance through distortion
                                N. Tishby, F. Pereira, and W. Bialek


• We would like the relevant partitioning T to
  compress X as much as possible, and to capture
  as much information about Y as possible

          p(t|x)           p(y|t)
    X ------------> T ------------> Y

  $I(T;Y) \;=\; \sum_{t,y} p(y,t)\,\log\frac{p(y,t)}{p(y)\,p(t)} \;\le\; I(X;Y)$

• Goal: find q(T | X)
    – note the Markovian independence
      relation T ↔ X ↔ Y
[Diagram: the source joint p(X, Y) together with the compression X → T induces the joint q(T, Y).]
Variational problem

  Minimize the IB functional L = I(T;X) − β I(T;Y) over q(T | X),
  where β sets the trade-off between compression and preserved relevant information.

Iterative algorithm

  Alternate the self-consistent equations for q(t|x), q(t), and q(y|t)
  (written out later, in the IB-approach slide).
        Overview of the talk
Short review of the Information Bottleneck

Maximum Likelihood

• Information Bottleneck and Maximum
  Likelihood

• Example from Image Segmentation
                     A simple example...
A coin is known to be biased
The coin is tossed three times – two heads and one tail
Use ML to estimate the probability of throwing a head

•Model:
     −p(head) = P
     −p(tail) = 1 - P


                                 Likelihood of the Data
Try P = 0.2
L(O) = 0.2 * 0.2 * 0.8 = 0.032

Try P = 0.4
L(O) = 0.4 * 0.4 * 0.6 = 0.096

Try P = 0.6
L(O) = 0.6 * 0.6 * 0.4 = 0.144

Try P = 0.8
L(O) = 0.8 * 0.8 * 0.2 = 0.128

[Plot: L(O) against the probability of a head; the maximum is at P = 2/3.]
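A minimal sketch of this calculation in Python: the grid of candidate values for P is an arbitrary illustration, and the data are the two heads and one tail above; the maximizer comes out near 2/3, the analytic ML estimate.

# Evaluate L(O) = P * P * (1 - P) on a grid of candidate values for P
# and pick the maximizer (illustrative grid search, not from the slides).
import numpy as np

p_grid = np.linspace(0.01, 0.99, 99)
likelihood = p_grid ** 2 * (1 - p_grid)     # two heads, one tail

print("ML estimate of p(head):", p_grid[np.argmax(likelihood)])   # ~ 0.66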
        A bit more complicated example…: Mixture Model
• Three baskets B1, B2, B3 with white (O = 1),
  grey (O = 2), and black (O = 3) balls
• 15 balls were drawn as follows:
   1. Choose a basket i according to the prior p(i)
   2. Draw ball j from basket i with probability p(j | i)

• Use ML to estimate the model parameters θ given the observations:
  the sequence of ball colors

• Likelihood of the observations
  $L(y;\theta) \;=\; \prod_{n} \sum_{i} p(i)\, p(y_n \mid i)$

• Log-likelihood of the observations
  $\log L(y;\theta) \;=\; \sum_{n} \log \sum_{i} p(i)\, p(y_n \mid i)$

• Maximum likelihood of the observations
  $\hat\theta \;=\; \arg\max_{\theta} \log L(y;\theta)$
     Likelihood of the observed data
x – hidden random variables        [e.g. basket]
y – observed random variables      [e.g. color]
θ – model parameters               [e.g. they define p(y | x)]
θ0 – current estimate of the model parameters

$\log L(y;\theta) \;=\; \sum_x p(x \mid y;\theta_0)\,\log L(y;\theta) \;=\; E_{x|y;\theta_0}\big[\log p(y;\theta)\big]$

$\;=\; E_{x|y;\theta_0}\big[\log p(x,y;\theta)\big] \;-\; E_{x|y;\theta_0}\big[\log p(x \mid y;\theta)\big]$

$\;=\; \sum_x p(x \mid y;\theta_0)\,\log p(x,y;\theta) \;-\; \sum_x p(x \mid y;\theta_0)\,\log p(x \mid y;\theta)$

$\;=\; \sum_x p(x \mid y;\theta_0)\,\log p(x,y;\theta)
\;+\; \underbrace{\sum_x p(x \mid y;\theta_0)\,\log\frac{p(x \mid y;\theta_0)}{p(x \mid y;\theta)}}_{KL}
\;+\; \underbrace{\Big(-\sum_x p(x \mid y;\theta_0)\,\log p(x \mid y;\theta_0)\Big)}_{H}$

With $Q(\theta,\theta_0) \equiv \sum_x p(x \mid y;\theta_0)\,\log p(x,y;\theta)$ this reads

$\log L(y;\theta) \;=\; Q(\theta,\theta_0) \;+\; KL\big(p(x|y;\theta_0)\,\|\,p(x|y;\theta)\big) \;+\; H\big(x \mid y;\theta_0\big)$

$\log L(y;\theta) \;\ge\; Q(\theta,\theta_0) \;+\; H\big(x \mid y;\theta_0\big)$

   Expectation-maximization algorithm (I)
 1. Expectation
    − Compute $p(x \mid y;\theta_t)$
    − Get $Q(\theta,\theta_t) = E_{x|y;\theta_t}\big[\log p(x,y;\theta)\big]$
 2. Maximization
    − $\theta_{t+1} = \arg\max_{\theta} Q(\theta,\theta_t)$

• $\log L(y;\theta_{t+1}) \;\ge\; \log L(y;\theta_t)$
• The EM algorithm converges to a local maximum
• The log-likelihood is non-decreasing (examples)
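A self-contained sketch of these two steps for the basket/color mixture; the observed sequence, the number of baskets and colors, the initialization, and the iteration count are all illustrative assumptions, not values from the slides.

import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=15)                 # observed colors (baskets are hidden)
K, C = 3, 3                                     # number of baskets, number of colors

prior = np.full(K, 1.0 / K)                     # p(i), initial guess
emit = rng.dirichlet(np.ones(C), size=K)        # p(j | i), initial guess

for _ in range(50):
    # E-step: responsibilities p(x = i | y_n; theta_t)
    resp = prior[None, :] * emit[:, y].T        # shape (N, K)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: theta_{t+1} = argmax_theta Q(theta, theta_t)
    prior = resp.mean(axis=0)
    for j in range(C):
        emit[:, j] = resp[y == j].sum(axis=0)
    emit /= emit.sum(axis=1, keepdims=True)

print("log-likelihood:", np.log(emit[:, y].T @ prior).sum())

Each pass through the loop can only increase (never decrease) the printed log-likelihood, as the slide states.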
           EM – another approach
Goal:

$\log p(y;\theta) \;=\; \log \sum_x q(x)\,\frac{p(x,y;\theta)}{q(x)} \;\ge\; \sum_x q(x)\,\log\frac{p(x,y;\theta)}{q(x)}$

 Jensen's inequality for a concave function:
 $f\Big(\sum_{k=1}^{n} \alpha_k x_k\Big) \;\ge\; \sum_{k=1}^{n} \alpha_k\, f(x_k), \qquad \alpha_k \ge 0,\; \sum_k \alpha_k = 1$

$\sum_x q(x)\,\log\frac{p(x,y;\theta)}{q(x)} \;=\; \sum_x q(x)\,\log p(x,y;\theta) + H(q) \;\equiv\; F(q,\theta)$

$\log p(y;\theta) \;\ge\; \sum_x q(x)\,\log p(x,y;\theta) + H(q) \;=\; F(q,\theta)$

$0 \;\le\; \log p(y;\theta) - F(q,\theta) \;=\; \sum_x q(x)\,\log p(y;\theta) \;-\; \sum_x q(x)\,\log\frac{p(x,y;\theta)}{q(x)}$

$\;=\; \sum_x q(x)\,\log\frac{q(x)}{p(x \mid y;\theta)} \;=\; KL\big(q(x)\,\|\,p(x \mid y;\theta)\big)$

$0 \;\le\; \log p(y;\theta) - F(q,\theta) \;=\; KL\big(q(x)\,\|\,p(x \mid y;\theta)\big)$

$\log p(y;\theta) \;=\; F\big(p(x \mid y;\theta),\,\theta\big) \;=\; \max_q F(q;\theta)$
    Expectation-maximization algorithm (II)
  1. Expectation
      $\hat q_t \;=\; \arg\max_q F(q;\theta_t) \;=\; p(x \mid y;\theta_t)$
  2. Maximization
      $\theta_{t+1}(\hat q) \;=\; \arg\max_{\theta} F(\hat q;\theta)$

$\theta_{t+1} \;=\; \arg\max_{\theta} \Big[\sum_x \hat q(x)\,\log p(x,y;\theta) + H(\hat q)\Big]
\;=\; \arg\max_{\theta} \sum_x p(x \mid y;\theta_t)\,\log p(x,y;\theta)$

$\theta_{t+1} \;=\; \arg\max_{\theta} Q(\theta,\theta_t)$      ⇒  (I) and (II) are equivalent
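A small numerical check of the identity log p(y;θ) = max_q F(q;θ) used above, for a single observed symbol y; the 2×2 joint table below is an arbitrary stand-in for p(x, y; θ), chosen only for illustration.

import numpy as np

p_xy = np.array([[0.10, 0.30],        # p(x, y; theta): rows index x, columns index y
                 [0.25, 0.35]])
y = 1                                  # the observed symbol

p_y = p_xy[:, y].sum()                 # p(y; theta)
post = p_xy[:, y] / p_y                # p(x | y; theta)

def F(q):
    # F(q, theta) = sum_x q(x) log p(x, y; theta) + H(q)
    return q @ np.log(p_xy[:, y]) - q @ np.log(q)

print(np.log(p_y), F(post))            # equal: log p(y;theta) = F(p(x|y;theta), theta)
print(F(np.array([0.8, 0.2])))         # any other q gives a strictly smaller value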
     Scheme of the approach

log p y;t   F  px | y;t ;t 


 F  px | y;t ;t 1   F  px | y;t 1 ;t 1 
        Overview of the talk
Short review of the Information Bottleneck

Maximum Likelihood

Information Bottleneck and Maximum
 Likelihood for a toy problem

• Example from Image Segmentation
                                   Words - Y

                 Israel  Jewish  Health  Drug  Doctor  www  Dos   ...
         Doc1    12      8       0       0     0       0    0     ...
         Doc4    9       7       1       0     1       0    0     ...   } topic t = 1
         Doc8    15      10      0       1     0       1    0     ...

Documents - X
         Doc2    0       1       9       11    6       2    0     ...
         Doc3    0       0       10      6     20      1    0     ...   } topic t = 2
         Doc6    1       0       11      6     7       0    1     ...
         Doc9    0       0       12      16    12      1    1     ...

         Doc5    0       1       3       0     0       9    10    ...
         Doc7    0       2       0       0     2       8    12    ...   } topic t = 3
         Doc10   1       1       0       0     2       9    11    ...
         ...     ...     ...     ...     ...   ...     ...  ...   ...

Topics - t:
         t ~ π(t)
         x ~ π(x)
         y | t ~ π(y|t)
                                  Model parameters
[Bar chart: the prior π(X) over the documents X = 1, …, 10.]

Example
• xi = 9
   – t(9) = 2
   – sample from π(y | 2), get yi = “Drug”
   – set n(9, “Drug”) = n(9, “Drug”) + 1

   Sampling algorithm
   • For i = 1:N
          – choose xi by sampling from π(x)
          – choose yi by sampling from π(y | t(xi))
          – increase n(xi, yi) by one
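A minimal sketch of this sampling loop; π(x), π(y|t) and N below are made-up illustration values, while the document-to-topic map follows the toy assignment t(x) shown in the next table (0-indexed).

import numpy as np

rng = np.random.default_rng(1)
n_docs, n_words, n_topics, N = 10, 7, 3, 500

pi_x = rng.dirichlet(np.ones(n_docs))                          # pi(x)
pi_y_given_t = rng.dirichlet(np.ones(n_words), size=n_topics)  # pi(y | t)
t_of_x = np.array([0, 1, 1, 0, 2, 1, 2, 0, 1, 2])              # t(x), 0-indexed

n_xy = np.zeros((n_docs, n_words), dtype=int)                  # counts n(x, y)
for _ in range(N):
    x = rng.choice(n_docs, p=pi_x)                             # x_i ~ pi(x)
    y = rng.choice(n_words, p=pi_y_given_t[t_of_x[x]])         # y_i ~ pi(y | t(x_i))
    n_xy[x, y] += 1

print(n_xy)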
        Israel  Jewish  Health  Drug  Doctor  www  Dos   ...
Doc1    12      8       0       0     0       0    0     ...
Doc4    9       7       1       0     1       0    0     ...   π(y | t = 1)
Doc8    15      10      0       1     0       1    0     ...

Doc2    0       1       9       11    6       2    0     ...
Doc3    0       0       10      6     20      1    0     ...   π(y | t = 2)
Doc6    1       0       11      6     7       0    1     ...
Doc9    0       0       12      16    12      1    1     ...

Doc5    0       1       3       0     0       9    10    ...
Doc7    0       2       0       0     2       8    12    ...   π(y | t = 3)
Doc10   1       1       0       0     2       9    11    ...
...     ...     ...     ...     ...   ...     ...  ...   ...

                X      1   2   3   4   5   6   7   8   9   10
                t(X)   1   2   2   1   3   2   3   1   2   3
          Toy problem: which parameters
          maximize the likelihood?

$L\big(n(x,y) : \pi_x, \pi_t, \pi_{y|t}\big) \;=\; \prod_{x,y}\Big[\sum_t p\big(x, y, t : \pi_x, \pi_t, \pi_{y|t}\big)\Big]^{\,n(x,y)}$

            Israel  Jewish  Health  Drug  Doctor  www  Dos   ...
Cluster1    36      25      1       1     1       1    0     ...
Cluster2    1       1       42      39    45      4    2     ...
Cluster3    1       4       3       0     4       26   33    ...
   ...      ...     ...     ...     ...   ...     ...  ...   ...

t = topics
X = documents
Y = words

[Scheme: x → t(x), then y ~ π(y | t(x)).]
               EM approach
• E-step
   $p\big(t(x) = t \mid y(x)\big) \;\equiv\; q_x(t) \;=\; k_x\, \pi(t)\, e^{-n(x)\, KL\left(n(y|x)\,\|\,\pi(y|t)\right)}$
   (k_x is a normalization factor)

• M-step
   $\pi(t) \;\propto\; \sum_x q_x(t)$
   $\pi(y \mid t) \;\propto\; \sum_x n(x,y)\, q_x(t)$

   where $n(y \mid x) = \dfrac{n(x,y)}{n(x)}$
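A hedged sketch of one such EM iteration on a count matrix n(x, y), for instance the output of the sampling sketch above; pi_t and pi_y_given_t are the current parameter estimates, and the eps guard plus the row-wise shift of the KL values are numerical-stability choices, not part of the slides.

import numpy as np

def em_step(n_xy, pi_t, pi_y_given_t, eps=1e-12):
    n_x = n_xy.sum(axis=1)                                     # n(x)
    n_y_given_x = n_xy / n_x[:, None]                          # n(y | x)

    # E-step: q_x(t) = k_x * pi(t) * exp(-n(x) * KL(n(y|x) || pi(y|t)))
    kl = (n_y_given_x[:, None, :] *
          (np.log(n_y_given_x[:, None, :] + eps) -
           np.log(pi_y_given_t[None, :, :] + eps))).sum(axis=2)    # shape (X, T)
    q_x_t = pi_t[None, :] * np.exp(-n_x[:, None] *
                                   (kl - kl.min(axis=1, keepdims=True)))
    q_x_t /= q_x_t.sum(axis=1, keepdims=True)                  # k_x; the row shift cancels here

    # M-step: pi(t) ~ sum_x q_x(t),  pi(y|t) ~ sum_x n(x,y) q_x(t)
    pi_t = q_x_t.sum(axis=0) / q_x_t.sum()
    pi_y_given_t = q_x_t.T @ n_xy
    pi_y_given_t /= pi_y_given_t.sum(axis=1, keepdims=True)
    return q_x_t, pi_t, pi_y_given_t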
                       IB approach
$q(t \mid x) \;=\; \frac{q(t)}{Z(\beta, x)}\; e^{-\beta\, KL\left(p(y|x)\,\|\,q(y|t)\right)}$
   (Z(β, x) is a normalization factor)

$q_{IB}(x, y, t) \;=\; p(x, y)\, q(t \mid x)$

$q(t) \;=\; \sum_{x,y} q_{IB}(x, y, t) \;=\; \sum_x p(x)\, q(t \mid x)$

$q(y \mid t) \;=\; \frac{1}{q(t)} \sum_x q_{IB}(x, y, t) \;=\; \frac{1}{q(t)} \sum_x p(x, y)\, q(t \mid x)$
$q_x(t) = k_x\, \pi(t)\, e^{-n(x)\, KL\left(n(y|x)\,\|\,\pi(y|t)\right)}$
$\pi(t) \propto \sum_x q_x(t)$                                                              ML
$\pi(y \mid t) \propto \sum_x n(x,y)\, q_x(t)$
$N \equiv \sum_{x,y} n(x,y)$

Mapping:  $q_x(t) \leftrightarrow q(t \mid x)$,   $\tfrac{1}{N}\, n(x,y) \leftrightarrow p(x,y)$,   $N \leftrightarrow \beta r$   (r is a scaling constant)

$q(t \mid x) = \frac{q(t)}{Z(\beta, x)}\, e^{-\beta\, KL\left(p(y|x)\,\|\,q(y|t)\right)}$
$q(t) = \sum_{x,y} q_{IB}(x, y, t) = \sum_x p(x)\, q(t \mid x)$                             IB
$q(y \mid t) = \frac{1}{q(t)} \sum_x q_{IB}(x, y, t) = \frac{1}{q(t)} \sum_x p(x,y)\, q(t \mid x)$

With  $q_x(t) \leftrightarrow q(t \mid x)$,   $\tfrac{1}{N}\, n(x,y) \leftrightarrow p(x,y)$,   $N \leftrightarrow \beta r$:

• X is uniformly distributed
• r = |X|
⇒ The EM algorithm is equivalent to the IB iterative algorithm

                       IB ↔ ML mapping

[Diagram: the fixed points (+) of the likelihood and of the IB functional are mapped onto each other.]

• X is uniformly distributed
• β = n(x)

⇒ All the fixed points of the likelihood L are mapped to all the fixed points of the IB functional
   $L_{IB} = I(T;X) - \beta\, I(T;Y)$

⇒ At the fixed points:  $-\log L \;\leftrightarrow\; L_{IB} + \mathrm{const}$
                    IB ↔ ML mapping

[Diagram: the fixed points (+) of the likelihood and of the IB functional are mapped onto each other.]

• X is uniformly distributed
• β = n(x)

$-\tfrac{1}{r}\, F - H(Y) \;=\; L_{IB}$
$-F \;\leftrightarrow\; L_{IB} + \mathrm{const}$

⇒ Every algorithm increases F iff it decreases L_IB
                              Deterministic case
• N → ∞ (or β → ∞)

EM:   $q_x(t) = \begin{cases} 1, & t = \arg\min_{t'} KL\big(n(y|x)\,\|\,\pi(y|t')\big) \\ 0, & \text{otherwise} \end{cases}$          ML

        $q(y \mid t) \;\leftrightarrow\; \pi(y \mid t)$

IB:   $q^{new}(t \mid x) = \begin{cases} 1, & t = \arg\min_{t'} KL\big(p(y|x)\,\|\,q(y|t')\big) \\ 0, & \text{otherwise} \end{cases}$          IB

• N → ∞ (or β → ∞)
   – Uniformity of X is not required here
⇒ All the fixed points of L are mapped to all the fixed points of L_IB
⇒ −F ↔ L_IB + const
⇒ Every algorithm which finds a fixed point of L_IB induces a fixed point of L, and vice versa
⇒ In case of several different fixed points, the solution that maximizes L is mapped to the solution that minimizes L_IB
• This does not mean that q(t) = π(t)

                Example
π(x)     x                      t    EM: π(t)    IB: q(t)
2/3      Yellow submarine       1    1/2         2/3
1/3      Red bull               2    1/2         1/3
                                        (N → ∞, β → ∞)

  Non-uniform p(x)  ⇒  q(t) ≠ π(t)

When N → ∞, every algorithm increases F iff it decreases L_IB with β → ∞

• How large must N (or β) be?

• How is it related to the “amount of uniformity” in n(x)?
Simulations for iIB
Simulations for EM
              Simulations
• 200 runs = 100 (small N) + 100 (large N)

⇒ In 58 runs, iIB converged to a smaller value of (−F) than EM

⇒ In 46 runs, EM converged to a value of (−F) corresponding to a smaller value of L_IB
 Quality estimation for EM solution

• The quality of an IB solution is measured through the theoretical upper bound
  $\dfrac{I(T;Y)}{I(X;Y)} \;\le\; 1$

• Using the IB ↔ ML mapping, one can adopt this measure for the ML estimation problem, for
  large enough N
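A small sketch of this measure: it assumes a joint p(x, y) and a (possibly soft) assignment q(t|x), e.g. the outputs of the iterative IB sketch earlier, and returns I(T;Y)/I(X;Y).

import numpy as np

def mutual_information(p_ab, eps=1e-12):
    # I(A;B) for a joint distribution given as a 2-D array
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    return (p_ab * (np.log(p_ab + eps) - np.log(p_a @ p_b + eps))).sum()

def ib_quality(p_xy, q_t_given_x):
    q_ty = np.einsum('xy,xt->ty', p_xy, q_t_given_x)            # q(t, y)
    return mutual_information(q_ty) / mutual_information(p_xy)  # <= 1, since T - X - Y is a Markov chain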
       Summary: IB versus ML
• ML and IB approaches are equivalent under certain
  conditions
• Models comparison
   – The mixture model assumes that Y is independent of X
     given T(X):  X → T → Y
   – In the IB framework, T is defined through the IB
     Markovian independence relation:  T ↔ X ↔ Y
• Can adapt the quality estimation measure from IB to
  the ML estimation problem, for large N
            Overview of the talk
Brief review of the Information Bottleneck

Maximum Likelihood

Information Bottleneck and Maximum
 Likelihood

Example from Image Segmentation
  (L. Hermes et al.)
               The clustering model

• Pixels $o_i$, $i = 1, \ldots, n$
• Deterministic clusters $c_\nu$, $\nu = 1, \ldots, k$
• Boolean assignment matrix $M \in \{0, 1\}^{n \times k}$, $M_{i\nu} = 1$ iff pixel $o_i$ is assigned to cluster $c_\nu$
• Observations $X_i = \big(x_{i1}, \ldots, x_{i n_i}\big)$, with $n_i = q \cdot r$

[Figure: pixel $o_i$ with a $q \times r$ neighborhood of observations.]

• $p(x \mid \nu) \;=\; \sum_{\alpha = 1}^{l} p(\alpha \mid \nu)\; g\big(x \mid \mu_\alpha, S_\alpha\big)$
              1
                  Likelihood
• Discretization of the color space into intervals $I_j$

• Set $G_\alpha(j) \;=\; \int_{I_j} g\big(x \mid \mu_\alpha, S_\alpha\big)\, dx$

• Data likelihood
  $p(X, M \mid \theta) \;=\; \prod_{i \le n} \prod_{\nu \le k} \Big[\, p_\nu \prod_{j \le m} \Big( \sum_{\alpha \le l} p(\alpha \mid \nu)\, G_\alpha(j) \Big)^{n_{ij}} \Big]^{M_{i\nu}}$
Relation to the IB



  Log-Likelihood
$\mathcal{L}(\theta \mid X, M) \;=\; \log p(X, M \mid \theta)
\;=\; \sum_{i,\nu} M_{i\nu} \Big[ \log p_\nu \;+\; \sum_j n_{ij} \log\Big( \sum_\alpha p(\alpha \mid \nu)\, G_\alpha(j) \Big) \Big]$

 IB functional
$\mathcal{L}_{IB} \;=\; -\sum_{i,\nu} M_{i\nu} \Big[ \log p_\nu \;+\; \frac{\beta}{n_i} \sum_j n_{ij} \log\Big( \sum_\alpha p(\alpha \mid \nu)\, G_\alpha(j) \Big) \Big]$

• Assume that $n_i = \mathrm{const}$ and set $\beta = n_i$; then $\mathcal{L}_{IB} = -\log p(X, M \mid \theta) = -\mathcal{L}$
Images generated from the learned statistics
                             References

•   N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method.
•   N. Slonim and Y. Weiss. Maximum likelihood and the information bottleneck.
•   R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies
    incremental, sparse, and other variants.
•   J. Goldberger. Lecture notes.
•   L. Hermes, T. Zoller, and J. M. Buhmann. Parametric distributional clustering
    for image segmentation.




                                The end

				