Feature Selection as Relevant Information Encoding

Naftali Tishby
School of Computer Science and Engineering
The Hebrew University, Jerusalem, Israel

NIPS 2001

Many thanks to:
Noam Slonim, Amir Globerson, Bill Bialek, Fernando Pereira, Nir Friedman
Feature Selection?

• NOT generative modeling!
  – no assumptions about the source of the data

• Extracting relevant structure from data
  – functions of the data (statistics) that preserve information

• Information about what?

• Approximate Sufficient Statistics

• Need a principle that is both general and precise.
  – Good Principles survive longer!
A Simple Example...

        Israel  Health  www  Drug  Jewish  Dos  Doctor  ...
Doc1    12      0       0    0     8       0    0       ...
Doc2    0       9       2    11    1       0    6       ...
Doc3    0       10      1    6     0       0    20      ...
Doc4    9       1       0    0     7       0    1       ...
Doc5    0       3       9    0     1       10   0       ...
Doc6    1       11      0    6     0       1    7       ...
Doc7    0       0       8    0     2       12   2       ...
Doc8    15      0       1    1     10      0    0       ...
Doc9    0       12      1    16    0       1    12      ...
Doc10   1       0       9    0     1       11   2       ...
...     ...     ...     ...  ...   ...     ...  ...     ...
Simple Example

        Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Doc1    12      8       0       0     0       0    0    ...
Doc4    9       7       1       0     1       0    0    ...
Doc8    15      10      0       1     0       1    0    ...

Doc2    0       1       9       11    6       2    0    ...
Doc3    0       0       10      6     20      1    0    ...
Doc6    1       0       11      6     7       0    1    ...
Doc9    0       0       12      16    12      1    1    ...

Doc5    0       1       3       0     0       9    10   ...
Doc7    0       2       0       0     2       8    12   ...
Doc10   1       1       0       0     2       9    11   ...
...     ...     ...     ...     ...   ...     ...  ...  ...
A new compact representation

          Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Cluster1  36      25      1       1     1       1    0    ...
Cluster2  1       1       42      39    45      4    2    ...
Cluster3  1       4       3       0     4       26   33   ...
...       ...     ...     ...     ...   ...     ...  ...  ...

The document clusters preserve the relevant information
between the documents and words.
[Figure: the documents X (x1, ..., xn) are mapped to clusters Cx (c1, ..., ck) that preserve the mutual information I(Cx; Y) with the words Y (y1, ..., ym).]
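As a minimal sketch of how such a compact representation is formed, the word-count rows of the documents assigned to each cluster are simply summed, giving one row per cluster. The array below reuses a few rows of the toy tables above; the cluster assignments and all names are illustrative choices of mine.

```python
import numpy as np

# A few document-by-word count rows from the toy table above
# (columns: Israel, Jewish, Health, Drug, Doctor, www, Dos).
counts = np.array([
    [12, 8, 0,  0,  0,  0, 0],   # Doc1  ("Israel/Jewish" group)
    [9,  7, 1,  0,  1,  0, 0],   # Doc4
    [0,  1, 9,  11, 6,  2, 0],   # Doc2  ("Health/Drug/Doctor" group)
    [0,  0, 10, 6,  20, 1, 0],   # Doc3
    [0,  1, 3,  0,  0,  9, 10],  # Doc5  ("www/Dos" group)
    [0,  2, 0,  0,  2,  8, 12],  # Doc7
])
# Assumed cluster assignment of each document row (0, 1, or 2).
labels = np.array([0, 0, 1, 1, 2, 2])

# The compact representation: sum the rows that fall in each cluster.
n_clusters = labels.max() + 1
cluster_counts = np.array([counts[labels == c].sum(axis=0) for c in range(n_clusters)])
print(cluster_counts)  # one word-count row per cluster, as in the table above
```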
Mutual information

How much does X tell us about Y?
I(X;Y): a function of the joint probability distribution p(x,y) -
the minimal number of yes/no questions (bits) one needs to ask
about x in order to learn all one can about Y.
Uncertainty removed about X when we know Y:
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
[Venn diagram: H(X|Y) and H(Y|X) are the non-overlapping parts of H(X) and H(Y); their overlap is I(X;Y).]
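A small numerical sketch of this definition, assuming the joint distribution is given as a table (the function names and the toy numbers below are mine):

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits for a joint distribution p(x,y) given as a 2-D array."""
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)          # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)          # marginal p(y)
    mask = pxy > 0                               # zero entries contribute nothing
    return float((pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])).sum())

def entropy(p):
    """H(p) in bits for a 1-D distribution."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Toy joint distribution p(x, y).
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])

i_xy = mutual_information(pxy)                   # about 0.278 bits
# Check the identity I(X;Y) = H(X) - H(X|Y), with H(X|Y) = sum_y p(y) H(p(x|y)).
px = pxy.sum(axis=1)
h_x_given_y = sum(pxy[:, j].sum() * entropy(pxy[:, j] / pxy[:, j].sum())
                  for j in range(pxy.shape[1]))
print(i_xy, entropy(px) - h_x_given_y)           # the two values agree
```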
Relevant Coding

• What are the questions that we need to ask about X in order to learn about Y?
• Need to partition X into relevant domains, or clusters, between which we really need to distinguish...

[Figure: the conditional distributions P(x|y1) and P(x|y2) carve X into the domains X|y1 and X|y2 that must be distinguished in order to predict Y.]
Bottlenecks and Neural Nets

• Auto-association: forcing compact representations
• X̂ is a relevant code of X w.r.t. Y

[Figure: a bottleneck network - the Input X (Sample 1 / Past) is squeezed through X̂ to predict the Output Y (Sample 2 / Future).]
• Q: How many bits are needed to determine the relevant representation?
  – we need to index the maximal number of non-overlapping "green blobs" (the conditional cells of size 2^{H(X|X̂)}) inside the "blue blob" (the typical set of X, of size 2^{H(X)}):

      2^{H(X)} / 2^{H(X|\hat{X})} = 2^{I(X;\hat{X})}      (mutual information!)

[Figure: the typical set of X covered by non-overlapping cells, one per value of X̂, reached through p(x̂|x).]
• The idea: find a compressed signal X̂ that needs only a short encoding (small I(X;X̂)) while preserving as much as possible of the information about the relevant signal Y (large I(X̂;Y)).

[Figure: the bottleneck chain - X is encoded through p(x̂|x) into X̂ (with marginal p(x̂)), which predicts Y through p(y|x̂); the encoder side costs I(X;X̂) and the decoder side retains I(X̂;Y).]
A Variational Principle

We want a short representation of X that keeps the information about another variable, Y, if possible.

[Figure: X → X̂ → Y; the "in" side carries the compression cost I(X;X̂), the "out" side carries the preserved relevance I(X̂;Y), out of the total available I(X;Y).]

      \mathcal{L}\big[p(\hat{x}|x)\big] = I_{in}(X;\hat{X}) - \beta\, I_{out}(\hat{X};Y)
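A minimal sketch of scoring a candidate encoder p(x̂|x) with this functional (the toy joint distribution, the hard-clustering encoder, and all function names below are my own illustrations; this only evaluates an encoder, it does not optimize it):

```python
import numpy as np

def mi(pab):
    """Mutual information (in nats) of a joint distribution given as a 2-D array."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float((pab[mask] * np.log(pab[mask] / (pa @ pb)[mask])).sum())

def ib_functional(pxy, encoder, beta):
    """L = I(X;Xhat) - beta * I(Xhat;Y) for an encoder p(xhat|x), one row per x."""
    px = pxy.sum(axis=1)
    p_x_xhat = px[:, None] * encoder             # joint p(x, xhat)
    p_xhat_y = encoder.T @ pxy                   # joint p(xhat, y) via the Xhat - X - Y Markov chain
    return mi(p_x_xhat) - beta * mi(p_xhat_y)

# Toy joint p(x,y) (4 values of x, 2 of y) and a hard 2-cluster encoder p(xhat|x).
pxy = np.array([[0.20, 0.05],
                [0.18, 0.07],
                [0.05, 0.20],
                [0.07, 0.18]])
encoder = np.array([[1.0, 0.0],
                    [1.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 1.0]])
print(ib_functional(pxy, encoder, beta=5.0))
```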
The Self-Consistent Equations

• Marginal:

      p(\hat{x}) = \sum_x p(\hat{x}|x)\, p(x)

• Markov condition:

      p(y|\hat{x}) = \sum_x p(y|x)\, p(x|\hat{x})

• Bayes' rule:

      p(x|\hat{x}) = \frac{p(\hat{x}|x)\, p(x)}{p(\hat{x})}

Setting the variation to zero gives the formal solution:

      \frac{\delta \mathcal{L}[p(\hat{x}|x)]}{\delta p(\hat{x}|x)} = 0
      \;\Rightarrow\;
      p(\hat{x}|x) = \frac{p(\hat{x})}{Z(x,\beta)}\, \exp\!\big(-\beta\, D_{KL}[x,\hat{x}]\big)
The emergent effective distortion measure:

      D_{KL}[x,\hat{x}] \equiv D_{KL}\big[\,p(y|x)\,\|\,p(y|\hat{x})\,\big]
      = \sum_y p(y|x)\, \log \frac{p(y|x)}{p(y|\hat{x})}

• Regular (finite) if p(y|x) is absolutely continuous w.r.t. p(y|x̂)
• Small if x̂ predicts y as well as x does:

[Figure: the direct channel x → y via p(y|x), versus the two-step channel x → x̂ (via p(x̂|x)) → y (via p(y|x̂)).]
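As a minimal numerical sketch of this distortion (the two conditional distributions below are invented):

```python
import numpy as np

def d_kl(p, q):
    """D_KL[p || q] in nats; assumes p is absolutely continuous w.r.t. q."""
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

# p(y|x) for a particular x, and p(y|xhat) for a candidate cluster xhat.
p_y_given_x  = np.array([0.7, 0.2, 0.1])
p_y_given_xh = np.array([0.6, 0.3, 0.1])
# A small value means xhat predicts y almost as well as x does.
print(d_kl(p_y_given_x, p_y_given_xh))
```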
The iterative algorithm (Generalized Blahut-Arimoto):

[Figure: the generalized BA algorithm cycles between the three distributions p(x̂|x), p(x̂), and p(y|x̂), updating each one from the other two.]
The Information Bottleneck Algorithm

      \min_{p(y|\hat{x})}\; \min_{p(\hat{x})}\; \min_{p(\hat{x}|x)}\;
      \big\langle -\log Z(x,\beta) \big\rangle_{p(x)}
      \qquad \text{("free energy")}

      = \min_{p(y|\hat{x}),\, p(\hat{x}),\, p(\hat{x}|x)}\;
      \Big[\, I(X;\hat{X}) + \beta\, \big\langle D_{KL}[x,\hat{x}] \big\rangle \,\Big]

The alternating iterations:

      p_{t+1}(\hat{x}|x) = \frac{p_t(\hat{x})}{Z_t(x,\beta)}\,
      \exp\!\big(-\beta\, D_{KL}[x,\hat{x}]\big)

      p_t(\hat{x}) = \sum_x p(x)\, p_t(\hat{x}|x)

      p_t(y|\hat{x}) = \sum_x p(y|x)\, p_t(x|\hat{x})
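A self-contained sketch of these iterations for a discrete p(x,y) (the initialization, the numerical clamping, and all names below are my own choices; this illustrates the update equations above, not the authors' implementation):

```python
import numpy as np

def information_bottleneck(pxy, n_clusters, beta, n_iter=200, seed=0):
    """Iterate the IB self-consistent equations for a discrete joint p(x, y).

    pxy : (nx, ny) joint distribution of X and Y.
    Returns the encoder q(xhat|x) as an (nx, n_clusters) array.
    """
    rng = np.random.default_rng(seed)
    nx, ny = pxy.shape
    px = pxy.sum(axis=1)                          # p(x)
    py_given_x = pxy / px[:, None]                # p(y|x), rows sum to 1

    # Random soft initialization of the encoder q(xhat|x).
    q = rng.random((nx, n_clusters))
    q /= q.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # p_t(xhat) = sum_x p(x) q_t(xhat|x)
        p_xh = px @ q
        # p_t(y|xhat) = sum_x p(y|x) p_t(x|xhat), with Bayes: p(x|xhat) ∝ q(xhat|x) p(x)
        p_x_given_xh = (q * px[:, None]) / np.maximum(p_xh, 1e-12)   # (nx, n_clusters)
        py_given_xh = p_x_given_xh.T @ py_given_x                    # (n_clusters, ny)

        # D_KL[p(y|x) || p(y|xhat)] for every (x, xhat) pair.
        log_ratio = np.log(np.maximum(py_given_x[:, None, :], 1e-12)
                           / np.maximum(py_given_xh[None, :, :], 1e-12))
        dkl = (py_given_x[:, None, :] * log_ratio).sum(axis=2)       # (nx, n_clusters)

        # q_{t+1}(xhat|x) ∝ p_t(xhat) exp(-beta * D_KL), normalized by Z_t(x, beta).
        logits = np.log(np.maximum(p_xh, 1e-12))[None, :] - beta * dkl
        logits -= logits.max(axis=1, keepdims=True)                  # numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)

    return q

# Toy example: 6 x-values whose conditionals p(y|x) fall into two groups.
pxy = np.array([[0.12, 0.02], [0.11, 0.03], [0.13, 0.01],
                [0.02, 0.12], [0.03, 0.11], [0.01, 0.29]])
print(information_bottleneck(pxy, n_clusters=2, beta=10.0).round(2))
```

Sweeping β and recording I(X;X̂) and I(X̂;Y) for the converged encoder traces out the concave information curve discussed on the next slide.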
• The Information plane: the optimal I(X̂;Y) for a given I(X;X̂) is a concave function:

[Figure: the information curve - I(X̂;Y)/I(X;Y) plotted against I(X;X̂)/H(X); the region above the curve is impossible, the region below is the possible phase, and the local slope of the curve, δI(X̂;Y)/δI(X;X̂), equals β⁻¹.]
Manifold of relevance

The self-consistent equations:

      \log p(\hat{x}|x) = \log p(\hat{x}) - \log Z(x,\beta)
      - \beta\, D_{KL}\big[\,p(y|x)\,\|\,p(y|\hat{x})\,\big]

      \log p(y|\hat{x}) = \log \sum_x p(y|x)\, p(x|\hat{x})

Assuming a continuous manifold for X̂ and differentiating with respect to x̂:

      \frac{\partial \log p(\hat{x}|x)}{\partial \hat{x}}
      = M_x[\hat{x}]\; \frac{\partial \log p(y|\hat{x})}{\partial \hat{x}}

      \frac{\partial \log p(y|\hat{x})}{\partial \hat{x}}
      = M_y[\hat{x}]\; \frac{\partial \log p(\hat{x}|x)}{\partial \hat{x}}

Coupled (local in X̂) eigenfunction equations, with β as an eigenvalue.
Document classification - information curves
Multivariate Information Bottleneck

• Complex relationships between many variables
• Multiple unrelated dimensionality-reduction schemes
• Trade off between known and desired dependencies
• Express IB in the language of Graphical Models
• A multivariate extension of Rate-Distortion Theory
Multivariate Information Bottleneck:
Extending the dependency graphs

      \mathcal{L}\big[p_1(T|x),\, p_2(x|T),\, p(T)\big]
      = I^{G_{in}}(X;T) - \beta\, I^{G_{out}}(T;Y)

      \tilde{I}(X_1, X_2, \ldots, X_n)
      = \sum_{x_1,\ldots,x_n} p(x_1,\ldots,x_n)\,
      \log \frac{p(x_1,\ldots,x_n)}{\prod_i p(x_i)}
      \qquad \text{(multi-information)}
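A small sketch of computing this quantity for a discrete joint distribution given as an n-dimensional array (the function name and the toy example are mine):

```python
import numpy as np

def multi_information(p):
    """KL divergence between the joint p(x1,...,xn) and the product of its marginals (nats)."""
    p = p / p.sum()
    n = p.ndim
    # Build the product of the marginals by broadcasting one axis at a time.
    prod = np.ones_like(p)
    for axis in range(n):
        marginal = p.sum(axis=tuple(a for a in range(n) if a != axis))
        shape = [1] * n
        shape[axis] = -1
        prod = prod * marginal.reshape(shape)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / prod[mask])).sum())

# For two variables, the multi-information reduces to the ordinary mutual information.
pxy = np.array([[0.4, 0.1], [0.1, 0.4]])
print(multi_information(pxy))   # about 0.19 nats (about 0.28 bits)
```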
Sufficient Dimensionality Reduction
(with Amir Globerson)

• Exponential families have sufficient statistics.
• Given a joint distribution P(x,y), find an approximation \tilde{P}(x,y) of the exponential form:

      \tilde{P}(x,y) = \frac{1}{Z[\phi,\psi]}\,
      \exp\!\Big( \sum_{r=1}^{d} \phi_r(x)\, \psi_r(y) \Big)

This can be done by alternating maximization of entropy under the constraints:

      \langle \phi_r \rangle_{\tilde{P}} = \langle \phi_r \rangle_{P}
      \;;\quad
      \langle \psi_r \rangle_{\tilde{P}} = \langle \psi_r \rangle_{P},
      \qquad 1 \le r \le d

The resulting functions \phi_r, \psi_r are our relevant features at rank d.
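A crude, self-contained sketch in this spirit: it alternates updates of φ (with ψ fixed) and ψ (with φ fixed) by gradient steps whose fixed points match the moment constraints. This is my own simplification for illustration, not the exact SDR updates; all names and the toy distribution are assumptions.

```python
import numpy as np

def fit_exponential_form(P, d, steps=500, lr=0.3, seed=0):
    """Fit P_tilde(x,y) ∝ exp(sum_r phi_r(x) psi_r(y)) to a joint P by
    alternating gradient steps on phi (psi fixed) and psi (phi fixed)."""
    rng = np.random.default_rng(seed)
    nx, ny = P.shape
    phi = 0.1 * rng.standard_normal((d, nx))
    psi = 0.1 * rng.standard_normal((d, ny))

    def model(phi, psi):
        logits = phi.T @ psi                      # sum_r phi_r(x) psi_r(y)
        weights = np.exp(logits)
        return weights / weights.sum()            # P_tilde(x, y)

    for _ in range(steps):
        # Update phi with psi fixed: the gradient of <log P_tilde>_P is the moment mismatch.
        D = P - model(phi, psi)                   # (nx, ny) mismatch between data and model
        phi += lr * (D @ psi.T).T                 # d<log P_tilde>/dphi_r(x) = sum_y D[x,y] psi_r(y)
        # Update psi with phi fixed.
        D = P - model(phi, psi)
        psi += lr * (phi @ D)                     # d<log P_tilde>/dpsi_r(y) = sum_x D[x,y] phi_r(x)

    return phi, psi, model(phi, psi)

# Toy joint distribution and a rank-1 approximation; phi and psi are the learned features.
P = np.array([[0.30, 0.05, 0.05],
              [0.05, 0.30, 0.05],
              [0.05, 0.05, 0.10]])
phi, psi, P_tilde = fit_exponential_form(P, d=1)
print(np.abs(P - P_tilde).max())                  # residual of the rank-1 exponential fit
```

With d much smaller than the alphabet sizes, the learned φ_r and ψ_r play the role of the rank-d relevant features described above.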
Summary

• We present a general information-theoretic approach for extracting relevant information.
• It is a natural generalization of Rate-Distortion theory, with similar convergence and optimality proofs.
• It unifies learning, feature extraction, filtering, and prediction...
• Applications (so far) include:
  – Word sense disambiguation
  – Document classification and categorization
  – Spectral analysis
  – Neural codes
  – Bioinformatics
  – Data clustering based on multi-distance distributions
  – ...

								