Feature Selection as Relevant Information Encoding

Naftali Tishby
School of Computer Science and Engineering
The Hebrew University, Jerusalem, Israel
NIPS 2001

Many thanks to: Noam Slonim, Amir Globerson, Bill Bialek, Fernando Pereira, Nir Friedman

Feature Selection?
• NOT generative modeling!
–no assumptions about the source of the data
• Extracting relevant structure from data
–functions of the data (statistics) that preserve information
• Information about what?
• Approximate sufficient statistics
• Need a principle that is both general and precise.
–Good principles survive longer!

A Simple Example...

Word counts per document:

        Israel  Health  www  Drug  Jewish  Dos  Doctor  ...
Doc1      12      0      0     0      8      0     0    ...
Doc2       0      9      2    11      1      0     6    ...
Doc3       0     10      1     6      0      0    20    ...
Doc4       9      1      0     0      7      0     1    ...
Doc5       0      3      9     0      1     10     0    ...
Doc6       1     11      0     6      0      1     7    ...
Doc7       0      0      8     0      2     12     2    ...
Doc8      15      0      1     1     10      0     0    ...
Doc9       0     12      1    16      0      1    12    ...
Doc10      1      0      9     0      1     11     2    ...

Reordering the rows and columns reveals the block structure:

        Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Doc1      12      8       0      0      0      0    0   ...
Doc4       9      7       1      0      1      0    0   ...
Doc8      15     10       0      1      0      1    0   ...
Doc2       0      1       9     11      6      2    0   ...
Doc3       0      0      10      6     20      1    0   ...
Doc6       1      0      11      6      7      0    1   ...
Doc9       0      0      12     16     12      1    1   ...
Doc5       0      1       3      0      0      9   10   ...
Doc7       0      2       0      0      2      8   12   ...
Doc10      1      1       0      0      2      9   11   ...

A new compact representation:

          Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Cluster1    36      25       1      1      1      1    0   ...
Cluster2     1       1      42     39     45      4    2   ...
Cluster3     1       4       3      0      4     26   33   ...

The document clusters preserve the relevant information between the documents and the words.

[Figure: documents x1, ..., xn in X are grouped into clusters c1, ..., ck in Cx; the clusters retain the mutual information I(Cx; Y) with the words y1, ..., ym in Y.]

Mutual information
How much does X tell about Y?
I(X;Y) is a function of the joint probability distribution p(x,y): the minimal number of yes/no questions (bits) we need to ask about x in order to learn all we can about Y.
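The claim that the clusters preserve the relevant information can be checked numerically on the toy table itself. The sketch below (plain NumPy; the helper name `mutual_information` is mine, not from the talk) computes I(X;Y) in bits from a joint count table and compares the full document table with its three-cluster summary:

```python
import numpy as np

def mutual_information(counts):
    """I(X;Y) in bits for a joint count table (rows: x, columns: y)."""
    p = counts / counts.sum()                 # joint p(x, y)
    px = p.sum(axis=1, keepdims=True)         # marginal p(x)
    py = p.sum(axis=0, keepdims=True)         # marginal p(y)
    mask = p > 0                              # 0 * log 0 is taken as 0
    return float((p[mask] * np.log2(p[mask] / (px @ py)[mask])).sum())

# Word counts per document (columns: Israel, Health, www, Drug, Jewish, Dos, Doctor)
docs = np.array([
    [12,  0, 0,  0,  8,  0,  0],
    [ 0,  9, 2, 11,  1,  0,  6],
    [ 0, 10, 1,  6,  0,  0, 20],
    [ 9,  1, 0,  0,  7,  0,  1],
    [ 0,  3, 9,  0,  1, 10,  0],
    [ 1, 11, 0,  6,  0,  1,  7],
    [ 0,  0, 8,  0,  2, 12,  2],
    [15,  0, 1,  1, 10,  0,  0],
    [ 0, 12, 1, 16,  0,  1, 12],
    [ 1,  0, 9,  0,  1, 11,  2],
], dtype=float)

# Merge rows into the three clusters from the slides: {1,4,8}, {2,3,6,9}, {5,7,10}
clusters = [np.array([0, 3, 7]), np.array([1, 2, 5, 8]), np.array([4, 6, 9])]
merged = np.stack([docs[idx].sum(axis=0) for idx in clusters])

print(mutual_information(docs))    # I(Doc; Word)
print(mutual_information(merged))  # I(Cluster; Word): most of it survives
```

Since the clusters are a deterministic function of the documents, I(Cluster; Word) can never exceed I(Doc; Word); the point of the example is how little is lost.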
Uncertainty removed about X when we know Y:

I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

[Figure: Venn diagram of H(X) and H(Y); the overlap is I(X;Y), the remainders are H(X|Y) and H(Y|X).]

Relevant Coding
• What are the questions that we need to ask about X in order to learn about Y?
• Need to partition X into relevant domains, or clusters, between which we really need to distinguish.

[Figure: the conditional distributions p(x|y1) and p(x|y2) induce a partition of X into domains X|y1 and X|y2.]

Bottlenecks and Neural Nets
• Auto-association: forcing compact representations.
• X̂ is a relevant code of Y w.r.t. X.

[Figure: bottleneck network with input X, output Y, and compressed hidden layer X̂; the same picture applies to prediction, with past as input and future as output.]

• Q: How many bits are needed to determine the relevant representation?
–We need to index the maximal number of non-overlapping blobs of size 2^H(X̂|X) inside a blob of size 2^H(X̂):

2^H(X̂) / 2^H(X̂|X) = 2^I(X;X̂)

so the answer is the mutual information I(X;X̂).
• The idea: find a compressed signal X̂ that needs a short encoding (small I(X;X̂)) while preserving as much as possible of the information about the relevant signal (I(X̂;Y)).

[Figure: the compression mapping p(x̂|x) determines I(X;X̂), and the induced p(y|x̂) determines I(X̂;Y).]

A Variational Principle
We want a short representation of X that keeps the information about another variable, Y, if possible. Minimize the functional

L[p(x̂|x)] = I_in(X;X̂) − β I_out(X̂;Y)

over the stochastic mappings p(x̂|x), where β is a positive Lagrange multiplier trading compression against preserved information.

The Self-Consistent Equations
• Marginal: p(x̂) = Σ_x p(x̂|x) p(x)
• Markov condition: p(y|x̂) = Σ_x p(y|x) p(x|x̂)
• Bayes' rule: p(x|x̂) = p(x̂|x) p(x) / p(x̂)

Setting δL/δp(x̂|x) = 0 yields

p(x̂|x) = [p(x̂) / Z(x,β)] exp(−β D_KL[x, x̂])

The emerged effective distortion measure:

D_KL[x, x̂] = D_KL[p(y|x) ‖ p(y|x̂)] = Σ_y p(y|x) log [p(y|x) / p(y|x̂)]

• Regular if p(y|x) is absolutely continuous w.r.t.
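The effective distortion above is just a KL divergence between predictive distributions, so it is a few lines to evaluate. A minimal numeric sketch (the function name and the toy conditionals are mine, chosen only for illustration):

```python
import numpy as np

def kl_bits(p, q):
    """D_KL(p || q) in bits; requires p absolutely continuous w.r.t. q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                              # 0 * log 0 is taken as 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Toy conditionals (made up): the distortion is small when x_hat predicts y
# almost as well as x does, and exactly zero when the predictions agree.
p_y_given_x    = np.array([0.7, 0.2, 0.1])
p_y_given_xhat = np.array([0.6, 0.25, 0.15])
print(kl_bits(p_y_given_x, p_y_given_xhat))  # small positive value
print(kl_bits(p_y_given_x, p_y_given_x))     # 0.0
```

Note the asymmetry: the divergence blows up exactly when p(y|x) puts mass where p(y|x̂) does not, which is the regularity condition stated above.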
p(y|x̂).
• Small if x̂ predicts y as well as x itself does.

[Figure: x determines y through p(y|x); the compressed x̂, obtained via p(x̂|x), should determine y nearly as well through p(y|x̂).]

The Information Bottleneck Algorithm (generalized Blahut-Arimoto)
Minimize the "free energy" −⟨log Z(x,β)⟩, i.e.

min over p(y|x̂), p(x̂), p(x̂|x) of:  I(X;X̂) + β ⟨D_KL[x, x̂]⟩

by alternating the self-consistent updates:

p_{t+1}(x̂|x) = [p_t(x̂) / Z_t(x,β)] exp(−β D_KL[x, x̂])
p_t(x̂) = Σ_x p(x) p_t(x̂|x)
p_t(y|x̂) = Σ_x p(y|x) p_t(x|x̂)

The Information Plane
The optimal I(X̂;Y) for a given I(X;X̂) is a concave function of I(X;X̂): the region above the curve is impossible, the region below it is achievable, and the slope of the curve is β⁻¹.

[Figure: information plane with axes I(X;X̂)/H(X) and I(X̂;Y)/I(X;Y), showing the concave optimal curve separating the impossible and possible phases.]

Manifold of Relevance
The self-consistent equations:

log p(x̂|x) = log p(x̂) − log Z(x,β) − β D_KL[x, x̂]
log p(y|x̂) = log Σ_x p(y|x) p(x|x̂)

Assuming a continuous manifold for X̂ and differentiating with respect to x̂ gives, schematically,

∂/∂x̂ log p(x̂|x) = β M_x[x̂] ∂/∂x̂ log p(y|x̂)
∂/∂x̂ log p(y|x̂) = M_y[x̂] ∂/∂x̂ log p(x̂|x)

coupled (local in x̂) eigenfunction equations, with β as an eigenvalue.
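The alternating updates above can be sketched in a few lines of NumPy. This is an illustrative implementation under my own conventions (random initialization, fixed iteration count, natural-log units for the distortion), not the authors' code:

```python
import numpy as np

def ib_iterate(p_xy, beta, n_clusters, n_iter=200, seed=0):
    """Self-consistent IB iterations (a generalized Blahut-Arimoto sketch).

    p_xy: joint distribution over (x, y), shape (nx, ny).
    Returns p(xhat|x), p(xhat), p(y|xhat).
    """
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                      # marginal p(x)
    p_y_x = p_xy / p_x[:, None]                 # p(y|x)
    q = rng.random((p_xy.shape[0], n_clusters)) # random init of p(xhat|x)
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_t = q.T @ p_x                         # p(xhat) = sum_x p(xhat|x) p(x)
        # Markov condition: p(y|xhat) = sum_x p(y|x) p(x|xhat)
        p_y_t = (q * p_x[:, None]).T @ p_y_x / p_t[:, None]
        # Effective distortion d(x, xhat) = D_KL[p(y|x) || p(y|xhat)]
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.log(p_y_x[:, None, :] / p_y_t[None, :, :])
            contrib = np.where(p_y_x[:, None, :] > 0,
                               p_y_x[:, None, :] * log_ratio, 0.0)
        d = contrib.sum(axis=2)
        q = p_t[None, :] * np.exp(-beta * d)    # p(xhat|x) ∝ p(xhat) e^{-beta d}
        q /= q.sum(axis=1, keepdims=True)       # normalization plays the role of Z(x, beta)
    return q, p_t, p_y_t

# Demo on a tiny joint distribution: x-values 0 and 1 predict y one way,
# x-values 2 and 3 the opposite way, so two clusters should suffice.
p_xy = np.array([[0.40, 0.10],
                 [0.35, 0.15],
                 [0.10, 0.40],
                 [0.15, 0.35]])
p_xy /= p_xy.sum()
q, p_t, p_y_t = ib_iterate(p_xy, beta=10.0, n_clusters=2)
```

Each pass applies exactly the three updates from the slide, so every iterate remains a properly normalized set of distributions; convergence to a local optimum follows from the free-energy argument the talk cites.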
Document classification: information curves

[Figure: information curves for document classification.]

Multivariate Information Bottleneck
• Complex relationships between many variables
• Multiple, unrelated dimensionality-reduction schemes
• Trade between known and desired dependencies
• Express IB in the language of graphical models
• A multivariate extension of Rate-Distortion Theory

Multivariate Information Bottleneck: extending the dependency graphs

L[p1(T|x), p2(x|T), p(T)] = I^(G_in)(X;T) − β I^(G_out)(T;Y)

where the multi-information of p(x1, ..., xn) is

I(X1; X2; ...; Xn) = Σ p(x1, ..., xn) log [ p(x1, ..., xn) / Π_i p(x_i) ]

Sufficient Dimensionality Reduction (with Amir Globerson)
• Exponential families have sufficient statistics.
• Given a joint distribution P(x,y), find an approximation of the exponential form:

P(x,y) ≈ (1/Z[φ,ψ]) exp( Σ_{r=1..d} φ_r(x) ψ_r(y) )

This can be done by alternating maximization of entropy under the constraints that the feature expectations under P and under the approximation P̂ agree:

⟨φ_r⟩_P = ⟨φ_r⟩_P̂ ;  ⟨ψ_r⟩_P = ⟨ψ_r⟩_P̂ ,  1 ≤ r ≤ d

The resulting functions φ_r, ψ_r are our relevant features at rank d.

Summary
• We present a general information-theoretic approach for extracting relevant information.
• It is a natural generalization of Rate-Distortion theory, with similar convergence and optimality proofs.
• It unifies learning, feature extraction, filtering, and prediction.
• Applications (so far) include:
–Word-sense disambiguation
–Document classification and categorization
–Spectral analysis
–Neural codes
–Bioinformatics
–Data clustering based on multi-distance distributions
–...
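The multi-information that the multivariate extension optimizes is straightforward to evaluate for small discrete distributions. A sketch (function name mine) that reduces to ordinary mutual information when n = 2:

```python
import numpy as np
from itertools import product

def multi_information(p, base=2.0):
    """I(X1; ...; Xn) = sum_x p(x1..xn) log[ p(x1..xn) / prod_i p(xi) ].

    p: n-dimensional array of joint probabilities.
    """
    p = np.asarray(p, dtype=float)
    # Marginal of each variable: sum out every other axis.
    marginals = [p.sum(axis=tuple(j for j in range(p.ndim) if j != i))
                 for i in range(p.ndim)]
    total = 0.0
    for idx in product(*(range(s) for s in p.shape)):
        pj = p[idx]
        if pj > 0:                              # 0 * log 0 is taken as 0
            prod = np.prod([marginals[i][idx[i]] for i in range(p.ndim)])
            total += pj * np.log(pj / prod) / np.log(base)
    return total

# For independent variables the multi-information is zero; for two correlated
# binary variables it equals their ordinary mutual information.
p_indep = np.outer([0.5, 0.5], [0.25, 0.75])
p_corr = np.array([[0.45, 0.05],
                   [0.05, 0.45]])
print(multi_information(p_indep))   # 0.0
print(multi_information(p_corr))    # ≈ 0.53 bits
```

The explicit loop over index tuples keeps the correspondence with the formula obvious; for larger alphabets one would vectorize it.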