Machine Learning Based on Attribute Interactions

Aleks Jakulin
Advised by Acad. Prof. Dr. Ivan Bratko
2003-2005

Learning = Modelling
Learning:
• Data shapes the model.
• The model is made of possible hypotheses.
Modelling:
• The model is generated by an algorithm.
• Utility is the goal of a model; Utility = -Loss.
[Diagram: hypothesis space, model, and data. "A bounds B": the fixed data sample restricts the model to be consistent with it.]

Our Assumptions about Models
• Probabilistic utility: logarithmic loss (alternatives: classification accuracy, Brier score, RMSE)
• Probabilistic hypotheses: multinomial distribution, mixture of Gaussians (alternatives: classification trees, linear models)
• Algorithm: maximum likelihood (greedy), Bayesian integration (exhaustive)
• Data: instances + attributes

Expected Minimum Loss = Entropy
[Diagram: a visualization of a probabilistic model P(A,C); the entropy of C's empirical probability distribution, p = [0.2, 0.8].]
• H(A): entropy of A, the information that came with the knowledge of A.
• H(A,C): joint entropy of A and C.
• I(A;C) = H(A) + H(C) - H(A,C): mutual information, or information gain. How much do A and C have in common?
• H(C|A) = H(C) - I(A;C): conditional entropy, the remaining uncertainty in C after learning A.

2-Way Interactions
• Probabilistic models take the form of P(A,B).
• We have two models:
  – Interaction allowed: P_Y(a,b) := F(a,b)
  – Interaction disallowed: P_N(a,b) := P(a)P(b) = F(a)G(b)
• The error that P_N makes when approximating P_Y:
  D(P_Y || P_N) := E_{x~P_Y}[L(x, P_N)] = I(A;B)  (mutual information)
• The same holds for predictive models: D(P(Y|A) || P(Y)) = I(A;Y).
• It also applies to Pearson's correlation coefficient, where P is a bivariate Gaussian obtained via maximum likelihood.

Rajski's Distance
• Attributes that have more in common can be visualized as closer together in some imaginary Euclidean space.
• How do we avoid the influence of many-/few-valued attributes? (Complex attributes seem to have more in common.)
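The entropy and mutual-information identities above are easy to check numerically. A minimal Python sketch; the joint distribution below is a made-up example, not data from the talk:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability cells are ignored."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Made-up joint distribution P(A, C); rows index A, columns index C.
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

H_A  = entropy(joint.sum(axis=1))   # marginal entropy H(A)
H_C  = entropy(joint.sum(axis=0))   # marginal entropy H(C)
H_AC = entropy(joint)               # joint entropy H(A,C)
I_AC = H_A + H_C - H_AC             # mutual information I(A;C)
H_C_given_A = H_C - I_AC            # conditional entropy H(C|A)
```

For the slide's example distribution p = [0.2, 0.8], `entropy` gives roughly 0.72 bits, the expected minimum logarithmic loss.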
• Rajski's distance: d(A,B) := 1 - I(A;B) / H(A,B)
• This is a metric (it satisfies, e.g., the triangle inequality).

Interactions between US Senators
[Figure: interaction matrix. Dark: strong interaction, high mutual information; light: weak interaction, low mutual information.]

A Taxonomy of Machine Learning Algorithms
[Figure: interaction dendrogram for the CMC dataset.]

3-Way Interactions
[Diagram: label C above attributes A and B. The importance of attribute A, the importance of attribute B, and the correlation between A and B are 2-way interactions; the 3-way interaction is what is common to A, B and C together and cannot be inferred from any subset of the attributes.]

Interaction Information
How informative are A and B together?
  I(A;B;C) := I(A,B;C) - I(A;C) - I(B;C)
            = I(B;C|A) - I(B;C)
            = I(A;C|B) - I(A;C)
(Partial) history of independent reinventions:
• Quastler '53 (Info. Theory in Biology) - measure of specificity
• McGill '54 (Psychometrika) - interaction information
• Han '80 (Information & Control) - multiple mutual information
• Yeung '91 (IEEE Trans. on Inf. Theory) - mutual information
• Grabisch & Roubens '99 (Int. J. of Game Theory) - Banzhaf interaction index
• Matsuda '00 (Physical Review E) - higher-order mutual information
• Brenner et al. '00 (Neural Computation) - average synergy
• Demšar '02 (a thesis in machine learning) - relative information gain
• Bell '03 (NIPS'02, ICA2003) - co-information
• Jakulin '02 - interaction gain

Interaction Dendrogram
[Figure: dendrogram separating useful attributes (farming, soil, vegetation) from useless attributes.]
In classification tasks we are only interested in those interactions that involve the label.

Interaction Graph
• The Titanic data set:
  – Label: survived?
  – Attributes: describe the passenger or crew member.
• 2-way interactions: Sex, then Class; Age is not as important.
• 3-way interactions:
  – Negative (blue: redundancy): the 'Crew' dummy is wholly contained within 'Class'; 'Sex' largely explains the death rate among the crew.
  – Positive (red: synergy):
    • Children from the first and second class were prioritized.
    • Men from the second class mostly died (third-class men and the crew were better off).
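Both quantities above, Rajski's distance d(A,B) = 1 - I(A;B)/H(A,B) and the interaction information I(A;B;C), can be computed directly from joint distributions. A minimal sketch (the function names are mine, not from the thesis):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability cells are ignored."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_info(joint):
    """I(X;Y) from a 2-D joint distribution P(X, Y)."""
    return (entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0))
            - entropy(joint))

def rajski_distance(joint):
    """d(A,B) = 1 - I(A;B) / H(A,B); a metric on attributes."""
    return 1.0 - mutual_info(joint) / entropy(joint)

def interaction_info(joint3):
    """I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C) from a 3-D joint P(A,B,C)."""
    p = np.asarray(joint3, dtype=float)
    i_ab_c = mutual_info(p.reshape(-1, p.shape[2]))  # (A,B) as one variable
    i_a_c = mutual_info(p.sum(axis=1))               # marginalize out B
    i_b_c = mutual_info(p.sum(axis=0))               # marginalize out A
    return i_ab_c - i_a_c - i_b_c
```

On the standard XOR example (C = A xor B with uniform inputs), `interaction_info` is +1 bit, pure synergy; identical attributes have Rajski distance 0 and independent ones distance 1.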
    • Female crew members had good odds of survival. (red: synergy, positive interaction)

An Interaction Drilled
• Data for ~600 people. What is the loss if we assume no interaction between eye color and hair color?
• Area corresponds to probability:
  – black square: actual probability
  – colored square: predicted probability
• Colors encode the type of error; the more saturated the color, the more "significant" the error:
  – blue: overestimate
  – red: underestimate
  – white: correct estimate
• KL-divergence: 0.178

Rules = Constraints
• No interaction: KL-d 0.178
• Rule 1 (blonde hair is connected with blue or green eyes): KL-d 0.134
• Rule 2 (black hair is connected with brown eyes): KL-d 0.045
• Both rules: KL-d 0.022

Attribute Value Taxonomies
• Interactions can also be computed between pairs of attribute (or label) values.
• This way we can structure attributes with many values (e.g., Cartesian products ☺).
[Figure: attribute value taxonomy for the ADULT/CENSUS dataset.]

Attribute Selection with Interactions
• 2-way interactions I(A;Y) are the staple of attribute selection.
  – Examples: information gain, Gini ratio, etc.
  – Myopia! We ignore both positive and negative interactions.
• Compare this with controlled 2-way interactions, I(A;Y | B,C,D,E,…):
  – Examples: Relief, regression coefficients.
  – We have to build a model on all attributes anyway, making many assumptions. What does it buy us?
  – When we add another attribute, the usefulness of a previous attribute may be reduced.

Attribute Subset Selection with NBC
• The calibration of the classifier (the expected likelihood of an instance's label) first improves, then deteriorates as we add attributes. The optimal number is ~8 attributes.
• Are the first few attributes important and the rest just noise?
• No! We sorted the attributes from the worst to the best: it is some of the best attributes that ruin the performance. Why? NBC gets confused by redundancies.
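The "no interaction" loss in the slide above is the KL-divergence between the observed joint distribution and the product of its marginals, which is exactly the mutual information between the two attributes. A sketch with made-up hair/eye counts (the talk's 0.178 value comes from its actual ~600-person table, which is not reproduced here):

```python
import numpy as np

def kl_divergence(p, q):
    """D(P || Q) in bits; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Made-up hair x eye-color counts (NOT the table from the talk).
counts = np.array([[50.0, 10.0],    # blonde hair: blue/green eyes, brown eyes
                   [15.0, 60.0]])   # black hair:  blue/green eyes, brown eyes
p_joint = counts / counts.sum()

# "No interaction" model: the product of the two marginals.
p_indep = np.outer(p_joint.sum(axis=1), p_joint.sum(axis=0))

loss = kl_divergence(p_joint, p_indep)  # equals I(hair; eyes)
```

Adding rules as constraints, as on the slide, amounts to replacing `p_indep` with a constrained model that shrinks this divergence.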
Accounting for Redundancies
• At each step, we pick the next best attribute, accounting for the attributes already in the model:
  – Fleuret's procedure [formula]
  – Our procedure [formula]
• Example: the naïve Bayesian classifier.
[Figure: interaction-proof vs. myopic attribute selection with the naïve Bayesian classifier.]

Predicting with Interactions
• Interactions are meaningful, self-contained views of the data.
• Can we use these views for prediction?
• It is easy if the views do not overlap: we just multiply them together and normalize: P(a,b)P(c)P(d,e,f).
• If they do overlap:
  P(y | x1, x2) ∝ P(x1, y) P(x2, y) / P(y)
• In a general overlap situation, the Kikuchi approximation efficiently handles the intersections between interactions, and the intersections of intersections.
• Algorithm: select interactions, use the Kikuchi approximation to fuse them into a joint prediction, and use this to classify.

Interaction Models
• Transparent and intuitive
• Efficient and quick
• Can be improved by replacing Kikuchi with conditional MaxEnt, and the Cartesian product with something better.

Summary of the Talk
• Interactions are a good metaphor for understanding models and data. They can be a part of the hypothesis space, but do not have to be.
• Probability is crucial for real-world problems.
• Watch your assumptions (utility, model, algorithm, data).
• Information theory provides solid notation.
• The Bayesian approach to modelling is very robust (naïve Bayes and Bayes nets are not Bayesian approaches).

Summary of Contributions
Theory:
• A meta-model of machine learning.
• A formal definition of a k-way interaction, independent of the utility and the hypothesis space.
• A thorough historical overview of related work.
• A novel view on interaction significance tests.
Practice:
• A number of novel visualization methods.
• A heuristic for efficient, non-myopic attribute selection.
• An interaction-centered machine learning method, Kikuchi-Bayes.
• A family of Bayesian priors for consistent modelling with interactions.
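The overlapping-views rule from "Predicting with Interactions", P(y|x1,x2) ∝ P(x1,y)P(x2,y)/P(y), can be sketched directly. The two view tables below are hypothetical numbers, not from the talk:

```python
import numpy as np

# Hypothetical estimated views P(x1, y) and P(x2, y); rows index the
# attribute value, columns index the label y.  The two views overlap
# only in y, so they must agree on the marginal P(y).
p_x1_y = np.array([[0.30, 0.10],
                   [0.20, 0.40]])
p_x2_y = np.array([[0.35, 0.15],
                   [0.15, 0.35]])
p_y = p_x1_y.sum(axis=0)            # shared marginal P(y)

def fuse(x1, x2):
    """P(y | x1, x2) proportional to P(x1, y) P(x2, y) / P(y)."""
    unnorm = p_x1_y[x1] * p_x2_y[x2] / p_y
    return unnorm / unnorm.sum()    # normalize over y

posterior = fuse(0, 1)              # predictive distribution over y
```

Dividing by P(y) compensates for counting the overlap twice; the Kikuchi approximation generalizes this correction to arbitrary patterns of intersections between views.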
