# phd-defense


```
Machine Learning based on Attribute Interactions
Aleks Jakulin

2003-2005
Learning = Modelling
- data shapes the model
- model is made of possible hypotheses
- model is generated by an algorithm
- utility is the goal of a model

Utility = -Loss

[Diagram: MODEL, Hypothesis Space, Data. Legend: B { A reads "A bounds B".]
The fixed data sample restricts the model to be consistent with it.
Learning Algorithm
• Probabilistic Utility: logarithmic loss
(alternatives: classification accuracy, Brier
score, RMSE)
• Probabilistic Hypotheses: multinomial
distribution, mixture of Gaussians
(alternatives: classification trees, linear
models)
• Algorithm: maximum likelihood (greedy),
Bayesian integration (exhaustive)
• Data: instances + attributes
Expected Minimum Loss = Entropy
The diagram is a visualization of a probabilistic model P(A,C)

[Diagram labels, for attribute A and label C:]
H(C): Entropy given C’s empirical probability distribution (p = [0.2, 0.8]).
H(A): Information which came with the knowledge of A.
H(C|A) = H(C) - I(A;C): Conditional entropy. Remaining uncertainty in C after learning A.
I(A;C) = H(A) + H(C) - H(AC): Mutual information or information gain. How much have A and C in common?
H(AC): Joint entropy.
2-Way Interactions
• Probabilistic models take the form of P(A,B)
• We have two models:
– Interaction allowed:    P_Y(a,b) := F(a,b)
– Interaction disallowed: P_N(a,b) := P(a)P(b) = F(a)G(b)
• The error that P_N makes when approximating P_Y:
D(P_Y || P_N) := E_{x ~ P_Y}[L(x, P_N)] = I(A;B)
(mutual information)
• Also applies for predictive models:
D(P(Y|A) || P(Y)) = I(A;Y)
• Also applies for Pearson’s correlation coefficient ρ:
if P is a bivariate Gaussian obtained via max. likelihood,
I(A;B) = -1/2 log(1 - ρ²)
Rajski’s Distance
• The attributes that have more in common can be
visualized as closer in some imaginary
Euclidean space.
• How to avoid the influence of many/few-valued
attributes? (Complex attributes seem to have
more in common.)
• Rajski’s distance:
d(A,B) = 1 - I(A;B) / H(AB)
• This is a metric (e.g., it satisfies the triangle inequality)
Interactions between US Senators
[Figure: interaction matrix. Dark: strong interaction, high mutual information. Light: weak interaction, low mutual information.]
A Taxonomy of Machine Learning Algorithms
[Figure: interaction dendrogram, CMC dataset.]
3-Way Interactions
[Diagram: label C, attribute A, attribute B.]
2-Way Interactions:
- importance of attribute A (A-C)
- importance of attribute B (B-C)
- attribute correlation (A-B)
3-Way Interaction:
What is common to A, B and C together;
and cannot be inferred from any subset of attributes.
Interaction Information
How informative are A and B together?

I(A;B;C) :=
I(AB;C) - I(A;C) - I(B;C)
= I(B;C|A) - I(B;C)
= I(A;C|B) - I(A;C)
(Partial) history of independent reinventions:
Quastler ’53 (Info. Theory in Biology)        - measure of specificity
McGill ’54 (Psychometrika)                    - interaction information
Han ’80 (Information & Control)               - multiple mutual information
Yeung ’91 (IEEE Trans. on Inf. Theory)        - mutual information
Grabisch & Roubens ’99 (I. J. of Game Theory) - Banzhaf interaction index
Matsuda ’00 (Physical Review E)               - higher-order mutual inf.
Brenner et al. ’00 (Neural Computation)       - average synergy
Demšar ’02 (A thesis in machine learning)     - relative information gain
Bell ’03 (NIPS02, ICA2003)                    - co-information
Jakulin ’02                                   - interaction gain
Interaction Dendrogram
[Figure: dendrogram with clusters labeled farming, soil, vegetation; annotations: useful attributes, useless attributes.]
We are only interested in those interactions that involve the label.
Interaction Graph
• The Titanic data set
– Label: survived?
– Attributes: describe the
passenger or crew member
• 2-way interactions:
– Sex then Class; Age not as
important
• 3-way interactions:
– negative: ‘Crew’ dummy is
wholly contained within ‘Class’;
‘Sex’ largely explains the death
rate among the crew.
– positive:
• Children from the first and
second class were prioritized.
• Men from the second class
mostly died (third class men and
the crew were better off)
• Female crew members had
good odds of survival.
[Graph legend: blue: redundancy, negative int.; red: synergy, positive int.]
An Interaction Drilled
Data for ~600 people

What’s the loss assuming no
interaction between eyes and hair?

Area corresponds to probability:
• black square: actual probability
• colored square: predicted
probability

Colors encode the type of error.
The more saturated the color, the
more “significant” the error. Codes:
• blue: overestimate
• red: underestimate
• white: correct estimate
KL-d: 0.178
Rules = Constraints
• No interaction: KL-d: 0.178
• Rule 1: Blonde hair is connected with blue or green eyes.
• Rule 2: Black hair is connected with brown eyes.
• Single rules (in slide order): KL-d: 0.045, KL-d: 0.134
• Both rules: KL-d: 0.022
Attribute Value Taxonomies
Interactions can also be computed between pairs
of attribute (or label) values. This way, we can
structure attributes with many values (e.g.,
Cartesian products ☺).

Attribute Selection with Interactions
• 2-way interactions I(A;Y) are the staple of
attribute selection
– Examples: information gain, Gini ratio, etc.
– Myopia! We ignore both positive and negative
interactions.
• Compare this with controlled 2-way interactions:
I(A;Y | B,C,D,E,…)
– Examples: Relief, regression coefficients
– We have to build a model on all attributes anyway,
making many assumptions… What does it buy us?
– We add another attribute, and the usefulness of a
previous attribute is reduced?
Attribute Subset Selection with NBC

The calibration of the classifier (expected likelihood of
an instance’s label) first improves then deteriorates
as we add attributes. The optimal number is ~8
attributes. The first few attributes are important, the
rest is noise?
Attribute Subset Selection with NBC

NO! We sorted the attributes from the worst to
the best. It is some of the best attributes that
ruin the performance! Why? NBC gets
confused by redundancies.
Accounting for Redundancies
At each step, we pick the next best attribute,
accounting for the attributes already in the
model:
– Fleuret’s procedure:

– Our procedure:
Example: the naïve Bayesian classifier
[Figure: axes labeled "interaction-proof" (vertical) and "myopic" (horizontal).]
Predicting with Interactions
• Interactions are meaningful self-contained views of the
data.
• Can we use these views for prediction?
• It’s easy if the views do not overlap: we just multiply
them together, and normalize: P(a,b)P(c)P(d,e,f)
• If they do overlap: P(y | x1, x2) ∝ P(x1, y) P(x2, y) / P(y)
• In a general overlap situation, Kikuchi approximation
efficiently handles the intersections between
interactions, and intersections-of-intersections.
• Algorithm: select interactions, use Kikuchi approximation
to fuse them into a joint prediction, use this to classify.
Interaction Models
•   Transparent and intuitive
•   Efficient
•   Quick
•   Can be improved by replacing
Kikuchi with Conditional MaxEnt,
and Cartesian product with
something better.
Summary of the Talk
• Interactions are a good metaphor for
understanding models and data. They can be a
part of the hypothesis space, but do not have to be.
• Probability is crucial for real-world problems.
• Watch your assumptions (utility, model,
algorithm, data)
• Information theory provides solid notation.
• The Bayesian approach to modelling is very
robust (naïve Bayes and Bayes nets are not
Bayesian approaches)
Summary of Contributions

Theory
• A meta-model of machine learning.
• A formal definition of a k-way interaction,
independent of the utility and hypothesis space.
• A thorough historic overview of related work.
• A novel view on interaction significance tests.

Practice
• A number of novel visualization methods.
• A heuristic for efficient non-myopic attribute selection.
• An interaction-centered machine learning method,
Kikuchi-Bayes
• A family of Bayesian priors for consistent
modelling with interactions.

```
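
The entropy, mutual information, and Rajski's distance slides can be illustrated with a minimal Python sketch that estimates each quantity from raw counts. The function names and toy data are illustrative assumptions, not taken from the talk, and the distance uses the standard definition d(A,C) = 1 - I(A;C) / H(AC).

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as outcome counts."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n > 0)


def mutual_information(pairs):
    """I(A;C) = H(A) + H(C) - H(AC), estimated from (a, c) observations."""
    h_a = entropy(Counter(a for a, _ in pairs))
    h_c = entropy(Counter(c for _, c in pairs))
    h_ac = entropy(Counter(pairs))
    return h_a + h_c - h_ac


def rajski_distance(pairs):
    """Rajski's distance: 0 when A determines C and vice versa, 1 when independent."""
    h_ac = entropy(Counter(pairs))
    if h_ac == 0:
        return 0.0  # a single joint outcome: nothing to tell apart
    return 1.0 - mutual_information(pairs) / h_ac


# Entropy of the label distribution from the slide, p = [0.2, 0.8]
print(entropy(Counter({"no": 2, "yes": 8})))  # about 0.72 bits

# Toy data (hypothetical): independent attributes vs. identical attributes
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
identical = [(0, 0), (1, 1)]
print(rajski_distance(independent), rajski_distance(identical))
```

Dividing by the joint entropy is what removes the advantage of many-valued attributes: mutual information alone grows with the number of values, while the ratio stays in [0, 1].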
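
The interaction information formula, I(A;B;C) = I(AB;C) - I(A;C) - I(B;C), can be checked on two small cases. This is a sketch under stated assumptions: the helper names and both toy datasets are hypothetical, chosen so that one case is pure synergy (C is the XOR of A and B) and the other pure redundancy (A, B and C are copies of each other).

```python
import math
from collections import Counter


def joint_entropy(rows, cols):
    """Joint entropy in bits of the selected column indices of a list of tuples."""
    counts = Counter(tuple(row[i] for i in cols) for row in rows)
    n = len(rows)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def mutual_info(rows, x, y):
    """I(X;Y) = H(X) + H(Y) - H(XY); x and y are lists of column indices."""
    return (joint_entropy(rows, x) + joint_entropy(rows, y)
            - joint_entropy(rows, x + y))


def interaction_information(rows, a=0, b=1, c=2):
    """I(A;B;C) = I(AB;C) - I(A;C) - I(B;C)."""
    return (mutual_info(rows, [a, b], [c])
            - mutual_info(rows, [a], [c])
            - mutual_info(rows, [b], [c]))


# Synergy: neither A nor B alone says anything about C, together they determine it.
xor_rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
# Redundancy: B repeats what A already says about C.
copy_rows = [(a, a, a) for a in (0, 1)]

print(interaction_information(xor_rows))   # positive: 1 bit of synergy
print(interaction_information(copy_rows))  # negative: 1 bit of redundancy
```

The sign convention matches the interaction graph legend: positive values indicate synergy, negative values indicate redundancy.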
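
The overlapping-views rule from the prediction slide, P(y | x1, x2) ∝ P(x1, y) P(x2, y) / P(y), can be sketched as follows. The function name `fuse_views` and the toy rows are assumptions for illustration; probabilities are plain relative frequencies with no smoothing, and this covers only the two-view case, not the general Kikuchi approximation.

```python
from collections import Counter


def fuse_views(rows, x1, x2):
    """Fuse the views P(x1, y) and P(x2, y), which overlap in y.

    rows: list of (x1, x2, y) observations.
    Returns a dict mapping each label y to P(y | x1, x2) under the rule
    P(x1, y) * P(x2, y) / P(y), normalized over y.
    """
    n = len(rows)
    view1 = Counter((r[0], r[2]) for r in rows)   # counts for P(x1, y)
    view2 = Counter((r[1], r[2]) for r in rows)   # counts for P(x2, y)
    labels = Counter(r[2] for r in rows)          # counts for P(y)

    scores = {}
    for y, n_y in labels.items():
        scores[y] = (view1[(x1, y)] / n) * (view2[(x2, y)] / n) / (n_y / n)

    z = sum(scores.values())
    if z == 0:
        raise ValueError("attribute values never co-occur with any label")
    return {y: s / z for y, s in scores.items()}


# Toy data (hypothetical): x1 determines y, x2 is irrelevant.
rows = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
posterior = fuse_views(rows, x1=0, x2=1)
print(posterior)
```

Dividing by P(y) corrects for the label being counted once in each view; algebraically the score equals P(y) P(x1 | y) P(x2 | y), the naïve Bayesian form.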