Machine Learning based on Attribute Interactions

Aleks Jakulin
Advised by Acad. Prof. Dr. Ivan Bratko

2003-2005
Learning = Modelling
- data shapes the model
- the model is made of possible hypotheses
- the model is generated by an algorithm
- utility is the goal of a model


               Utility = -Loss



[Diagram: the learning algorithm selects a model from the hypothesis space; the fixed data sample restricts the model to be consistent with it (“A bounds B”).]
Our Assumptions about Models
• Probabilistic Utility: logarithmic loss
  (alternatives: classification accuracy, Brier
  score, RMSE)
• Probabilistic Hypotheses: multinomial
  distribution, mixture of Gaussians
  (alternatives: classification trees, linear
  models)
• Algorithm: maximum likelihood (greedy),
  Bayesian integration (exhaustive)
• Data: instances + attributes
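As a minimal illustration of the first assumption (the numbers are hypothetical, not from the thesis), here is how the logarithmic loss and the alternative Brier score are computed for a single predicted distribution:

import math

predicted = {"yes": 0.8, "no": 0.2}   # hypothetical predicted distribution
true_label = "yes"

# Logarithmic loss: -log P(true label).
log_loss = -math.log(predicted[true_label])

# Brier score: squared distance between the predicted distribution and
# the 0/1 indicator vector of the true label.
brier = sum((p - (1.0 if c == true_label else 0.0)) ** 2
            for c, p in predicted.items())

print(f"log loss = {log_loss:.3f}, Brier score = {brier:.3f}")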
Expected Minimum Loss = Entropy

The diagram is a visualization of a probabilistic model P(A,C).

Entropy, given C’s empirical probability distribution (p = [0.2, 0.8]):
   H(C) = -Σ_c P(c) log P(c)

[Venn diagram of the entropies of A and C:]
• H(A): the information which came with the knowledge of A
• H(AC): the joint entropy of A and C
• H(C|A) = H(C) - I(A;C): conditional entropy, the remaining uncertainty in C after learning A
• I(A;C) = H(A) + H(C) - H(AC): mutual information (information gain), how much A and C have in common
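A minimal sketch of these quantities (the joint table P(A,C) below is hypothetical, chosen so that C’s marginal is the slide’s p = [0.2, 0.8]):

import math

def H(dist):
    """Shannon entropy, in bits, of a probability mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical joint distribution P(A,C); C's marginal is [0.2, 0.8].
P_AC = {("a0", "c0"): 0.15, ("a0", "c1"): 0.35,
        ("a1", "c0"): 0.05, ("a1", "c1"): 0.45}

P_A, P_C = {}, {}
for (a, c), p in P_AC.items():
    P_A[a] = P_A.get(a, 0) + p
    P_C[c] = P_C.get(c, 0) + p

I_AC = H(P_A) + H(P_C) - H(P_AC)         # mutual information I(A;C)
print(f"H(C)   = {H(P_C):.3f}")          # uncertainty about C
print(f"I(A;C) = {I_AC:.3f}")            # what learning A tells us about C
print(f"H(C|A) = {H(P_C) - I_AC:.3f}")   # remaining uncertainty in C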
2-Way Interactions
• Probabilistic models take the form of P(A,B).
• We have two models:
   – Interaction allowed:    P_Y(a,b) := F(a,b)
   – Interaction disallowed: P_N(a,b) := P(a)P(b) = F(a)G(b)
• The error that P_N makes when approximating P_Y is the mutual information:
   D(P_Y || P_N) := E_{x~P_Y}[log P_Y(x)/P_N(x)] = I(A;B)
• The same holds for predictive models:
   E_A[ D( P(Y|A) || P(Y) ) ] = I(A;Y)
• It also connects to Pearson’s correlation coefficient when P is a bivariate Gaussian obtained via maximum likelihood:
   I(A;B) = -(1/2) log(1 - ρ²)
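A quick numeric check of the identity above (with an assumed joint table): the KL divergence from the joint P_Y to the independent approximation P_N is exactly the mutual information I(A;B).

import math

P = {("a0", "b0"): 0.30, ("a0", "b1"): 0.20,
     ("a1", "b0"): 0.10, ("a1", "b1"): 0.40}   # hypothetical P_Y(a,b)

Pa, Pb = {}, {}
for (a, b), p in P.items():
    Pa[a] = Pa.get(a, 0) + p
    Pb[b] = Pb.get(b, 0) + p

# D(P_Y || P_N) with P_N(a,b) = P(a)P(b)
kl = sum(p * math.log2(p / (Pa[a] * Pb[b])) for (a, b), p in P.items())
print(f"D(P_Y || P_N) = I(A;B) = {kl:.4f} bits")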
Rajski’s Distance
• Attributes that have more in common can be visualized as closer together in some imaginary Euclidean space.
• How do we avoid the influence of many-valued attributes? (Complex attributes seem to have more in common.)
• Rajski’s distance:
   D(A,B) := 1 - I(A;B)/H(AB)
• This is a metric (it satisfies, e.g., the triangle inequality).
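A minimal sketch (with a hypothetical joint table): dividing by the joint entropy H(AB) is what keeps many-valued attributes from appearing artificially close.

import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def rajski(P):
    """Rajski's distance 1 - I(A;B)/H(AB) for a joint table {(a,b): p}."""
    Pa, Pb = {}, {}
    for (a, b), p in P.items():
        Pa[a] = Pa.get(a, 0) + p
        Pb[b] = Pb.get(b, 0) + p
    H_ab = entropy(P)
    return 1.0 - (entropy(Pa) + entropy(Pb) - H_ab) / H_ab

P = {("a0", "b0"): 0.4, ("a0", "b1"): 0.1,
     ("a1", "b0"): 0.1, ("a1", "b1"): 0.4}
print(f"Rajski's distance = {rajski(P):.3f}")   # 0 iff A and B determine each other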
Interactions between US Senators

[Interaction matrix. Dark: strong interaction (high mutual information); light: weak interaction (low mutual information).]
A Taxonomy of Machine Learning Algorithms

[Interaction dendrogram, computed on the CMC dataset.]
3-Way Interactions

[Diagram: attributes A and B and the label C form a triangle. Its edges are the 2-way interactions: the importance of attribute A, the importance of attribute B, and the attribute correlation. At the centre sits the 3-way interaction.]

3-Way Interaction:
What is common to A, B and C together, and cannot be inferred from any subset of the attributes.
Interaction Information

How informative are A and B together?

   I(A;B;C) := I(AB;C) - I(A;C) - I(B;C)
             = I(B;C|A) - I(B;C)
             = I(A;C|B) - I(A;C)
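A minimal sketch using the classic XOR example (C = A xor B with independent fair bits A and B, an assumption chosen for illustration): each pairwise I(·;C) is zero, yet the interaction information is +1 bit, i.e., pure synergy. The entropy expansion in the code follows from the definition above.

import math
from itertools import product

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(P, keep):
    """Marginalize a joint table onto the given index positions."""
    out = {}
    for x, p in P.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0) + p
    return out

# Joint P(A,B,C) for C = A xor B, with A and B independent fair bits.
P = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

# I(A;B;C) = -H(A)-H(B)-H(C) + H(AB)+H(AC)+H(BC) - H(ABC)
I_ABC = (-H(marginal(P, (0,))) - H(marginal(P, (1,))) - H(marginal(P, (2,)))
         + H(marginal(P, (0, 1))) + H(marginal(P, (0, 2))) + H(marginal(P, (1, 2)))
         - H(P))
print(f"I(A;B;C) = {I_ABC:+.2f} bits")   # +1.00: A and B inform about C only jointly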
(Partial) history of independent reinventions:
• Quastler ’53 (Info. Theory in Biology): measure of specificity
• McGill ’54 (Psychometrika): interaction information
• Han ’80 (Information & Control): multiple mutual information
• Yeung ’91 (IEEE Trans. on Inf. Theory): mutual information
• Grabisch & Roubens ’99 (Int. J. of Game Theory): Banzhaf interaction index
• Matsuda ’00 (Physical Review E): higher-order mutual information
• Brenner et al. ’00 (Neural Computation): average synergy
• Demšar ’02 (a thesis in machine learning): relative information gain
• Bell ’03 (NIPS02, ICA2003): co-information
• Jakulin ’02: interaction gain
Interaction Dendrogram

[Dendrogram: clusters of useful attributes (e.g., farming, soil) and useless attributes (e.g., vegetation).]

In classification tasks we are only interested in those interactions that involve the label.
Interaction Graph
• The Titanic data set
   – Label: survived?
   – Attributes: describe the passenger or crew member
• 2-way interactions:
   – Sex, then Class; Age is not as important.
• 3-way interactions:
   – negative: the ‘Crew’ dummy is wholly contained within ‘Class’; ‘Sex’ largely explains the death rate among the crew.
   – positive:
      • Children from the first and second class were prioritized.
      • Men from the second class mostly died (third-class men and the crew were better off).
      • Female crew members had good odds of survival.

[Legend: blue: redundancy, negative interaction; red: synergy, positive interaction.]
An Interaction Drilled

Data for ~600 people.
What’s the loss assuming no interaction between eyes and hair?

Area corresponds to probability:
• black square: actual probability
• colored square: predicted probability

Colors encode the type of error. The more saturated the color, the more “significant” the error. Codes:
• blue: overestimate
• red: underestimate
• white: correct estimate

[No interaction: KL-d: 0.178]

Rules = Constraints

• Rule 1: Blonde hair is connected with blue or green eyes.
• Rule 2: Black hair is connected with brown eyes.

[No interaction: KL-d: 0.178. Rule 1 only: KL-d: 0.045. Rule 2 only: KL-d: 0.134. Both rules: KL-d: 0.022.]
Attribute Value Taxonomies

Interactions can also be computed between pairs of attribute (or label) values. This way, we can structure attributes with many values (e.g., Cartesian products ☺).

[Example: the ADULT/CENSUS data set.]
Attribute Selection with Interactions
• 2-way interactions I(A;Y) are the staple of attribute selection.
   – Examples: information gain, Gini ratio, etc.
   – Myopia! We ignore both positive and negative interactions.
• Compare this with controlled 2-way interactions, I(A;Y | B,C,D,E,…):
   – Examples: Relief, regression coefficients.
   – We have to build a model on all the attributes anyway, making many assumptions… What does that buy us?
   – If we add another attribute, the usefulness of a previously selected attribute may be reduced.
Attribute Subset Selection with NBC

[Plot: calibration vs. the number of attributes used.]

The calibration of the classifier (the expected likelihood of an instance’s label) first improves and then deteriorates as we add attributes. The optimal number is ~8 attributes. Are the first few attributes important, and the rest just noise?
Attribute Subset Selection with NBC

[Plot: the same experiment, with the attributes sorted from worst to best.]

NO! We sorted the attributes from the worst to the best. It is some of the best attributes that ruin the performance! Why? NBC gets confused by redundancies.
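A toy sketch of this failure mode (synthetic numbers, not the thesis experiment): duplicating an informative attribute makes naive Bayes double-count its evidence, so the predicted probabilities drift toward 0/1 and calibration deteriorates.

P_y = {0: 0.5, 1: 0.5}                                    # class prior
P_a_given_y = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}  # P(A=a | Y=y)

def nbc_posterior(evidence):
    """Naive Bayes posterior P(Y | evidence), evidence = copies of A."""
    post = dict(P_y)
    for a in evidence:
        for y in post:
            post[y] *= P_a_given_y[y][a]
    z = sum(post.values())
    return {y: p / z for y, p in post.items()}

print(nbc_posterior([0]))        # one copy:   P(Y=0) = 0.80 (calibrated)
print(nbc_posterior([0, 0]))     # duplicated: P(Y=0) = 0.94 (overconfident)
print(nbc_posterior([0, 0, 0]))  # triplicate: P(Y=0) = 0.98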
Accounting for Redundancies

At each step, we pick the next best attribute, accounting for the attributes already in the model:
• Fleuret’s procedure: choose the A that maximizes min_{B in model} I(A;Y|B)
• Our procedure: [formula shown as a slide graphic]
[Plot: example with the naïve Bayesian classifier; axes: interaction-proof (vertical) vs. myopic (horizontal).]
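A sketch of a greedy selection loop in the spirit of Fleuret’s criterion above (the dataset and helper names are hypothetical, and the thesis’s own scoring formula is not reproduced here): each step adds the attribute whose worst-case conditional informativeness, min over already-chosen B of I(A;Y|B), is largest.

import math
from collections import Counter

def cond_mi(rows, a, y, b):
    """Empirical I(A;Y|B) in bits; rows is a list of dicts."""
    n = len(rows)
    n_ayb, n_ab, n_yb, n_b = Counter(), Counter(), Counter(), Counter()
    for r in rows:
        n_ayb[(r[a], r[y], r[b])] += 1
        n_ab[(r[a], r[b])] += 1
        n_yb[(r[y], r[b])] += 1
        n_b[r[b]] += 1
    return sum(c / n * math.log2(c * n_b[vb] / (n_ab[(va, vb)] * n_yb[(vy, vb)]))
               for (va, vy, vb), c in n_ayb.items())

def select(rows, attrs, label, k):
    """Greedy selection: score(A) = min over chosen B of I(A;label|B)."""
    for r in rows:                  # constant column: I(A;Y|const) = I(A;Y)
        r["_const"] = 0
    chosen = []
    while len(chosen) < k:
        def score(a):
            return min(cond_mi(rows, a, label, b)
                       for b in (chosen or ["_const"]))
        chosen.append(max((a for a in attrs if a not in chosen), key=score))
    return chosen

# x2 duplicates x1; a myopic ranking would take both, this loop does not.
rows = [{"x1": a, "x2": a, "x3": b, "y": a | b} for a in (0, 1) for b in (0, 1)]
print(select(rows, ["x1", "x2", "x3"], "y", 2))   # ['x1', 'x3']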
Predicting with Interactions
• Interactions are meaningful, self-contained views of the data.
• Can we use these views for prediction?
• It’s easy if the views do not overlap: we just multiply them together and normalize: P(a,b)P(c)P(d,e,f)
• If the views overlap only in the label, we divide out the shared part (see the sketch after this list):
   P(y | x1, x2) ∝ P(x1, y) P(x2, y) / P(y)
• In a general overlap situation, the Kikuchi approximation efficiently handles the intersections between interactions, and the intersections-of-intersections.
• Algorithm: select interactions, use the Kikuchi approximation to fuse them into a joint prediction, and use this to classify.
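A minimal sketch of the label-overlap case from the formula above (the two view tables are hypothetical): multiply the views and divide out the shared prior P(y).

def fuse(view1, view2, prior, x1, x2):
    """P(y | x1, x2) proportional to P(x1,y) * P(x2,y) / P(y)."""
    scores = {y: view1[(x1, y)] * view2[(x2, y)] / prior[y] for y in prior}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

P_y = {0: 0.5, 1: 0.5}
P_x1y = {(0, 0): 0.40, (0, 1): 0.10, (1, 0): 0.10, (1, 1): 0.40}  # hypothetical
P_x2y = {(0, 0): 0.35, (0, 1): 0.15, (1, 0): 0.15, (1, 1): 0.35}

print(fuse(P_x1y, P_x2y, P_y, x1=0, x2=0))   # both views favour y = 0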
Interaction Models
• Transparent and intuitive
• Efficient
• Quick
• Can be improved by replacing Kikuchi with conditional MaxEnt, and the Cartesian product with something better.
Summary of the Talk
• Interactions are a good metaphor for understanding models and data. They can be a part of the hypothesis space, but do not have to be.
• Probability is crucial for real-world problems.
• Watch your assumptions (utility, model, algorithm, data).
• Information theory provides solid notation.
• The Bayesian approach to modelling is very robust (naïve Bayes and Bayes nets are not Bayesian approaches).
Summary of Contributions

Theory:
• A meta-model of machine learning.
• A formal definition of a k-way interaction, independent of the utility and the hypothesis space.
• A thorough historic overview of related work.
• A novel view on interaction significance tests.

Practice:
• A number of novel visualization methods.
• A heuristic for efficient non-myopic attribute selection.
• An interaction-centered machine learning method, Kikuchi-Bayes.
• A family of Bayesian priors for consistent modelling with interactions.
