        Learning and Making Decisions When
      Costs and Probabilities are Both Unknown

          by B. Zadrozny and C. Elkan
                 Contents
•   Introduction to the Problem
•   Previous Work
•   Direct Cost-Sensitive Decision Making
•   The Dataset
•   Estimating Class Membership Probabilities
•   Estimating Costs
•   Results and Conclusions
              Introduction
• Costs/Benefits are the values assigned to
  classification decisions.
• Costs are often different for different
  examples
• Often we are interested in the rare class in
  cost-sensitive learning
• Hence the problem of unbalanced data
     Cost Sensitive Decisions
• Each training and test example has an associated
  cost
• General optimal prediction
         arg min_i  Σ_j  P(j | x) C(i, j, x)
• Methods differ w.r.t. P( j | x) and C (i, j , x)
• Previous literature has assumed costs are known in
  advance and independent of examples:

         C(i, j, x) = C(i, j, y)   for all x, y
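To make the decision rule concrete, here is a minimal Python sketch (not from the paper; the function and variable names are illustrative): given estimates P(j | x) and an example-dependent cost function C(i, j, x), choose the prediction i with the lowest expected cost.

# Minimal sketch of the rule  argmin_i  sum_j P(j | x) C(i, j, x).
# 'probs' maps each class j to its estimated P(j | x); 'cost' is C(i, j, x).

def optimal_prediction(probs, cost, x):
    """Return the prediction i with the lowest expected cost for example x."""
    classes = list(probs.keys())
    expected = {i: sum(probs[j] * cost(i, j, x) for j in classes)
                for i in classes}
    return min(expected, key=expected.get)

# Toy example with an example-dependent cost of missing class 1.
probs = {0: 0.9, 1: 0.1}
cost = lambda i, j, x: 0.0 if i == j else (x["amount"] if j == 1 else 1.0)
print(optimal_prediction(probs, cost, {"amount": 20.0}))   # predicts 1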
                    MetaCost
• Estimation of P( j | x)
   – Estimated in training only.
• Estimation of C (i, j )
   – Example independent
• Training relabels each training example with its
  estimated optimal class
• Learns a classifier on the relabelled data to predict
  the labels of test examples
Direct Cost-Sensitive Decision Making
• Estimation of P( j | x)
   – Average of naïve Bayes and decision tree estimates.
   – Estimated on both training and test sets.
• Estimation of C(i, j, x)
   – Multiple linear regression.
   – Unbiased estimates using Heckman's procedure.
   – Example dependent.
• Evaluation
   – Evaluated against MetaCost and the KDD competition
     results.
   – Uses a large and difficult dataset (KDD '98).
    MetaCost Implementation
• For the evaluation, MetaCost is adapted
  – Class probability estimates are found by simple methods
    using decision trees
  – Costs are made example dependent during training
• Adapted MetaCost vs. DCSDM
  – DCSDM uses two models on each test example;
    MetaCost uses one.
  – In DCSDM, P(j | x) is estimated on both training and
    test examples.
      The Data Mining Task
• Data on persons who have donated in
  the past to a certain charity, KDD '98
  competition
  – Non-donor and donor labelling based on
    last campaign
• The task is to choose which donors to
  ask for new donations
• Training/Test set
  – 95,412 records, labelled as donors or non-donors,
    with donation amounts
  – 96,367 unlabelled records from the same donation
    campaign
     Data Mining Task cont.
• Cost of soliciting $0.68
• Donations range from $1-200
• 5% donors and 95% non-donors
  – The very low response rate and the varying donation
    amounts make it hard to beat soliciting everyone.
• The dataset is hard
  – Already been filtered to be a reasonable set
    of prospects
  – The task is to improve upon the unknown
    method that produced the set
              Applied DCSDM
• For the KDD task we change C(i,j,x) to B(i,j,x)
   – Costs become benefits

                      Actual              Actual
                      Non-donor           Donor
  Predict Non-donor   0                   0
  Predict Donor       -0.68               y(x) - 0.68

• B(1,1,x) is example dependent
   – Previous literature replaced it with a constant
                Optimal policy
• The expected benefit of not soliciting, i = 0
 P( j  0 | x) B(0,0, x)  P( j  1 | x) B(0,1, x)  0
• Expected benefit of soliciting, i = 1
  P( j  0 | x) B(1,0, x)  P( j  1 | x) B(1,1, x)
   (1  P( j  1 | x))(0.68)  P( j  1 | x)( y ( x)  0.68)
   P( j  1 | x) y ( x)  0.68

• Optimal policy: solicit if and only if  P(j=1 | x) y(x) - 0.68 > 0
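A minimal sketch of this policy in Python (variable names are assumptions, not from the paper): solicit exactly when the expected benefit of soliciting exceeds the zero benefit of not soliciting.

SOLICITATION_COST = 0.68   # dollars, from the KDD '98 task

def expected_benefit_of_soliciting(p_donate, y_hat):
    # (1 - p)(-0.68) + p(y(x) - 0.68)  ==  p * y(x) - 0.68
    return p_donate * y_hat - SOLICITATION_COST

def should_solicit(p_donate, y_hat):
    # Not soliciting has expected benefit 0, so solicit when this is positive.
    return expected_benefit_of_soliciting(p_donate, y_hat) > 0

print(should_solicit(0.05, 15.62))   # True: 0.05 * 15.62 = 0.781 > 0.68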
            Estimating P(j | x)
• Optimal decisions require P(j | x)
• Class sizes may be highly unbalanced
• Two methods proposed
  – Decision trees: smoothing and curtailment
  – Naïve Bayes: binning
   Problems w. Decision Trees
• Decision trees assign as a score to each leaf
  the raw training frequency of that leaf.
                        p = k / n
     (k positives among the n training examples in the leaf)
• High bias
  – Decision tree growing methods try to make leaves
    homogeneous, so the p's tend to be over- or under-estimates
• High variance
  – When n is small, p is not to be trusted
                        Smoothing
• Pruning does not help here.
• To make the estimates less extreme, replace:

                  p = k / n    →    p' = (k + b·m) / (n + m)
• b – base rate, m – heuristic value (smoothing strength)
• Effect (sketched below)
   – Where k and n are small, p' is essentially just the base rate.
   – Where k and n are larger, p' is a combination of the base rate
     and the original score.
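A minimal sketch of the smoothed estimate (the argument names are mine, not the paper's):

def smoothed_leaf_score(k, n, base_rate, m):
    """Shift the raw leaf frequency k/n toward the base rate b:
    p' = (k + b*m) / (n + m), where m controls the smoothing strength."""
    return (k + base_rate * m) / (n + m)

# Small leaf: the score is pulled almost all the way to the base rate.
print(smoothed_leaf_score(k=2, n=3, base_rate=0.05, m=200))     # ~0.059
# Large leaf: the raw frequency dominates.
print(smoothed_leaf_score(k=40, n=400, base_rate=0.05, m=200))  # ~0.083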
Smoothed Scores (figure)
               Curtailment
• What if the leaves have enough training
  examples to be statistically reliable?
  – Then smoothing seems to be unnecessary.
• Curtailment searches through the tree and
  removes nodes where n < v.
  – v is chosen either through cross-validation or a
    heuristic such as b·v = 10.
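A minimal sketch of curtailment at prediction time (the dict-based tree representation is an assumption for illustration, not the paper's implementation): an example stops at the last node on its path whose training count n is at least v, and that node's score is returned.

# Each node stores its training count n, its probability score, and
# (for internal nodes) a test plus children keyed by the test outcome.

def curtailed_score(node, example, v):
    """Descend the tree, but stop before any child with fewer than v examples."""
    child = None
    if "test" in node:
        child = node["children"].get(node["test"](example))
    if child is None or child["n"] < v:
        return node["score"]               # curtail here
    return curtailed_score(child, example, v)

tree = {
    "n": 1000, "score": 0.05, "test": lambda x: x["lastgift"] > 20,
    "children": {
        True:  {"n": 6,   "score": 0.50},  # too few examples: never used directly
        False: {"n": 994, "score": 0.04},
    },
}
print(curtailed_score(tree, {"lastgift": 50}, v=10))   # 0.05, not 0.50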
Curtailed Tree (figure)
Curtailed Scores (figure)
       Naïve Bayes Classifiers
• Naïve Bayes
   – Assumes that within any class the attribute values are
     independent variables.
• This assumption gives inaccurate probability
  estimates
• But, attributes tend to be positively correlated so
  naïve Bayes estimates tend to be too extreme,
  i.e. close to zero or one.
• But they do rank examples well:
        if n(x) ≥ n(y) then P(j=1 | x) ≥ P(j=1 | y)
  where n(x) is the naïve Bayes score of x
Calibrating Naïve Bayes Scores
• The Histogram method:
   – Sort training examples by n.b. scores
   – Divide sorted set into b subsets of equal size,
     called bins
   – For each bin compute lower and upper
     boundary n.b. scores
• Given a new data point x
   – Calculate n(x) and find the associated bin
   – Let P(j=1 | x) = the fraction of positive training
     examples in that bin
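A minimal sketch of this binning calibration in Python (helper names are mine; it assumes at least as many training examples as bins): sort training examples by their naïve Bayes scores, split them into equal-size bins, and map a new score to the fraction of positives in its bin.

def fit_bins(scores, labels, n_bins=10):
    """Return (bin upper boundaries, fraction of positives per bin)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(order) // n_bins
    bins = [order[b * size:(b + 1) * size] for b in range(n_bins - 1)]
    bins.append(order[(n_bins - 1) * size:])        # last bin takes the remainder
    uppers = [max(scores[i] for i in b) for b in bins]
    fractions = [sum(labels[i] for i in b) / len(b) for b in bins]
    return uppers, fractions

def calibrated_probability(score, uppers, fractions):
    """P(j=1 | x): positive fraction of the first bin covering the score."""
    for upper, frac in zip(uppers, fractions):
        if score <= upper:
            return frac
    return fractions[-1]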
Averaging Probability Estimates
• If probability estimates are partially uncorrelated
  then it follows that averaging these estimates will
  reduce their variance.
• Assuming all estimates have the same variance, the
  averaged estimate has variance

          Var(average) = (1 + ρ(N − 1)) σ² / N

       σ²  –  the individual classifier variance
       N   –  the number of classifiers
       ρ   –  the correlation among the classifiers
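A minimal sketch of this averaging step and of the variance formula above (the numeric values are illustrative, not from the paper):

def averaged_probability(estimates):
    """Average several calibrated estimates of P(j=1 | x)."""
    return sum(estimates) / len(estimates)

def variance_of_average(sigma2, n, rho):
    """Var(average) = (1 + rho*(n - 1)) * sigma^2 / n."""
    return (1 + rho * (n - 1)) * sigma2 / n

p_tree, p_nb = 0.06, 0.10      # curtailed-tree and binned naive Bayes estimates
print(averaged_probability([p_tree, p_nb]))       # 0.08
print(variance_of_average(0.01, n=2, rho=0.5))    # 0.0075 < 0.01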
   Estimating Donation Amount
• Solicit a person according to the optimal policy:

          P(j=1 | x) y(x) - 0.68 > 0

  where y(x) is the estimated donation amount
          Cost and Probability
• Good Decisions
  – Estimating costs well is more important than
    estimating probabilities well.
• Why?
  – Relative variation of cost across different examples is
    much greater than the relative variation of
    probabilities
• Probability
  – Estimating donation probability is difficult.
• Estimating donation amounts is easier because past
  amounts are excellent predictors of future amounts.
       Training and Test data
• Two random processes
  – Whether or not to donate.
  – How much to donate?
  Donation amount    Donor        Non-donor
  Training data      known        -
  Test data          unknown      unknown

• The method used for estimating the donation amount is
  multiple linear regression (MLR).
    Multiple Linear Regression
• Two attributes are used
  – lastgift: dollar amount of the most recent gift.
  – ampergift: average gift amount in response to the last
    22 promotions.
• A linear regression equation is used to estimate the
  donation amount (see the sketch below).
• Only 46 of the 4,843 recorded donations exceed $50.
• Donors who donated at most $50 are used as input for
  the linear regression.
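Here is a minimal sketch of this regression using NumPy (the data values are invented; only the two attributes named above are used): fit y(x) ≈ w0 + w1·lastgift + w2·ampergift on donors who gave at most $50.

import numpy as np

# Toy training data: attributes of past donors who gave at most $50,
# and the amount each donated in the last campaign.
lastgift  = np.array([10.0, 25.0,  5.0, 40.0, 15.0])
ampergift = np.array([ 8.0, 20.0,  6.0, 35.0, 12.0])
amount    = np.array([12.0, 22.0,  7.0, 38.0, 14.0])

# Design matrix with an intercept column; ordinary least squares.
X = np.column_stack([np.ones_like(lastgift), lastgift, ampergift])
w, *_ = np.linalg.lstsq(X, amount, rcond=None)

def estimated_donation(lastgift_value, ampergift_value):
    """y(x) = w0 + w1 * lastgift + w2 * ampergift."""
    return w[0] + w[1] * lastgift_value + w[2] * ampergift_value

print(estimated_donation(20.0, 18.0))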
   Problem of Sample Selection
               Bias
• Reasoning outside your learning space.
• Donation Amount
                    Donor        Non-donor
  Training data     known        -
  Test data         unknown      unknown

• Estimating donation amount
  – Any donation estimator is learned from the people who
    actually donated.
  – This estimator is then applied to a different population
    consisting of donors and non-donors.
Donation Amount and Probability
   Estimates are Negatively
          Correlated
  Solution to Sample Selection
               Bias
• Heckman’s procedure
  – Estimate the conditional probability P(j=1 | x) using a
    linear probit model.
  – Estimate y(x) on the training examples with j = 1,
    including for each x a transformation of the estimated
    conditional probability as an extra variable.
• The authors' own procedure
  – The conditional probability is learned using a decision
    tree or naïve Bayes classifier.
  – These probability estimates are added as additional
    attributes when estimating y(x).
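A minimal sketch of the authors' variant, continuing the regression sketch above (the array names and values are mine): the estimated P(j=1 | x) is appended as an extra attribute before y(x) is fitted on the donors.

import numpy as np

def fit_donation_model(lastgift, ampergift, p_donate, amount):
    """Fit y(x) on donors only, with the estimated P(j=1 | x) added as an
    extra attribute to compensate for the non-random donor subsample."""
    X = np.column_stack([np.ones_like(lastgift), lastgift, ampergift, p_donate])
    w, *_ = np.linalg.lstsq(X, amount, rcond=None)
    return w

def predict_donation(w, lastgift_value, ampergift_value, p_donate_value):
    return w @ np.array([1.0, lastgift_value, ampergift_value, p_donate_value])

w = fit_donation_model(np.array([10., 25.,  5., 40., 15.]),
                       np.array([ 8., 20.,  6., 35., 12.]),
                       np.array([0.05, 0.07, 0.04, 0.09, 0.06]),
                       np.array([12., 22.,  7., 38., 14.]))
print(predict_donation(w, 20.0, 18.0, 0.06))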
         Experimental Results
• Direct cost-sensitive decision making (results table)
• MetaCost (results table)
         Experimental Results
            Interpretation
• With Heckman's procedure
  – Profit on the test set increases by $484 for all
    probability estimation methods.
  – This systematic improvement indicates that Heckman's
    procedure solves the problem of sample selection bias.
• MetaCost
  – The best MetaCost result is $14,113.
  – The best direct cost-sensitive result is $15,329.
  – On average, the profit achieved by MetaCost on the test
    set is $1,751 lower than the profit achieved by direct
    cost-sensitive decision making.
 Statistical Significance of Results
• 4,872 donors in the fixed test set
• Average donation of $15.62
• A different test set drawn randomly from the same
  probability distribution would show a standard deviation of
  about sqrt(4872) in the number of donors
• That fluctuation changes profit by about $1,090:
             sqrt(4872) × 15.62 ≈ $1,090
• A profit difference between two methods of less than
  $1,090 is not significant.
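A quick check of this arithmetic (a one-line computation, not the paper's code):

import math
donors, avg_donation = 4872, 15.62
print(math.sqrt(donors) * avg_donation)   # ~1090.3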
               Conclusions
• Direct cost-sensitive decision making is better than MetaCost.
• It provides a solution to the fundamental problem of costs
  being example dependent.
• It identifies and solves the problem of sample selection bias
  for the KDD '98 dataset.
Questions?

				