Learning and Making Decisions When Costs and Probabilities are Both Unknown
by B. Zadrozny and C. Elkan

Contents
• Introduce the Problem
• Previous Work
• Direct Cost-Sensitive Decision-Making
• The Dataset
• Estimating Class Membership Probabilities
• Estimating Costs
• Results and Conclusions

Introduction
• Costs/benefits are the values assigned to classification decisions.
• Costs are often different for different examples.
• In cost-sensitive learning we are often interested in the rare class.
• Hence the problem of unbalanced data.

Cost Sensitive Decisions
• Each training and test example has an associated cost.
• General optimal prediction, where i is the predicted class and j the true class:
  \[ \arg\min_i \sum_j P(j \mid x)\, C(i, j, x) \]
• Methods differ in how they obtain \(P(j \mid x)\) and \(C(i, j, x)\).
• Previous literature has assumed that costs are known in advance and independent of examples:
  \[ C(i, j, x) = C(i, j, y) \quad \forall x, y \]

MetaCost
• Estimation of \(P(j \mid x)\)
  – Estimated in training only.
• Estimation of \(C(i, j)\)
  – Example-independent.
• Training relabels each training example with its optimal class.
• Learns a classifier to predict the labelling of test examples.

Direct Cost-Sensitive Decision-Making (DCSDM)
• Estimation of \(P(j \mid x)\)
  – Average of Naïve Bayes and decision trees.
  – Estimated on training and test sets.
• Estimation of \(C(i, j, x)\)
  – Multiple linear regression.
  – Unbiased estimate using the Heckman procedure.
  – Example-dependent.
• Evaluation
  – Evaluated against MetaCost and the KDD competition results.
  – Uses a large and difficult dataset, KDD '98.

MetaCost Implementation
• For evaluation, MetaCost is adapted
  – Class probability estimates are found by simple methods using decision trees.
  – Costs are made example-dependent during training.
• Adapted MetaCost vs. DCSDM
  – DCSDM uses two models on each test example; MetaCost uses one.
  – In DCSDM, \(P(j \mid x)\) is estimated for both training and test examples.
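
The general decision rule from the Cost Sensitive Decisions slide, arg min over i of the sum over j of P(j | x) C(i, j, x), can be sketched in a few lines. This is only an illustration: the function names `expected_costs` and `optimal_decision` are hypothetical, and the probabilities and cost matrices are made-up numbers, not values from the slides.

```python
def expected_costs(probs, costs):
    """Expected cost of each decision i for one example x.

    probs[j]    = P(j | x), the estimated probability of true class j
    costs[i][j] = C(i, j, x), the cost of deciding i when the truth is j
    """
    return [sum(p * c for p, c in zip(probs, row)) for row in costs]


def optimal_decision(probs, costs):
    """Index i that minimises sum_j P(j | x) * C(i, j, x)."""
    exp = expected_costs(probs, costs)
    return min(range(len(exp)), key=exp.__getitem__)


# Made-up illustration: two examples whose cost matrices differ, so the
# same rule can lead to different decisions (example-dependent costs).
example_a = optimal_decision([0.9, 0.1], [[0.0, 4.0], [1.0, 0.0]])
example_b = optimal_decision([0.6, 0.4], [[0.0, 10.0], [1.0, 0.0]])
```

With example-independent costs, the `costs` argument would be the same matrix for every example; the point of the direct approach is that it may vary with x.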

The Data Mining Task
• Data on persons who have donated in the past to a certain charity, from the KDD '98 competition
  – Donor and non-donor labelling is based on the last campaign.
• The task is to choose which donors to ask for new donations.
• Training/test set
  – 95,412 records, labelled as donors or non-donors, with donation amount.
  – 96,367 unlabelled records from the same donation campaign.

Data Mining Task cont.
• Cost of soliciting: $0.68
• Donations range from $1 to $200
• 5% donors and 95% non-donors
  – The very low response rate and varying donation amounts make it hard to beat soliciting everyone.
• The dataset is hard
  – It has already been filtered to be a reasonable set of prospects.
  – The task is to improve upon the unknown method that produced the set.

Applied DCSDM
• In KDD we will change \(C(i, j, x)\) to \(B(i, j, x)\)
  – Costs become benefits.

                       Actual non-donor   Actual donor
  Predict non-donor    0                  0
  Predict donor        -0.68              y(x) - 0.68

• \(B(1, 1, x)\) is example-dependent
  – Replaced by a constant in previous literature.

Optimal policy
• The expected benefit of not soliciting, i = 0:
  \[ P(j=0 \mid x)\, B(0,0,x) + P(j=1 \mid x)\, B(0,1,x) = 0 \]
• The expected benefit of soliciting, i = 1:
  \[ P(j=0 \mid x)\, B(1,0,x) + P(j=1 \mid x)\, B(1,1,x) \]
  \[ = (1 - P(j=1 \mid x))(-0.68) + P(j=1 \mid x)\,(y(x) - 0.68) \]
  \[ = P(j=1 \mid x)\, y(x) - 0.68 \]
• Optimal policy: solicit if and only if
  \[ P(j=1 \mid x)\, y(x) - 0.68 > 0 \]

\(P(j \mid x)\)
• Optimal decisions require \(P(j \mid x)\).
• Class sizes may be highly unbalanced.
• Two methods proposed
  – Decision trees: smoothing and curtailment.
  – Naïve Bayes: binning.

Problems w. Decision Trees
• Decision trees assign as a score to each leaf the raw training frequency of that leaf,
  \[ p = \frac{k}{n} \]
  where k is the number of positive training examples in the leaf and n is the total number of training examples in the leaf.
• High bias
  – Decision tree growing methods try to make leaves homogeneous, so the p's tend to be over- or under-estimates.
• High variance
  – When n is small, p is not to be trusted.

Smoothing
• Pruning is no good for us.
• To make the estimates less extreme, let's replace
  \[ p = \frac{k}{n} \quad \text{by} \quad p' = \frac{k + b \cdot m}{n + m} \]
• b is the base rate; m is a heuristic value (the smoothing strength).
• Effect
  – Where k and n are small, \(p'\) is essentially just the base rate.
  – If k and n are larger, \(p'\) is a combination of the base rate and the original score.

Smoothed Scores
[Figure: smoothed scores]

Curtailment
• What if the leaves have enough training examples to be statistically reliable?
  – Then smoothing seems to be unnecessary.
• Curtailment searches through the tree and removes nodes where n < v.
  – v is chosen either through cross-validation or by a heuristic, such as \(b \cdot v = 10\).

Curtailed Tree
[Figure: curtailed tree]

Curtailed Scores
[Figure: curtailed scores]

Naïve Bayes Classifiers
• Naïve Bayes
  – Assumes that within any class the attribute values are independent variables.
• This assumption gives inaccurate probability estimates.
• Attributes tend to be positively correlated, so naïve Bayes estimates tend to be too extreme, i.e. close to zero or one.
• Nevertheless, they do rank examples well, where n(x) is the naïve Bayes score of x:
  \[ \text{if } n(x) < n(y) \text{ then } P(j=1 \mid x) < P(j=1 \mid y) \]

Calibrating Naïve Bayes Scores
• The histogram method:
  – Sort the training examples by their naïve Bayes scores.
  – Divide the sorted set into b subsets of equal size, called bins.
  – For each bin, compute the lower and upper boundary naïve Bayes scores.
• Given a new data point x
  – Calculate n(x) and find the associated bin.
  – Let \(P(j \mid x)\) be the fraction of positive training examples in that bin.

Averaging Probability Estimates
• If probability estimates are partially uncorrelated, then averaging these estimates will reduce their variance.
• Assuming all estimates have the same variance, the average estimate has a variance given by
  \[ \sigma_{\text{avg}}^2 = \frac{1 + \rho\,(N - 1)}{N}\, \sigma^2 \]
  – \(\sigma^2\): the individual classifier variance.
  – \(N\): the number of classifiers.
  – \(\rho\): the correlation factor among all classifiers.

Estimating Donation Amount
• Solicit the person based on the policy.
• Policy: \(P(j=1 \mid x)\, y(x) - 0.68 > 0\), where \(y(x)\) is the estimated donation amount.

Cost and Probability
• Good decisions
  – Estimating cost well is more important than estimating probabilities.
• Why?
  – The relative variation of cost across different examples is much greater than the relative variation of probabilities.
• Probability
  – Estimating donation probability is difficult.
• Estimating donation amounts is easier because past amounts are excellent predictors of future amounts.

Training and Test data
• Two random processes
  – Whether to donate or not.
  – How much to donate (the donation amount).
• Donation amount:

                   Donor      Non-donor
  Training data    Known      -
  Test data        Unknown    Unknown

• The method used for estimating the donation amount is multiple linear regression (MLR).

Multiple Linear Regression
• Two attributes are used
  – lastgift: dollar amount of the most recent gift.
  – ampergift: average gift amount in response to the last 22 promotions.
• A linear regression equation is used to estimate the donation amount.
• 46 of the 4843 recorded donations are for more than $50.
• Donors who have donated at most $50 are used as input for the linear regression.

Problem of Sample Selection Bias
• Reasoning outside your learning space.
• Donation amount:

                   Donor      Non-donor
  Training data    Known      -
  Test data        Unknown    Unknown

• Estimating donation amount
  – Any donation estimator is learned on the basis of people who actually donated.
  – This estimator is then applied to a different population consisting of both donors and non-donors.

Donation Amount and Probability Estimates are Negatively Correlated
[Figure: donation amount vs. probability estimates]

Solution to Sample Selection Bias
• Heckman's procedure
  – Estimate the conditional probability \(P(j=1 \mid x)\) using a linear probit model.
  – Estimate \(y(x)\) on the training examples for which \(j(x) = 1\), including for each x a transformation of the estimated conditional probability.
• Their own procedure
  – The conditional probability is learned using a decision tree or naïve Bayes classifier.
  – These probability estimates are added as additional attributes when estimating \(y(x)\).

Experimental Results
[Results tables: direct cost-sensitive decision-making; MetaCost]

Experimental Results Interpretation
• With Heckman
  – Profit on the test set increases by $484, for all probability estimation methods.
  – The systematic improvement indicates that Heckman's procedure solves the problem of sample selection bias.
• MetaCost
  – Best result of MetaCost is $14113.
  – Best result of the direct cost-sensitive method is $15329.
  – On average, the profit achieved by MetaCost on the test set is $1751 lower than the profit achieved by direct cost-sensitive decision-making.

Statistical Significance of Results
• 4872 donors in the fixed test set.
• Average donation of $15.62.
• For a different test set drawn randomly from the same probability distribution, we would expect the number of donors to have a standard deviation of sqrt(4872).
• This fluctuation causes a change in profit of about $1090: sqrt(4872) * 15.62 = $1090.
• A profit difference between two methods of less than $1090 is not significant.

Conclusions
• Direct cost-sensitive decision-making performs better than MetaCost on this task.
• It provides a solution to the fundamental problem of costs being example-dependent.
• It identifies and solves the problem of sample selection bias for the KDD '98 dataset.

Questions?
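
As a check on the arithmetic from the Statistical Significance of Results slide, the threshold can be recomputed directly from the two figures given there. The function name `significance_threshold` is hypothetical, and comparing the two best reported profits against the threshold is an illustration added here, not a claim made on the slides.

```python
import math


def significance_threshold(n_donors, mean_donation):
    """Approximate profit fluctuation: sqrt(number of donors) * average donation."""
    return math.sqrt(n_donors) * mean_donation


# Figures taken from the slides: 4872 test-set donors, $15.62 average donation.
threshold = significance_threshold(4872, 15.62)

# Illustration only: gap between the best reported profits of the two methods.
best_direct = 15329
best_metacost = 14113
gap = best_direct - best_metacost
exceeds_threshold = gap > threshold
```

Rounded to the nearest dollar, `threshold` reproduces the $1090 quoted on the slide.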