Ronny Kohavi, ICML 1998

What is Data Mining?

[Diagram: a research question leads to finding data (internal databases, data warehouses, Internet, online databases); the steps shown include data collection, data processing, extracting information, data analysis, and the answer.]

Outline
• Data Mining
• Methodology for CART
• Data mining trees, tree sketches
• Applications to clinical data
    • Categorical Response - OAB data
    • Continuous Response - Geodon RCT data
• Robustness issues

Moore's law:
+
• Processing "capacity" doubles every couple of years (exponential).
• Hard disk storage "capacity" doubles every 18 months (it used to be every 36 months).
-
• Bottlenecks are not speed anymore.
• Processing capacity is not growing as fast as data acquisition.

Mining Data
50's-80's: EDA, Data Visualization.
1990: Ripley: "that is not statistics, that's data mining"
90's-06: Data Mining: Large Datasets, EDA, DV, Machine Learning, Vision, …

Example: Biopharmaceutical Area.
Data repositories:
- Data from many clinical, monitoring, and marketing sources.
- Data is largely unexplored.
Data mining objective: "To extract valuable information." "To identify nuggets, clusters of observations in these data that contain potentially valuable information."
Example: Biopharmaceutical Data:
- Extract new information from existing databases.
- Answer questions of clinicians, marketing.
- Help design new studies.

Data Mining Software and Recursive Partition
SOFTWARE:
Splus / Insightful Miner: Tree, CART, C4.5, BG, RF, BT
R: Rpart, CART, BG, RF, BT
SAS / Enterprise Miner: CART, C4.5, CHAID, Tree browser
SPSS: CART, CHAID
Clementine: CART, C4.5, Rule Finder
HelixTree: CART, FIRM, BG, RF, BT

Recursive Partition (CART)
I. Dependent variable is categorical
• Classification Trees, Decision Trees
Example: A doctor might have a rule for choosing which drug to prescribe to high cholesterol patients.
II. Dependent variable is numerical
• Regression Tree
Dose = Function(BMI, AGE)

[Figure: example trees and the corresponding partition of the AGE-BMI plane. Node labels include High Blood Pressure?, AGE<65, Age>60, Age<30, BMI<24 and AGE<18; leaves include Drug A, Drug B and the values 20, 40 and 80.]

Classic Example of CART: Pima Indians Diabetes
• 768 Pima Indian females, 21+ years old; 268 tested positive for diabetes
• 8 predictors: PRG, PLASMA, BP, THICK, INSULIN, BODY, PEDIGREE, AGE
• OBJECTIVE: PREDICT DIABETES

Node          CART    N     P(Diabetes)
Combined      993.5   768   35%
PLASMA<=127   854.3   485   19%
PLASMA>127            283   61%
AGE<=28       916.3   367   19%
AGE>28                401   49%
BODY<=27.8    913.7   222   12%
BODY>27.8             546   44%

[Figure: tree sketch with a root split on PLASMA at 127 and second-level splits on AGE at 28 and BODY at 29.9.]
[Figure: three panels plotting the diabetes response (RESP) against AGE, PLASMA and BODY.]

Classic Example of CART: Pima Indians Diabetes
CART Algorithm
• Grow tree
• Stop when node sizes are small
• Prune tree

[Figure: fitted classification tree. Root split PLASMA<127.5; second-level splits AGE<28.5 and BODY<29.95; further splits BODY<30.95, BODY<26.35, PLASMA<145.5, PLASMA<157.5, PLASMA<99.5, AGE<30.5, PEDIGREE<0.561 and BP<61; terminal-node values range from 0.01325 to 1.00000.]

CART criteria functions
For regression trees:
Equal variances (CART):
\[ h = \frac{N_L \hat{\sigma}_L^2 + N_R \hat{\sigma}_R^2}{N_L + N_R} \]
Non-equal variances:
\[ h = \frac{N_L \log \hat{\sigma}_L^2 + N_R \log \hat{\sigma}_R^2}{N_L + N_R} \]
For classification trees:
\[ h = p_L \min(p_L^0, p_L^1) + p_R \min(p_R^0, p_R^1) \]
\[ h = -p_L (p_L^0 \log p_L^0 + p_L^1 \log p_L^1) - p_R (p_R^0 \log p_R^0 + p_R^1 \log p_R^1) \quad \text{(C5)} \]
\[ h = p_L \, p_L^0 \, p_L^1 + p_R \, p_R^0 \, p_R^1 \quad \text{(CART)} \]
Here N_L and N_R are the sizes of the left and right child nodes, \hat{\sigma}_L^2 and \hat{\sigma}_R^2 the within-node variances, p_L and p_R the fractions of observations sent left and right, and p^0, p^1 the class proportions within a node.

DATA PREPROCESSING RECOMMENDATIONS FOR TREES
a. Make sure that all the factors are declared as factors.
Sometimes factor variables are read into R as numeric or as character variables. Suppose that a variable RACE in a SAS dataset is coded as 1, 2, 3, 4, representing 4 race groups. We need to be sure that it was not read as a numeric variable, so we first check the types of the variables. We may use the functions "class" and "is.factor" combined with "sapply" in the following way:
    sapply(w, is.factor)
or
    sapply(w, class)
Suppose that the variable "x" is numeric when it is supposed to be a factor. Then we convert it into a factor:
    w$x = factor(w$x)

b. Recode factors:
Sometimes the codes assigned to factor levels are very long phrases, and when those codes are inserted into the tree the resulting graph can be very messy. We prefer to use short words to represent the codes. To recode the factor levels you may use the function "f.recode":
    > levels(w$Muscle)
    [1] ""                  "Mild Weakness"
    [3] "Moderate Weakness" "Normal"
    > musc = f.recode(w$Muscle, c("", "Mild", "Mod", "Norm"))
    > w$Musclenew = musc

Example Hospital data
    library(rpart)
    hospital = read.table("project2/hospital.txt", sep=",")
    colnames(hospital) <- c("ZIP","HID","CITY","STATE","BEDS","RBEDS","OUTV","ADM",
                            "SIR","SALESY","SALES12","HIP95","KNEE95","TH","TRAUMA",
                            "REHAB","HIP96","KNEE96","FEMUR96")
    # Drop the identifier columns (ZIP, HID, CITY, STATE) and SALESY
    hosp = hospital[, -c(1:4, 10)]
    # Declare the indicator variables as factors
    hosp$TH = factor(hosp$TH)
    hosp$TRAUMA = factor(hosp$TRAUMA)
    hosp$REHAB = factor(hosp$REHAB)
    # Regression tree with complexity parameter cp = 0.01
    u <- rpart(log(1+SALES12) ~ ., data=hosp, control=rpart.control(cp=.01))
    plot(u); text(u)
    # Larger tree: a smaller cp allows more splits
    u = rpart(log(1+SALES12) ~ ., data=hosp, control=rpart.control(cp=.001))
    plot(u, uniform=TRUE); text(u)

Regression Tree for log(1+Sales)
    HIP95 < 40.5 [Ave: 1.074, Effect: -0.76 ]
      HIP96 < 16.5 [Ave: 0.775, Effect: -0.298 ]
        RBEDS < 59 [Ave: 0.659, Effect: -0.117 ]
          HIP95 < 0.5 [Ave: 1.09, Effect: +0.431 ] -> 1.09
          HIP95 >= 0.5 [Ave: 0.551, Effect: -0.108 ]
            KNEE96 < 3.5 [Ave: 0.375, Effect: -0.175 ] -> 0.375
            KNEE96 >= 3.5 [Ave: 0.99, Effect: +0.439 ] -> 0.99
        RBEDS >= 59 [Ave: 1.948, Effect: +1.173 ] -> 1.948
      HIP96 >= 16.5 [Ave: 1.569, Effect: +0.495 ]
        FEMUR96 < 27.5 [Ave: 1.201, Effect: -0.368 ] -> 1.201
        FEMUR96 >= 27.5 [Ave: 1.784, Effect: +0.215 ] -> 1.784
    HIP95 >= 40.5 [Ave: 2.969, Effect: +1.136 ]
      KNEE95 < 77.5 [Ave: 2.493, Effect: -0.475 ]
        BEDS < 217.5 [Ave: 2.128, Effect: -0.365 ] -> 2.128
        BEDS >= 217.5 [Ave: 2.841, Effect: +0.348 ]
          OUTV < 53937.5 [Ave: 3.108, Effect: +0.267 ] -> 3.108
          OUTV >= 53937.5 [Ave: 2.438, Effect: -0.404 ] -> 2.438
      KNEE95 >= 77.5 [Ave: 3.625, Effect: +0.656 ]
        SIR < 9451 [Ave: 3.213, Effect: -0.412 ] -> 3.213
        SIR >= 9451 [Ave: 3.979, Effect: +0.354 ] -> 3.979

Regression Tree
[Figure: plotted regression tree with splits HIP95<2.52265, HIP96<2.01527, KNEE95<2.96704, RBEDS<2.77141, FEMUR96<2.28992, BEDS<3.8403, SIR<9.85983, HIP95<0.5, ADM<4.87542, OUTV<15.2396 and KNEE96<1.36514.]

Classification tree:
    > data(tissue)
    > gr = rep(1:3, c(11, 11, 19))
    > x <- f.pca(f.toarray(tissue))$scores[,1:4]
    > x = data.frame(x, gr=gr)
    > library(rpart)
    > tr = rpart(factor(gr) ~ ., data=x)
    > tr
    n= 41

    node), split, n, loss, yval, (yprob)
          * denotes terminal node

    1) root 41 22 3 (0.26829268 0.26829268 0.46341463)
      2) PC3< -0.9359889 23 12 1 (0.47826087 0.47826087 0.04347826)
        4) PC2< -1.154355 12 1 1 (0.91666667 0.00000000 0.08333333) *
        5) PC2>=-1.154355 11 0 2 (0.00000000 1.00000000 0.00000000) *
      3) PC3>=-0.9359889 18 0 3 (0.00000000 0.00000000 1.00000000) *
    > plot(tr)
    > text(tr)

[Figure: plotted classification tree with splits PC3< -0.936 and PC2< -1.154 and leaves 1, 2 and 3.]

Random forest Algorithm (A variant of bagging)
1. Select ntree, the number of trees to grow, and mtry, a number no larger than the number of variables.
2. For i = 1 to ntree:
    3. Draw a bootstrap sample from the data. Call those not in the bootstrap sample the "out-of-bag" data.
    4. Grow a "random" tree, where at each node the best split is chosen among mtry randomly selected variables. The tree is grown to maximum size and not pruned back.
    5. Use the tree to predict out-of-bag data.
6. In the end, use the predictions on out-of-bag data to form majority votes.
7. Prediction of test data is done by majority votes from the predictions of the ensemble of trees.
R-package: randomForest, with a function also called randomForest

Boosting (Ada boosting)
Input: Data (xi, yi), i=1,…,n; wi = 1/n
1. Fit a tree or any other learning method: h1(xi)
2. Calculate the misclassification error E1
3. If E1 > 0.5, stop and abort the loop
4. b1 = E1/(1 - E1)
5. For i=1,…,n: if h1(xi) = yi then wi = wi b1, else wi = wi
6. Normalize the wi's to add up to 1.
7. Go back to 1. and repeat until there is no change in prediction error.
R-package: bagboost, with functions called bagboost and adaboost

Boosting (Ada boosting)
    i = sample(nrow(hosp), 1000, replace=FALSE)
    # Drop the identifier columns, SALESY and the response SALES12
    xlearn = f.toarray(hospital[, -c(1:4, 10:11)])
    # Binary response: 1 if SALES12 exceeds 50
    ylearn = 1*(hospital$SALES12 > 50)
    # Hold out the sampled rows as the test set
    xtest = xlearn[i,]
    xlearn = xlearn[-i,]
    ytest = ylearn[i]
    ylearn = ylearn[-i]
    ## BOOSTING EXAMPLE
    u = bagboost(xlearn[1:100,], ylearn[1:100], xtest, presel=0, mfinal=20)
    summarize(u, ytest)
    ## RANDOM FOREST EXAMPLE
    library(randomForest)
    u = randomForest(xlearn[1:100,], ylearn[1:100], xtest, ytest)
    round(importance(u), 2)

Competing methods
Recursive Partition: Find the partition that best approximates the response. For moderate/large datasets the partition tree may be too big.
Data Mining Trees
Bump Hunting: Find subsets that optimize some criterion. Subsets are more "robust". Not all interesting subsets are found.

Paradigm for data mining: Selection of interesting subsets
[Diagram: the Var 1 by Var 2 data space divided into High Resp, Low Resp and Other Data regions, with the labels Recursive Partition, Data Mining Trees and Bump Hunting.]

Data Mining Trees
ARF (Active Region Finder)
Naive thought: For the jth descriptor variable xj, an "interesting" subset {a < xji < b} is one such that
    p = Prob[ Z=1 | a < xji < b ]
is much larger than
    \pi = Prob[ Z=1 ].
\[ T = \frac{p - \pi}{\sigma_p} \]
measures how interesting a subset is.
Add a penalty term to prevent selection of subsets that are too small or too large.
[Figure: y (0 to 1) plotted against x (-2 to 2), with the interval from a to b marked.]

ARF algorithm diagram
- Create NodeList with one node = FullData
- Set NodeType = FollowUp
- Set CurrentNode = 1
- Split CurrentNode into a Left Bucket, a Center Bucket and a Right Bucket.
    - Left Bucket and Right Bucket: BucketSize > Min? Yes: NodeType = Followup. No: NodeType = Terminal. BucketSize > 0? Y: add node to NodeList.
    - Center Bucket: Is the split significant (T or F)? BucketSize > Min? Yes: NodeType = Followup. No: NodeType = Terminal. Add bucket to NodeList.
- Set CurrentNode = CurrentNode + 1
- If CurrentNode > LastNode: print report and EXIT.
- Otherwise check whether NodeType = Terminal (Y/N) before splitting the next node.

Comparing CART & ARF
ARF: Captures the subset with small variance (but not the rest).
CART: Needs both subsets to have small variance relative to the mean difference.
ARF: Captures interior subsets.
[Figure: panels for y1, y3 and y5 plotted against x (20 to 80).]

Two Examples
Subsets that are hidden in the middle
Point density is important
[Figure: two scatter plots; axis labels include Non-respondant, Poor, Pain Scale and DURATIL.]

Methodology
1. Methodology Objective: The data space is divided between
    - High response subsets
    - Low response subsets
    - Other
[Diagram: the Var 1 by Var 2 data space divided into High Resp, Low Resp and Other Data regions.]
2. Categorical Responses: Subsets that have high response on one of the categories.
\[ T = \frac{p - \pi}{\sigma_p} \]
3. Continuous Responses: High mean response measured by
\[ Z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \]
4. Statistical significance should be based on the entire tree building process.
5. Categorical Predictors
6. Data Visualization
7. PDF report.

Report
Simple Tree or Tree sketch: Only statistically significant nodes.
Full Tree: All nodes.
Table of Numerical Outputs: Detailed statistics of each node.
List of Interesting Subsets: List of significant subsets.
Conditional Scatter Plot (optional): Data Visualization.

How about outliers?
For Regression trees
- Popular belief: Trees are not affected by outliers (are robust).
- Outlier detection: Run the data mining tree allowing for small buckets. For observation Xi in terminal node j calculate the score
\[ Z_i = \frac{|X_i - \mathrm{Median}|}{\mathrm{MAD}} \]
Zi is a robust analogue of the number of standard deviations away from the mean. If Zi > 3.5 then Xi is noted as an outlier.
[Figure: histograms (Frequency) for the node containing outlier no. 1, the node containing outliers no. 2 and 3, and the node containing outlier no. 4.]
[Figure: tree with outliers, and tree after outlier removal.]

Robustness issues
ISSUE: In regulatory environments outliers are rarely omitted.
Our method is easily adaptable to robust splits by calculating the robust version of the criterion, replacing the mean and std dev by suitable estimators of location and scale:
\[ Z = \frac{T - T_R}{\sigma_{T_R}} \]
Binary/Categorical Response
- How do we think about robustness of trees?
- One outlier might not make any difference.
- 5%, 10% or more outliers could make a difference.

Further work
Binary/Categorical Response
- How do we think about robustness of trees?
- One outlier might not make any difference.
- 5%, 10% or more outliers could make a difference.

Alvir, Cabrera, Caridi and Nguyen (2006) Mining Clinical Trial data.
R-package: www.rci.rutgers.edu/DM/ARF_1.0.zip
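
Illustration: outlier score per terminal node
The outlier rule on the "How about outliers?" slide (score each observation by its distance from the node median in MAD units, flag scores above 3.5) can be sketched in base R as follows. This is only a minimal sketch under stated assumptions: the function name mad_outlier_score, its arguments and the toy data are hypothetical and not part of the ARF package, and the raw (unscaled) MAD is assumed because the slides do not say whether the MAD is rescaled.

```r
# Sketch of the score Z_i = |X_i - Median| / MAD, computed within each
# terminal node. Hypothetical helper, not part of the ARF package.
mad_outlier_score <- function(x, node, cutoff = 3.5) {
  # Per-node median, repeated for every observation in that node
  med <- ave(x, node, FUN = median)
  # Per-node MAD; constant = 1 gives the raw median absolute deviation
  # (assumption: the slides do not state a scaling constant)
  md <- ave(x, node, FUN = function(v) mad(v, constant = 1))
  z <- abs(x - med) / md
  data.frame(x = x, node = node, z = z, outlier = z > cutoff)
}

# Toy data: two terminal nodes, one gross outlier (80) in node "A"
x    <- c(10, 11, 12, 13, 14, 80, 20, 21, 22, 23, 24)
node <- rep(c("A", "B"), c(6, 5))
res  <- mad_outlier_score(x, node)
print(res[res$outlier, ])
```

On this toy data only the value 80 is flagged: node "A" has median 12.5 and MAD 1.5, so its score is 67.5 / 1.5 = 45. A node whose MAD is zero (more than half of its values identical) would need separate handling, since the score is then undefined or infinite.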
