A Data Mining Tutorial

David Madigan
dmadigan@rci.rutgers.edu
http://stat.rutgers.edu/~madigan

Overview
• Brief Introduction to Data Mining
• Data Mining Algorithms
• Specific Examples
  – Algorithms: Disease Clusters
  – Algorithms: Model-Based Clustering
  – Algorithms: Frequent Items and Association Rules
• Future Directions, etc.

Of "laws", Monsters, and Giants…
• Moore's law: processing "capacity" doubles every 18 months: CPU, cache, memory
• Its more aggressive cousin:
  – Disk storage "capacity" doubles every 9 months
• What do the two "laws" combined produce? A rapidly growing gap between our ability to generate data and our ability to make use of it.
[Figure: Disk TB shipped per year, 1988 to 2000, log scale from 1E+3 to 1E+7 with the ExaByte level marked. Disk TB growth: 112%/y; Moore's Law: 58.7%/y. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]

What is Data Mining?
Finding interesting structure in data
• Structure: refers to statistical patterns, predictive models, hidden relationships
• Examples of tasks addressed by Data Mining
  – Predictive Modeling (classification, regression)
  – Segmentation (Data Clustering)
  – Summarization
  – Visualization

Ronny Kohavi, ICML 1998

Stories: Online Retailing

Data Mining Algorithms
"A data mining algorithm is a well-defined procedure that takes data as input and produces output in the form of models or patterns" (Hand, Mannila, and Smyth)
• "well-defined": can be encoded in software
• "algorithm": must terminate after some finite number of steps

Algorithm Components
1. The task the algorithm is used to address (e.g. classification, clustering, etc.)
2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model)
3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.)
4. The search or optimization method used to search over parameters and/or structures (e.g. steepest descent, MCMC, etc.)
5. The data management technique used for storing, indexing, and retrieving data (critical when the data are too large to reside in memory)

Backpropagation data mining algorithm
[Figure: network with inputs x1, x2, x3, x4, hidden nodes h1, h2, and output y]
$$s_1 = \sum_{i=1}^{4} \alpha_i x_i ; \qquad s_2 = \sum_{i=1}^{4} \beta_i x_i$$
$$h(s_i) = \frac{1}{1 + e^{-s_i}}$$
$$y = \sum_{i=1}^{2} w_i h_i$$
• vector of p input values multiplied by p × d1 weight matrix
• resulting d1 values individually transformed by non-linear function
• resulting d1 values multiplied by d1 × d2 weight matrix

Backpropagation (cont.)
Parameters: $\alpha_1, \ldots, \alpha_4, \beta_1, \ldots, \beta_4, w_1, w_2$
Score: $S_{SSE} = \sum_{i=1}^{n} (y(i) - \hat{y}(i))^2$
Search: steepest descent; search for structure?

Models and Patterns
Models
• Prediction
  – Linear regression
  – Piecewise linear
  – Nonparametric regression
  – Classification: logistic regression, naïve Bayes/TAN/Bayesian networks, NN, support vector machines, trees, etc.
• Probability Distributions
  – Parametric models
  – Mixtures of parametric models
  – Graphical Markov models (categorical, continuous, mixed)
• Structured Data
  – Time series
  – Markov models
  – Mixture Transition Distribution models
  – Hidden Markov models
  – Spatial models

Bias-Variance Tradeoff
• High Bias - Low Variance
• Low Bias - High Variance: "overfitting" - modeling the random component
• Score function should embody the compromise

The Curse of Dimensionality
$X \sim MVN_p(0, I)$
• Gaussian kernel density estimation
• Bandwidth chosen to minimize MSE at the mean
• Suppose we want: $\frac{E[(\hat{p}(x) - p(x))^2]}{p(x)^2} < 0.1$ at $x = 0$

Dimension    # data points
1            4
2            19
3            67
6            2,790
10           842,000

Patterns
• Global
  – Clustering via partitioning
  – Hierarchical Clustering
  – Mixture Models
• Local
  – Outlier detection
  – Changepoint detection
  – Bump hunting
  – Scan statistics
  – Association rules

Scan Statistics via Permutation Tests
[Figure: a curve with "x" marks scattered along it]
• The curve represents a road
• Each "x" marks an accident
• Red "x" denotes an injury accident
• Black "x" means no injury
• Is there a stretch of road where there is an unusually large fraction of injury accidents?

Scan with Fixed Window
• If we know the length of the "stretch of road" that we seek, we could slide a window of that length along the road and find the most "unusual" window location
[Figure: the road with a fixed-length window placed over part of it]

How Unusual is a Window?
• Let $p_W$ and $p_{\neg W}$ denote the true probability of being red inside and outside the window respectively. Let $(x_W, n_W)$ and $(x_{\neg W}, n_{\neg W})$ denote the corresponding counts
• Use the GLRT for comparing $H_0: p_W = p_{\neg W}$ versus $H_1: p_W \neq p_{\neg W}$
$$\lambda = \frac{\left[\frac{x_W + x_{\neg W}}{n_W + n_{\neg W}}\right]^{x_W + x_{\neg W}} \left[1 - \frac{x_W + x_{\neg W}}{n_W + n_{\neg W}}\right]^{n_W + n_{\neg W} - x_W - x_{\neg W}}}{\left(\frac{x_W}{n_W}\right)^{x_W} \left[1 - \frac{x_W}{n_W}\right]^{n_W - x_W} \left(\frac{x_{\neg W}}{n_{\neg W}}\right)^{x_{\neg W}} \left[1 - \frac{x_{\neg W}}{n_{\neg W}}\right]^{n_{\neg W} - x_{\neg W}}}$$
• $\lambda$ measures how unusual a window is (smaller means more unusual); $-2 \log \lambda$ here has an asymptotic chi-square distribution with 1 df

Permutation Test
• Since we look at the smallest $\lambda$ over all window locations, we need to find the distribution of the smallest $\lambda$ under the null hypothesis that there are no clusters
• Look at the distribution of the smallest $\lambda$ over, say, 999 random relabellings of the colors of the x's
[Figure: four relabelled copies of the road with smallest $\lambda$ = 0.376, 0.233, 0.412, 0.222, …]
• Look at the position of the observed smallest $\lambda$ in this distribution to get the scan statistic p-value (e.g., if the observed smallest $\lambda$ is the 5th smallest, the p-value is 0.005)

Variable Length Window
• No need to use a fixed-length window. Examine all possible windows up to, say, half the length of the entire road
[Figure legend: one marker = fatal accident, the other = non-fatal accident]

Spatial Scan Statistics
• Spatial scan statistic uses, e.g., circles instead of line segments

Spatial-Temporal Scan Statistics
• Spatial-temporal scan statistic uses cylinders where the height of the cylinder represents a time window

Other Issues
• Poisson model also common (instead of the Bernoulli model)
• Covariate adjustment
• Andrew Moore's group at CMU: efficient algorithms for scan statistics

Software: SaTScan + others
http://www.satscan.org
http://www.phrl.org
http://www.terraseer.com

Association Rules: Support and Confidence
[Figure: Venn diagram with regions "Customer buys beer", "Customer buys diaper", and "Customer buys both"]
• Find all the rules Y ⇒ Z with minimum confidence and support
  – support, s, probability that a transaction contains {Y & Z}
  – confidence, c, conditional probability that a transaction having {Y} also contains Z

Transaction ID    Items Bought
2000              A,B,C
1000              A,C
4000              A,D
5000              B,E,F

Let minimum support be 50% and minimum confidence be 50%; then we have
  – A ⇒ C (50%, 66.6%)
  – C ⇒ A (50%, 100%)

Mining Association Rules: An Example
Using the same four transactions, with min. support 50% and min. confidence 50%:

Frequent Itemset    Support
{A}                 75%
{B}                 50%
{C}                 50%
{A,C}               50%

For rule A ⇒ C:
  support = support({A & C}) = 50%
  confidence = support({A & C}) / support({A}) = 66.6%
The Apriori principle: Any subset of a frequent itemset must be frequent

Mining Frequent Itemsets: the Key Step
• Find the frequent itemsets: the sets of items that have minimum support
  – A subset of a frequent itemset must also be a frequent itemset
    • i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets
  – Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
• Use the frequent itemsets to generate association rules.
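The support and confidence figures quoted for the four-transaction table can be checked with a few lines of Python. This is only a minimal sketch: the names `transactions`, `support`, and `confidence` are illustrative choices, not part of the original slides.

```python
# Transactions from the slide, keyed by transaction ID.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


print("support(A => C)    =", support({"A", "C"}, transactions))
print("confidence(A => C) =", confidence({"A"}, {"C"}, transactions))
print("confidence(C => A) =", confidence({"C"}, {"A"}, transactions))
```

Both rules share the same support because support depends only on the combined itemset {A, C}; the confidences differ because the denominators support({A}) and support({C}) differ.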
The Apriori Algorithm
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;

The Apriori Algorithm: Example

Database D
TID    Items
100    1 3 4
200    2 3 5
300    1 2 3 5
400    2 5

Scan D to get C1
itemset    sup.
{1}        2
{2}        3
{3}        3
{4}        1
{5}        3

L1 (itemsets in C1 with sup. ≥ 2)
itemset    sup.
{1}        2
{2}        3
{3}        3
{5}        3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to count C2
itemset    sup.
{1 2}      1
{1 3}      2
{1 5}      1
{2 3}      2
{2 5}      3
{3 5}      2

L2
itemset    sup.
{1 3}      2
{2 3}      2
{2 5}      3
{3 5}      2

C3 (generated from L2): {2 3 5}

Scan D to get L3
itemset    sup.
{2 3 5}    2

Association Rule Mining: A Road Map
• Boolean vs. quantitative associations (based on the types of values handled)
  – buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
  – age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
• Single dimension vs. multiple dimensional associations (see the examples above)
• Single level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?
• Various extensions (thousands!)
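The join, prune, and count steps of the Apriori pseudo-code can be sketched in Python and run on Database D from the example. The function name `apriori` and the parameter `min_count` (an absolute support count, here assumed to be 2) are illustrative, and the join step is a simplified one that unions pairs of frequent k-itemsets rather than the prefix-based join used in tuned implementations.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep those meeting the minimum count.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}

    frequent = dict(current)
    k = 1
    while current:
        # Join step: union pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Prune step: every k-subset of a candidate must itself be frequent.
            if all(frozenset(sub) in current for sub in combinations(union, k)):
                candidates.add(union)

        # Scan the database once to count each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Database D from the example slide (TIDs 100, 200, 300, 400).
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(D, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With `min_count=2` the sketch recovers the itemsets in L1, L2, and L3 of the example; the candidates {1 2 3} and {1 3 5} never reach the counting scan because {1 2} and {1 5} are not frequent.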
Model-based Clustering
$$f(x) = \sum_{k=1}^{K} \pi_k f_k(x; \theta_k)$$
[Figure sequence (Padhraic Smyth, UCI): scatter plots titled "ANEMIA PATIENTS AND CONTROLS", Red Blood Cell Volume (x-axis, 3.3 to 4) against Red Blood Cell Hemoglobin Concentration (y-axis, 3.7 to 4.4). Panels: the raw data, then EM iterations 1, 3, 5, 10, 15, and 25.]

Mixtures of {Sequences, Curves, …}
$$p(D_i) = \sum_{k=1}^{K} p(D_i \mid c_k) \, \pi_k$$
Generative Model
- select a component $c_k$ for individual i
- generate data according to $p(D_i \mid c_k)$
- $p(D_i \mid c_k)$ can be very general
- e.g., sets of sequences, spatial patterns, etc.
[Note: given $p(D_i \mid c_k)$, we can define an EM algorithm]

Example: Mixtures of SFSMs
Simple model for traversal on a Web site (equivalent to first-order Markov with end-state)
Generative model for large sets of Web users
- different behaviors <=> mixture of SFSMs
EM algorithm is quite simple: weighted counts
WebCanvas: Cadez, Heckerman, et al., KDD 2000

Discussion
• What is data mining? Hard to pin down – who cares?
• Textbook statistical ideas with a new focus on algorithms
• Lots of new ideas too

Privacy and Data Mining
Ronny Kohavi, ICML 1998
