Data Mining in Large Databases

(Contributing slides by Gregory Piatetsky-Shapiro and by Rajeev Rastogi and Kyuseok Shim, Lucent Bell Laboratories)

Overview
• Introduction
• Association Rules
• Classification
• Clustering

Background
• Corporations have huge databases containing a wealth of information
• Business databases potentially constitute a goldmine of valuable business information
• Very little functionality in database systems to support data mining applications
• Data mining: the efficient discovery of previously unknown patterns in large databases

Applications
• Fraud Detection
• Loan and Credit Approval
• Market Basket Analysis
• Customer Segmentation
• Financial Applications
• E-Commerce
• Decision Support
• Web Search

Data Mining Techniques
• Association Rules
• Sequential Patterns
• Classification
• Clustering
• Similar Time Sequences
• Similar Images
• Outlier Discovery
• Text/Web Mining

Examples of Patterns
• Association rules
  • 98% of people who purchase diapers buy beer
• Classification
  • People with age less than 25 and salary > 40k drive sports cars
• Similar time sequences
  • Stocks of companies A and B perform similarly
• Outlier detection
  • Residential customers with businesses at home

Association Rules
• Given:
  • A database of customer transactions
  • Each transaction is a set of items
• Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
  • Any number of items in the consequent or antecedent of a rule
  • Possible to specify constraints on rules (e.g., find only rules involving expensive imported products)

Association Rules: Sample Applications
• Market basket analysis
• Attached mailing in direct marketing
• Fraud detection for medical insurance
• Department store floor/shelf planning

Confidence and Support
• A rule must have some minimum user-specified confidence
  • 1 & 2 => 3 has 90% confidence if, when a customer bought 1 and 2, the customer also bought 3 in 90% of cases
• A rule must have some minimum user-specified support
  • 1 & 2 => 3 should hold in some minimum percentage of transactions to have business value

Example

Transaction Id   Purchased Items
1                {1, 2, 3}
2                {1, 4}
3                {1, 3}
4                {2, 5, 6}

• For minimum support = 50% and minimum confidence = 50%, we have the following rules:
  • 1 => 3 with 50% support and 66% confidence
  • 3 => 1 with 50% support and 100% confidence

Problem Decomposition
1. Find all sets of items that have minimum support
   • Use the Apriori algorithm
2. Use the frequent itemsets to generate the desired rules
   • Generation is straightforward

Problem Decomposition Example
For minimum support = 50% and minimum confidence = 50%:

TID   Items
1     {1, 2, 3}
2     {1, 3}
3     {1, 4}
4     {2, 5, 6}

Frequent Itemset   Support
{1}                75%
{2}                50%
{3}                50%
{1, 3}             50%

For the rule 1 => 3:
• Support = Support({1, 3}) = 50%
• Confidence = Support({1, 3}) / Support({1}) = 66%

The Apriori Algorithm
• Fk: set of frequent itemsets of size k
• Ck: set of candidate itemsets of size k

    F1 = {frequent ("large") items}
    for (k = 1; Fk != ∅; k++) do {
        Ck+1 = new candidates generated from Fk
        foreach transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t
        Fk+1 = candidates in Ck+1 with minimum support
    }
    Answer = ∪k Fk

Key Observation
• Every subset of a frequent itemset is also frequent
  => a candidate itemset in Ck+1 can be pruned if even one of its size-k subsets is not contained in Fk

Apriori - Example
(minimum support count of 3 transactions, as implied by F1)

Database D
TID   Items
1     {1, 3, 4}
2     {2, 3, 5}
3     {1, 2, 3, 5}
4     {2, 5}

Scan D to count C1
Itemset   Sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

F1
Itemset   Sup.
{2}       3
{3}       3
{5}       3

C2 (generated from F1), then scan D to count
Itemset   Sup.
{2, 3}    2
{2, 5}    3
{3, 5}    2

F2
Itemset   Sup.
{2, 5}    3

Sequential Patterns
• Given:
  • A sequence of customer transactions
  • Each transaction is a set of items
• Find all maximal sequential patterns supported by more than a user-specified percentage of customers
• Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction

Classification
• Given:
  • Database of tuples, each assigned a class label
• Develop a model/profile for each class
  • Example profile (good credit): (25 <= age <= 40 and income > 40k) or (married = YES)
• Sample applications:
  • Credit card approval (good, bad)
  • Bank locations (good, fair, poor)
  • Treatment effectiveness (good, fair, poor)

Decision Tree
• An internal node is a test on an attribute.
• A branch represents an outcome of the test, e.g., Color=red.
• A leaf node represents a class label or class label distribution.
• At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
• A new case is classified by following a matching path to a leaf node.

Decision Trees

Outlook    Temperature   Humidity   Windy   Play?
sunny      hot           high       false   No
sunny      hot           high       true    No
overcast   hot           high       false   Yes
rain       mild          high       false   Yes
rain       cool          normal     false   Yes
rain       cool          normal     true    No
overcast   cool          normal     true    Yes
sunny      mild          high       false   No
sunny      cool          normal     false   Yes
rain       mild          normal     false   Yes
sunny      mild          normal     true    Yes
overcast   mild          high       true    Yes
overcast   hot           normal     false   Yes
rain       mild          high       true    No

Example Tree

Outlook
• sunny: Humidity
  • high: No
  • normal: Yes
• overcast: Yes
• rain: Windy
  • true: No
  • false: Yes

Decision Tree Algorithms
• Building phase
  • Recursively split nodes using the best splitting attribute for each node
• Pruning phase
  • A smaller, imperfect decision tree generally achieves better accuracy on unseen data
  • Prune leaf nodes recursively to prevent over-fitting

Attribute Selection
• Which is the best attribute?
  • The one which will result in the smallest tree
  • Heuristic: choose the attribute that produces the “purest” nodes
• Popular impurity criterion: information gain
  • Information gain increases with the average purity of the subsets that an attribute produces
• Strategy: choose the attribute that results in the greatest information gain

Which attribute to select?

Computing information
• Information is measured in bits
  • Given a probability distribution, the information required to predict an event is the distribution’s entropy
  • Entropy gives the information required in bits (this can involve fractions of bits!)
• Formula for computing the entropy (logarithms are base 2):

$$\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$$

Example: attribute “Outlook”
• “Outlook” = “Sunny”:
  $\mathrm{info}([2,3]) = \mathrm{entropy}(2/5, 3/5) = -\tfrac{2}{5}\log\tfrac{2}{5} - \tfrac{3}{5}\log\tfrac{3}{5} = 0.971$ bits
• “Outlook” = “Overcast”:
  $\mathrm{info}([4,0]) = \mathrm{entropy}(1, 0) = -1\log 1 - 0\log 0 = 0$ bits (the term $0 \log 0$ is taken to be 0)
• “Outlook” = “Rainy”:
  $\mathrm{info}([3,2]) = \mathrm{entropy}(3/5, 2/5) = -\tfrac{3}{5}\log\tfrac{3}{5} - \tfrac{2}{5}\log\tfrac{2}{5} = 0.971$ bits
• Expected information for the attribute:
  $\mathrm{info}([2,3], [4,0], [3,2]) = \tfrac{5}{14} \times 0.971 + \tfrac{4}{14} \times 0 + \tfrac{5}{14} \times 0.971 = 0.693$ bits

Computing the information gain
• Information gain: (information before split) - (information after split)
  $\mathrm{gain}(\text{Outlook}) = \mathrm{info}([9,5]) - \mathrm{info}([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247$ bits
• Information gain for attributes from the weather data:
  • gain("Outlook") = 0.247 bits
  • gain("Temperature") = 0.029 bits
  • gain("Humidity") = 0.152 bits
  • gain("Windy") = 0.048 bits

Continuing to split (within the Outlook = sunny subset)
• gain("Humidity") = 0.971 bits
• gain("Temperature") = 0.571 bits
• gain("Windy") = 0.020 bits

The final decision tree
• Note: not all leaves need to be pure; sometimes identical instances have different classes
• Splitting stops when the data can’t be split any further

Decision Trees
• Pros
  • Fast execution time
  • Generated rules are easy for humans to interpret
  • Scale well to large data sets
  • Can handle high-dimensional data
• Cons
  • Cannot capture correlations among attributes
  • Consider only axis-parallel cuts

Clustering
• Given:
  • Data points and number of desired clusters K
• Group the data points into K clusters
  • Data points within clusters are more similar than across clusters
• Sample applications:
  • Customer segmentation
  • Market basket customer analysis
  • Attached mailing in direct marketing
  • Clustering companies with similar growth

Traditional Algorithms: Partitional algorithms
• Enumerate K partitions optimizing some criterion
• Example: square-error criterion

$$E = \sum_{i=1}^{K} \sum_{p \in C_i} \lVert p - m_i \rVert^2$$

  • $m_i$ is the mean of cluster $C_i$

K-means Algorithm
• Assign initial means
• Assign each point to the cluster for the closest mean
• Compute new mean for each cluster
• Iterate until criterion function converges

K-means example, step 1
• Pick 3 initial cluster centers k1, k2, k3 (randomly)

K-means example, step 2
• Assign each point to the closest cluster center

K-means example, step 3
• Move each cluster center to the mean of its cluster

K-means example, step 4
• Reassign points that are now closest to a different cluster center
• Q: Which points are reassigned?
K-means example, step 4 (continued)
• A: three points

K-means example, step 4b
• Re-compute cluster means

K-means example, step 5
• Move cluster centers to cluster means

Discussion
• Result can vary significantly depending on the initial choice of seeds
• Can get trapped in a local minimum
  • Example (figure): instances and initial cluster centers
• To increase the chance of finding the global optimum: restart with different random seeds

K-means clustering summary
• Advantages
  • Simple, understandable
  • Items automatically assigned to clusters
• Disadvantages
  • Must pick the number of clusters beforehand
  • All items forced into a cluster
  • Too sensitive to outliers

Traditional Algorithms: Hierarchical clustering
• Nested partitions
• Tree structure

Agglomerative Hierarchical Algorithms
• The most commonly used hierarchical clustering algorithms
• Initially each point is a distinct cluster
• Repeatedly merge the closest clusters until the number of clusters becomes K
  • Closest:
    $d_{mean}(C_i, C_j) = \lVert m_i - m_j \rVert$
    $d_{min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} \lVert p - q \rVert$
  • Likewise $d_{ave}(C_i, C_j)$ and $d_{max}(C_i, C_j)$

Similar Time Sequences
• Given:
  • A set of time-series sequences
• Find
  • All sequences similar to the query sequence
  • All pairs of similar sequences
  • Whole matching vs. subsequence matching
• Sample applications
  • Financial market
  • Scientific databases
  • Medical diagnosis

Whole Sequence Matching: Basic Idea
• Extract k features from every sequence
• Every sequence is then represented as a point in k-dimensional space
• Use a multi-dimensional index to store and search these points
  • Spatial indices do not work well for high-dimensional data

Similar Time Sequences
• Take Euclidean distance as the similarity measure
• Obtain the Discrete Fourier Transform (DFT) coefficients of each sequence in the database
• Build a multi-dimensional index using the first few Fourier coefficients
• Use the index to retrieve sequences that are at most a user-specified distance away from the query sequence
• Post-processing:
  • Compute the actual distance between sequences in the time domain

Outlier Discovery
• Given:
  • Data points and number of outliers (= n) to find
• Find the top n outlier points
  • Outliers are considerably dissimilar from the remainder of the data
• Sample applications:
  • Credit card fraud detection
  • Telecom fraud detection
  • Medical analysis

Statistical Approaches
• Model the underlying distribution that generates the dataset (e.g., normal distribution)
• Use discordancy tests depending on
  • data distribution
  • distribution parameters (e.g., mean, variance)
  • number of expected outliers
• Drawbacks
  • Most tests are for a single attribute
  • In many cases, the data distribution may not be known

Distance-based Outliers
• For a fraction p and a distance d,
  • a point o is an outlier if at least a fraction p of the points lie at a distance greater than d from o
• General enough to model statistical outlier tests
• Develop nested-loop and cell-based algorithms
• Scale acceptably to large datasets
• Cell-based algorithm does not scale well to high dimensions
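
Sketch: the Apriori loop on Database D

The Apriori pseudocode can be tried out on Database D from the Apriori example. The following is a minimal Python sketch, not the authors' implementation. The names `apriori` and `min_count` are illustrative, and the threshold of 3 transactions is an assumption inferred from the example's F1.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    # F1: count single items with one scan of the database.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    answer = dict(frequent)

    k = 1
    while frequent:
        prev = set(frequent)
        # Candidate generation: join pairs of frequent k-itemsets.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) != k + 1:
                    continue
                # Prune: every size-k subset must itself be frequent.
                if all(frozenset(s) in prev for s in combinations(union, k)):
                    candidates.add(union)
        # Counting: one scan of the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        answer.update(frequent)
        k += 1
    return answer


# Database D from the Apriori example (assumed minimum support count: 3).
D = [frozenset(t) for t in ({1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5})]
result = apriori(D, min_count=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On this input the sketch yields {2}, {3}, {5} and {2, 5}, each with support count 3, which agrees with F1 and F2 in the example.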
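
Sketch: support and confidence of a rule

The second step of the problem decomposition, generating rules from a frequent itemset, can be illustrated on the four-transaction example from the Confidence and Support slides. This is a small sketch under stated assumptions: the helper names `support` and `rules_from_itemset` are hypothetical, and support is recomputed by scanning the transactions rather than looked up from the Apriori counts.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rules_from_itemset(itemset, transactions, min_conf):
    """List (antecedent, consequent, confidence) for rules X => Y with X ∪ Y = itemset."""
    itemset = frozenset(itemset)
    whole = support(itemset, transactions)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            x = frozenset(antecedent)
            # Confidence = Support(X ∪ Y) / Support(X)
            conf = whole / support(x, transactions)
            if conf >= min_conf:
                rules.append((set(x), set(itemset - x), conf))
    return rules


# Transactions from the Example slide.
T = [frozenset(t) for t in ({1, 2, 3}, {1, 4}, {1, 3}, {2, 5, 6})]
rules = rules_from_itemset({1, 3}, T, min_conf=0.5)
for x, y, conf in rules:
    print(x, "=>", y, f"confidence {conf:.0%}")
```

For the itemset {1, 3} this gives 1 => 3 with confidence 2/3 and 3 => 1 with confidence 1, both at 50% support.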
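
Sketch: information gain on the weather data

The entropy and information-gain figures from the attribute selection slides can be checked directly against the 14-row weather table. The sketch below assumes base-2 logarithms (since information is measured in bits); the names `entropy`, `info_gain` and `WEATHER` are illustrative.

```python
from collections import Counter
from math import log2

# Columns: Outlook, Temperature, Humidity, Windy, Play?
WEATHER = [row.split() for row in """
sunny hot high false No
sunny hot high true No
overcast hot high false Yes
rain mild high false Yes
rain cool normal false Yes
rain cool normal true No
overcast cool normal true Yes
sunny mild high false No
sunny cool normal false Yes
rain mild normal false Yes
sunny mild normal true Yes
overcast mild high true Yes
overcast hot normal false Yes
rain mild high true No
""".strip().splitlines()]


def entropy(labels):
    """Entropy in bits of the class distribution in labels."""
    n = len(labels)
    # Classes with zero count never appear in the Counter, so log2(0) is never evaluated.
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(rows, attr_index, label_index=-1):
    """(Information before split) - (expected information after split)."""
    before = entropy([r[label_index] for r in rows])
    after = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[label_index] for r in rows if r[attr_index] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after


for index, name in enumerate(["Outlook", "Temperature", "Humidity", "Windy"]):
    print(f"gain({name}) = {info_gain(WEATHER, index):.3f} bits")

# Gains within the Outlook = sunny branch ("Continuing to split").
SUNNY = [r for r in WEATHER if r[0] == "sunny"]
```

Outlook has the largest gain at the root, which is why it is chosen as the first split in the example tree.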
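
Sketch: the K-means iteration

The four K-means steps (assign initial means, assign points, recompute means, iterate) can be written in a few lines. This is a minimal sketch, not a production implementation: it assumes squared Euclidean distance, takes the initial means as an argument instead of picking them randomly, and stops when the means no longer change, which stands in for "criterion function converges". The sample points are made up for illustration.

```python
def kmeans(points, means, max_iter=100):
    """Return (means, clusters) after iterating assignment and update steps."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster of the closest mean.
        clusters = [[] for _ in means]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in means]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each mean to the centroid of its cluster.
        # An empty cluster keeps its previous mean.
        new_means = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else m
            for cl, m in zip(clusters, means)
        ]
        if new_means == means:
            break
        means = new_means
    return means, clusters


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
means, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(means)
print(clusters)
```

Because the result depends on the initial means, a common remedy, as noted in the Discussion slide, is to rerun with different seeds and keep the run with the lowest square error.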
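
Sketch: nested-loop distance-based outliers

The distance-based outlier definition lends itself to a direct nested-loop check. The sketch below is a naive quadratic version for illustration only, not the optimized nested-loop or cell-based algorithms the slides refer to. It assumes Euclidean distance and measures the fraction p over the other points (excluding o itself); the function name `db_outliers` and the sample data are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def db_outliers(points, p, d):
    """Return points o such that at least a fraction p of the other points lie farther than d from o."""
    outliers = []
    n = len(points)
    for i, o in enumerate(points):
        # Inner loop: count the other points lying at distance greater than d.
        far = sum(1 for j, q in enumerate(points) if j != i and dist(o, q) > d)
        if far >= p * (n - 1):
            outliers.append(o)
    return outliers


# Hypothetical data: four points in a unit square and one distant point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
found = db_outliers(data, p=0.9, d=3.0)
print(found)
```

Here only (10, 10) qualifies: all four other points are farther than 3 away from it, while each point in the square has just one such neighbour.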
