Data Mining in Large Databases
(Contributing Slides by Gregory Piatetsky-Shapiro and Rajeev Rastogi and Kyuseok Shim Lucent Bell laboratories)
Overview
Introduction
Association Rules
Classification
Clustering
Background
Corporations have huge databases containing a wealth of information
Business databases potentially constitute a goldmine of valuable business information
Very little functionality in database systems to support data mining applications
Data mining: the efficient discovery of previously unknown patterns in large databases
Applications
Fraud Detection
Loan and Credit Approval
Market Basket Analysis
Customer Segmentation
Financial Applications
E-Commerce
Decision Support
Web Search
Data Mining Techniques
Association Rules
Sequential Patterns
Classification
Clustering
Similar Time Sequences
Similar Images
Outlier Discovery
Text/Web Mining
Examples of Patterns
Association rules
  98% of people who purchase diapers buy beer
Classification
  People with age less than 25 and salary > 40k drive sports cars
Similar time sequences
  Stocks of companies A and B perform similarly
Outlier detection
  Residential customers with businesses at home
Association Rules
Given:
  A database of customer transactions
  Each transaction is a set of items
Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
  Any number of items in the consequent or antecedent of a rule
  Possible to specify constraints on rules (e.g., find only rules involving expensive imported products)
Association Rules
Sample applications
  Market basket analysis
  Attached mailing in direct marketing
  Fraud detection for medical insurance
  Department store floor/shelf planning
Confidence and Support
A rule must have some minimum user-specified confidence
  1 & 2 => 3 has 90% confidence if, when a customer bought 1 and 2, the customer also bought 3 in 90% of cases
A rule must have some minimum user-specified support
  1 & 2 => 3 should hold in some minimum percentage of transactions to have business value
Example
Transaction Id | Purchased Items
1              | {1, 2, 3}
2              | {1, 4}
3              | {1, 3}
4              | {2, 5, 6}
For minimum support = 50% and minimum confidence = 50%, we have the following rules:
  1 => 3 with 50% support and 66% confidence
  3 => 1 with 50% support and 100% confidence
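The support and confidence figures for this example can be checked with a few lines of Python. This is only a minimal sketch: the function names `support` and `confidence` and the dictionary layout are illustrative choices, not part of the slides.

```python
# Transactions from the example (transaction id -> purchased items).
transactions = {
    1: {1, 2, 3},
    2: {1, 4},
    3: {1, 3},
    4: {2, 5, 6},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X and Y together) / support(X) for the rule X => Y."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({1, 3}, transactions))       # 0.5
print(confidence({1}, {3}, transactions))  # 2/3, shown as 66% on the slide
print(confidence({3}, {1}, transactions))  # 1.0
```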
Problem Decomposition
1. Find all sets of items that have minimum support (the frequent itemsets)
   Use the Apriori algorithm
2. Use the frequent itemsets to generate the desired rules
   Generation is straightforward
Problem Decomposition Example
For minimum support = 50% and minimum confidence = 50%
Frequent Itemset | Support
{1}              | 75%
{2}              | 50%
{3}              | 50%
{1, 3}           | 50%

TID | Items
1   | {1, 2, 3}
2   | {1, 3}
3   | {1, 4}
4   | {2, 5, 6}
For the rule 1 => 3:
  Support = Support({1, 3}) = 50%
  Confidence = Support({1, 3}) / Support({1}) = 66%
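Step 2 of the decomposition, generating rules from the frequent itemsets, might look like the sketch below. It takes the support table from this slide as input; `generate_rules` and its return format are hypothetical names chosen for illustration.

```python
from itertools import combinations

# Frequent itemsets and their supports, copied from the table on this slide.
frequent = {
    frozenset({1}): 0.75,
    frozenset({2}): 0.50,
    frozenset({3}): 0.50,
    frozenset({1, 3}): 0.50,
}

def generate_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) for each rule kept."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent, so the
                # antecedent's support is already in the table.
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, sup, conf))
    return rules

for antecedent, consequent, sup, conf in generate_rules(frequent, 0.5):
    print(set(antecedent), "=>", set(consequent), sup, round(conf, 2))
```

No database scan is needed here because confidence is a ratio of supports that step 1 has already computed.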
The Apriori Algorithm
Fk : Set of frequent itemsets of size k
Ck : Set of candidate itemsets of size k

F1 = {large items}
for (k = 1; Fk != ∅; k++) do {
    Ck+1 = New candidates generated from Fk
    foreach transaction t in the database do
        Increment the count of all candidates in Ck+1 that are contained in t
    Fk+1 = Candidates in Ck+1 with minimum support
}
Answer = ∪k Fk
Key Observation
Every subset of a frequent itemset is also frequent
=> a candidate itemset in Ck+1 can be pruned if even one of its size-k subsets is not contained in Fk
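The level-wise loop and the pruning observation can be sketched in Python as follows. This is an illustrative, unoptimized version: the function name `apriori`, the use of `frozenset` keys, and the simple join step (unions of two frequent k-itemsets) are assumptions made for the sketch, and the database is a small example with four transactions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)

    def count(candidates):
        # One scan of the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    items = {item for t in transactions for item in t}
    level = count({frozenset([i]) for i in items})  # F1
    frequent = dict(level)
    k = 1
    while level:
        # Join: unions of two frequent k-itemsets that have exactly k + 1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: drop any candidate with a k-subset that is not frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = count(candidates)  # F(k+1)
        frequent.update(level)
        k += 1
    return frequent

db = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]
result = apriori(db, min_support=0.5)
for itemset, sup in result.items():
    print(sorted(itemset), sup)
```

With a 50% minimum support this small database yields the frequent itemsets {1}, {2}, {3} and {1, 3}.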
Apriori - Example
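Database D
TID | Items
1   | {1, 3, 4}
2   | {2, 3, 5}
3   | {1, 2, 3, 5}
4   | {2, 5}

Scan D to count the candidates in C1
Itemset | Sup.
{1}     | 2
{2}     | 3
{3}     | 3
{4}     | 1
{5}     | 3

F1 (the counts imply a minimum support count of 3)
Itemset | Sup.
{2}     | 3
{3}     | 3
{5}     | 3

C2 (candidates generated from F1)
Itemset
{2, 3}
{2, 5}
{3, 5}

Scan D to count the candidates in C2
Itemset | Sup.
{2, 3}  | 2
{2, 5}  | 3
{3, 5}  | 2

F2
Itemset | Sup.
{2, 5}  | 3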
Sequential Patterns
Given:
  A sequence of customer transactions
  Each transaction is a set of items
Find all maximal sequential patterns supported by more than a user-specified percentage of customers
Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction
Classification
Given:
  Database of tuples, each assigned a class label
Develop a model/profile for each class
  Example profile (good credit): (25 <= age <= 40 and income > 40k) or (married = YES)
Sample applications:
  Credit card approval (good, bad)
  Bank locations (good, fair, poor)
  Treatment effectiveness (good, fair, poor)
Decision Tree
An internal node is a test on an attribute.
A branch represents an outcome of the test, e.g., Color=red.
A leaf node represents a class label or class label distribution.
At each node, one attribute is chosen to split training examples into distinct classes as much as possible.
A new case is classified by following a matching path to a leaf node.
Decision Trees
Outlook  | Temperature | Humidity | Windy | Play?
sunny    | hot         | high     | false | No
sunny    | hot         | high     | true  | No
overcast | hot         | high     | false | Yes
rain     | mild        | high     | false | Yes
rain     | cool        | normal   | false | Yes
rain     | cool        | normal   | true  | No
overcast | cool        | normal   | true  | Yes
sunny    | mild        | high     | false | No
sunny    | cool        | normal   | false | Yes
rain     | mild        | normal   | false | Yes
sunny    | mild        | normal   | true  | Yes
overcast | mild        | high     | true  | Yes
overcast | hot         | normal   | false | Yes
rain     | mild        | high     | true  | No
Example Tree
Outlook
  sunny -> Humidity
    high -> No
    normal -> Yes
  overcast -> Yes
  rain -> Windy
    true -> No
    false -> Yes
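Classifying a new case by following a matching path to a leaf can be sketched as below. The nested-dictionary encoding of the example tree and the function name `classify` are assumptions made for illustration; the tree itself is the one from this slide.

```python
# The example tree as nested dicts: an internal node maps one attribute name to
# its branches, and a leaf is simply the class label.
tree = {"Outlook": {
    "sunny": {"Humidity": {"high": "No", "normal": "Yes"}},
    "overcast": "Yes",
    "rain": {"Windy": {"true": "No", "false": "Yes"}},
}}

def classify(node, case):
    """Follow the branch matching `case` at each test until a leaf is reached."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[case[attribute]]
    return node

case = {"Outlook": "sunny", "Humidity": "normal", "Windy": "true"}
print(classify(tree, case))  # Yes
```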
Decision Tree Algorithms
Building phase
  Recursively split nodes using the best splitting attribute for each node
Pruning phase
  A smaller, imperfect decision tree generally achieves better accuracy on unseen data
  Prune leaf nodes recursively to prevent over-fitting
Attribute Selection
Which is the best attribute?
  The one which will result in the smallest tree
  Heuristic: choose the attribute that produces the “purest” nodes
Popular impurity criterion: information gain
  Information gain increases with the average purity of the subsets that an attribute produces
Strategy: choose the attribute that results in the greatest information gain
Which attribute to select?
Computing information
Information is measured in bits
Given a probability distribution, the info required to predict an event is the distribution’s entropy
Entropy gives the information required in bits (this can involve fractions of bits!)
Formula for computing the entropy (log is base 2, since information is measured in bits):
$\text{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$
Example: attribute “Outlook”
“Outlook” = “Sunny”:
$\text{info}([2,3]) = \text{entropy}(2/5, 3/5) = -2/5 \log(2/5) - 3/5 \log(3/5) = 0.971$ bits
“Outlook” = “Overcast”:
$\text{info}([4,0]) = \text{entropy}(1, 0) = -1 \log(1) - 0 \log(0) = 0$ bits (taking $0 \log 0$ as 0)
“Outlook” = “Rainy”:
$\text{info}([3,2]) = \text{entropy}(3/5, 2/5) = -3/5 \log(3/5) - 2/5 \log(2/5) = 0.971$ bits
Expected information for attribute:
$\text{info}([2,3],[4,0],[3,2]) = (5/14) \times 0.971 + (4/14) \times 0 + (5/14) \times 0.971 = 0.693$ bits
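The arithmetic for “Outlook” can be reproduced with a short Python sketch. The helper names `entropy`, `info` and `expected_info` mirror the notation on the slides but are otherwise illustrative; logarithms are base 2 and terms with probability 0 are skipped, which matches treating 0 log 0 as 0.

```python
from math import log2

def entropy(*probabilities):
    """Entropy in bits; terms with p = 0 contribute nothing."""
    return sum(-p * log2(p) for p in probabilities if p > 0)

def info(counts):
    """Entropy of the class distribution given by raw class counts."""
    total = sum(counts)
    return entropy(*(c / total for c in counts))

def expected_info(*subsets):
    """Average of info over the subsets, weighted by subset size."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * info(s) for s in subsets)

print(round(info([2, 3]), 3))  # 0.971
print(info([4, 0]))            # 0.0
# About 0.6935; the slide shows this value as 0.693.
print(expected_info([2, 3], [4, 0], [3, 2]))
```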
Computing the information gain
Information gain: (information before split) – (information after split)
$\text{gain}(\text{``Outlook''}) = \text{info}([9,5]) - \text{info}([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247$ bits
Information gain for attributes from weather data:
$\text{gain}(\text{``Outlook''}) = 0.247$ bits
$\text{gain}(\text{``Temperature''}) = 0.029$ bits
$\text{gain}(\text{``Humidity''}) = 0.152$ bits
$\text{gain}(\text{``Windy''}) = 0.048$ bits
Continuing to split
$\text{gain}(\text{``Humidity''}) = 0.971$ bits
$\text{gain}(\text{``Temperature''}) = 0.571$ bits
$\text{gain}(\text{``Windy''}) = 0.020$ bits
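The gains quoted for the weather data, both at the root and after splitting on Outlook = sunny, can be recomputed from the data table. The sketch below assumes the 14 rows of the weather table from the Decision Trees slide; the names `ROWS`, `ATTRIBUTES`, `info` and `gain` are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

ATTRIBUTES = ("Outlook", "Temperature", "Humidity", "Windy")
# Weather data from the Decision Trees slide; the last field is Play?.
ROWS = [
    ("sunny", "hot", "high", "false", "No"),
    ("sunny", "hot", "high", "true", "No"),
    ("overcast", "hot", "high", "false", "Yes"),
    ("rain", "mild", "high", "false", "Yes"),
    ("rain", "cool", "normal", "false", "Yes"),
    ("rain", "cool", "normal", "true", "No"),
    ("overcast", "cool", "normal", "true", "Yes"),
    ("sunny", "mild", "high", "false", "No"),
    ("sunny", "cool", "normal", "false", "Yes"),
    ("rain", "mild", "normal", "false", "Yes"),
    ("sunny", "mild", "normal", "true", "Yes"),
    ("overcast", "mild", "high", "true", "Yes"),
    ("overcast", "hot", "normal", "false", "Yes"),
    ("rain", "mild", "high", "true", "No"),
]

def info(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-c / total * log2(c / total) for c in Counter(labels).values())

def gain(rows, attribute):
    """Information before the split minus expected information after it."""
    column = ATTRIBUTES.index(attribute)
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    after = sum(len(g) / len(rows) * info(g) for g in groups.values())
    return info([row[-1] for row in rows]) - after

print("Root:")
for attribute in ATTRIBUTES:
    print(" ", attribute, round(gain(ROWS, attribute), 3))

print("Within Outlook = sunny:")
sunny = [row for row in ROWS if row[0] == "sunny"]
for attribute in ATTRIBUTES[1:]:
    print(" ", attribute, round(gain(sunny, attribute), 3))
```

Outlook has the largest gain at the root, and Humidity has the largest gain inside the sunny branch, which is consistent with the example tree.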
The final decision tree
Note: not all leaves need to be pure; sometimes identical instances have different classes
Splitting stops when data can’t be split any further
Decision Trees
Pros
  Fast execution time
  Generated rules are easy to interpret by humans
  Scale well for large data sets
  Can handle high dimensional data
Cons
  Cannot capture correlations among attributes
  Consider only axis-parallel cuts
Clustering
Given:
  Data points and number of desired clusters K
Group the data points into K clusters
  Data points within clusters are more similar than across clusters
Sample applications:
  Customer segmentation
  Market basket customer analysis
  Attached mailing in direct marketing
  Clustering companies with similar growth
Traditional Algorithms
Partitional algorithms
Enumerate K partitions optimizing some criterion
Example: square-error criterion
$\sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2$
$m_i$ is the mean of cluster $C_i$
K-means Algorithm
Assign initial means
Assign each point to the cluster for the closest mean
Compute new mean for each cluster
Iterate until criterion function converges
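A minimal Python sketch of these steps follows. Several details are assumptions made for illustration rather than anything the slides prescribe: the initial means are k randomly sampled data points, the loop stops once the means no longer change (at which point the square-error criterion has also stopped changing), and an empty cluster keeps its previous mean.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def square_error(clusters, means):
    """Sum of squared distances from each point to the mean of its cluster."""
    return sum(squared_distance(p, m)
               for cluster, m in zip(clusters, means) for p in cluster)

def k_means(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    means = rng.sample(points, k)  # assign initial means
    for _ in range(max_iterations):
        # Assign each point to the cluster with the closest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, means[i]))
            clusters[nearest].append(p)
        # Compute the new mean of each cluster.
        new_means = [mean(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:  # nothing moved, so the criterion has converged
            break
        means = new_means
    return clusters, means

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, means = k_means(points, k=2)
print(means, square_error(clusters, means))
```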
K-means example, step 1
Pick 3 initial cluster centers (randomly)
[Figure: scatter plot on X and Y axes with cluster centers k1, k2, k3]
K-means example, step 2
Assign each point to the closest cluster center
[Figure: scatter plot on X and Y axes with cluster centers k1, k2, k3]
K-means example, step 3
Move each cluster center to the mean of each cluster
[Figure: scatter plot on X and Y axes with old and new positions of k1, k2, k3]
K-means example, step 4
Reassign points closest to a different new cluster center
Q: Which points are reassigned?
[Figure: scatter plot on X and Y axes with cluster centers k1, k2, k3]
K-means example, step 4 …
A: three points (animated on the slide)
[Figure: scatter plot on X and Y axes with cluster centers k1, k2, k3]
K-means example, step 4b
Re-compute cluster means
[Figure: scatter plot on X and Y axes with cluster centers k1, k2, k3]
K-means example, step 5
Move cluster centers to cluster means
[Figure: scatter plot on X and Y axes with cluster centers k1, k2, k3]
Discussion
Result can vary significantly depending on initial choice of seeds
Can get trapped in a local minimum
Example:
[Figure: instances and initial cluster centers]
To increase chance of finding global optimum: restart with different random seeds
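The effect of the seeds, and the restart remedy, can be illustrated with a small sketch. The four-point data set, the helper names `k_means` and `best_of_restarts`, and the choice of 20 restarts are all assumptions for illustration. Here `k_means` takes explicit initial means so that a bad and a good start can be compared directly.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, means, max_iterations=100):
    """Plain K-means from the given initial means; returns (square error, means)."""
    k = len(means)
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: sq_dist(p, means[i]))].append(p)
        new_means = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else m
                     for cl, m in zip(clusters, means)]
        if new_means == means:
            break
        means = new_means
    error = sum(min(sq_dist(p, m) for m in means) for p in points)
    return error, means

def best_of_restarts(points, k, restarts=20, seed=0):
    """Run K-means from several random starts and keep the lowest square error."""
    rng = random.Random(seed)
    runs = [k_means(points, rng.sample(points, k)) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0])

# Corners of a wide rectangle: the best split is left/right.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
print(k_means(points, [(0.0, 0.0), (0.0, 1.0)])[0])  # 16.0 (stuck in a top/bottom split)
print(k_means(points, [(0.0, 0.0), (4.0, 0.0)])[0])  # 1.0 (left/right split)
print(best_of_restarts(points, k=2)[0])
```

Starting with both seeds on the same short side leaves the algorithm in a stable but poor top/bottom split, while restarting from several random seeds makes it very likely that at least one run finds the lower-error left/right split.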
K-means clustering summary
Advantages
  Simple, understandable
  Items automatically assigned to clusters
Disadvantages
  Must pick number of clusters beforehand
  All items forced into a cluster
  Too sensitive to outliers
Traditional Algorithms
Hierarchical clustering
Nested partitions
Tree structure
Agglomerative Hierarchical Algorithms
Most widely used hierarchical clustering algorithm
Initially each point is a distinct cluster
Repeatedly merge closest clusters until the number of clusters becomes K
Closest:
  $d_{mean}(C_i, C_j) = \lVert m_i - m_j \rVert$
  $d_{min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} \lVert p - q \rVert$
  Likewise $d_{ave}(C_i, C_j)$ and $d_{max}(C_i, C_j)$
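The merge loop can be sketched naively in Python as below. This is an unoptimized illustration that recomputes all pairwise cluster distances on every merge; the function names and the example points are assumptions, and `d_min` and `d_mean` follow the two distance definitions on this slide.

```python
from math import dist  # Euclidean distance, Python 3.8+

def d_min(ci, cj):
    """Distance between the closest pair of points, one from each cluster."""
    return min(dist(p, q) for p in ci for q in cj)

def d_mean(ci, cj):
    """Distance between the two cluster means."""
    def centroid(c):
        return tuple(sum(x) / len(c) for x in zip(*c))
    return dist(centroid(ci), centroid(cj))

def agglomerative(points, k, distance=d_min):
    clusters = [[p] for p in points]  # initially each point is its own cluster
    while len(clusters) > k:
        # Find the closest pair of clusters (i < j) and merge them.
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters.pop(j)
        clusters[i].extend(merged)
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(points, k=3))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```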
Similar Time Sequences
Given:
  A set of time-series sequences
Find
  All sequences similar to the query sequence
  All pairs of similar sequences
  Whole matching vs. subsequence matching
Sample applications
  Financial market
  Scientific databases
  Medical diagnosis
Whole Sequence Matching
Basic idea
  Extract k features from every sequence
  Every sequence is then represented as a point in k-dimensional space
  Use a multi-dimensional index to store and search these points
  Spatial indices do not work well for high-dimensional data
Similar Time Sequences
Take Euclidean distance as the similarity measure
Obtain Discrete Fourier Transform (DFT) coefficients of each sequence in the database
Build a multi-dimensional index using the first few Fourier coefficients
Use the index to retrieve sequences that are at most a given distance away from the query sequence
Post-processing: compute the actual distance between sequences in the time domain
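The filter-then-verify idea can be sketched in plain Python. Several things here are assumptions for illustration: the distance threshold is called `epsilon`, a linear scan stands in for the multi-dimensional index, and the DFT uses a 1/sqrt(n) scaling so that Euclidean distance is the same in the time and frequency domains. Under that scaling, the distance over only the first few coefficients cannot exceed the true distance, so the filter step should not discard a true match (up to floating-point rounding).

```python
import cmath
from math import sqrt

def dft(sequence):
    """DFT with 1/sqrt(n) scaling, which preserves Euclidean distance."""
    n = len(sequence)
    return [sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                for t, x in enumerate(sequence)) / sqrt(n)
            for f in range(n)]

def features(sequence, k):
    """First k Fourier coefficients: a point in k-dimensional (complex) space."""
    return dft(sequence)[:k]

def distance(a, b):
    return sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

def similar_sequences(database, query, epsilon, k=2):
    q = features(query, k)
    # Filter step (a real system would query a multi-dimensional index here).
    candidates = [s for s in database if distance(features(s, k), q) <= epsilon]
    # Post-processing: compute the actual distance in the time domain.
    return [s for s in candidates if distance(s, query) <= epsilon]

database = [[1, 2, 3, 4], [1, 2, 3, 5], [10, 0, 10, 0], [4, 3, 2, 1]]
query = [1, 2, 3, 4.5]
print(similar_sequences(database, query, epsilon=1.0))
# [[1, 2, 3, 4], [1, 2, 3, 5]]
```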
Outlier Discovery
Given:
  Data points and number of outliers (= n) to find
Find top n outlier points
  Outliers are considerably dissimilar from the remainder of the data
Sample applications:
  Credit card fraud detection
  Telecom fraud detection
  Medical analysis
Statistical Approaches
Model underlying distribution that generates dataset (e.g. normal distribution)
Use discordancy tests depending on
  data distribution
  distribution parameter (e.g. mean, variance)
  number of expected outliers
Drawbacks
  Most tests are for a single attribute
  In many cases, data distribution may not be known
Distance-based Outliers
For a fraction p and a distance d, a point o is an outlier if at least a fraction p of the points lie at a distance greater than d from o
General enough to model statistical outlier tests
Nested-loop and cell-based algorithms have been developed
  Scale acceptably for large datasets
  Cell-based algorithm does not scale well for high dimensions
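A nested-loop check of this definition might look like the following sketch. It is a plain quadratic scan without the block-wise optimizations of a real nested-loop algorithm, and it assumes the fraction p is taken over all points in the dataset; the function name and the example data are illustrative.

```python
from math import dist  # Euclidean distance, Python 3.8+

def distance_based_outliers(points, p, d):
    """Points o for which at least a fraction p of all points lie farther than d."""
    n = len(points)
    outliers = []
    for o in points:                                     # outer loop: candidate
        far = sum(1 for q in points if dist(o, q) > d)   # inner loop: count far points
        if far >= p * n:
            outliers.append(o)
    return outliers

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(distance_based_outliers(points, p=0.75, d=3.0))  # [(10, 10)]
```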