Chapter 2: Data Mining
1. Overview of data mining
2. Association rules
3. Classification
4. Clustering
5. Other data mining problems
6. Applications of data mining

Assoc. Prof. Dr. D. T. Anh


DATA MINING



 

Data mining refers to the mining or discovery of new information, in the form of patterns or rules, from vast amounts of data. To be practically useful, data mining must be carried out efficiently on large files and databases. This chapter briefly reviews the state of the art of this rather extensive field. Data mining uses techniques from such areas as:
- machine learning
- statistics
- neural networks
- genetic algorithms


1. OVERVIEW OF DATA MINING

Data Mining as a Part of the Knowledge Discovery Process.


Knowledge Discovery in Databases, abbreviated as KDD, encompasses more than data mining.



The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.


Example


 





Consider a transaction database maintained by a specialty consumer goods retailer. Suppose the client data includes customer name, zip code, phone number, date of purchase, item code, price, quantity, and total amount. A variety of new knowledge can be discovered by KDD processing on this client database.
- During data selection, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected.
- The data cleansing process then may correct invalid zip codes or eliminate records with incorrect phone prefixes.
- Enrichment enhances the data with additional sources of information. For example, given the client names and phone numbers, the store may purchase other data about age, income, and credit rating and append them to each record.
- Data transformation and encoding may be done to reduce the amount of data.


Example (cont.)
The result of mining may be to discover the following types of “new” information:
 



- Association rules – e.g., whenever a customer buys video equipment, he or she also buys another electronic gadget.
- Sequential patterns – e.g., suppose a customer buys a camera, and within three months he or she buys photographic supplies; then within six months he or she is likely to buy accessory items. This defines a sequential pattern of transactions. A customer who buys more than twice in regular periods may be likely to buy at least once during the Christmas period.
- Classification trees – e.g., customers may be classified by frequency of visits, by types of financing used, by amount of purchase, or by affinity for types of items, and some revealing statistics may be generated for such classes.










We can see that many possibilities exist for discovering new knowledge about buying patterns, relating factors such as age, income group, and place of residence to what and how much the customers purchase. This information can then be utilized:
- to plan additional store locations based on demographics,
- to run store promotions,
- to combine items in advertisements, or
- to plan seasonal marketing strategies.
As this retail store example shows, data mining must be preceded by significant data preparation before it can yield useful information that can directly influence business decisions. The results of data mining may be reported in a variety of formats, such as listings, graphic outputs, summary tables, or visualizations.


Goals of Data Mining and Knowledge Discovery


Data mining is typically carried out with certain end goals in mind. These goals fall into the following classes:








- Prediction – Data mining can show how certain attributes within the data will behave in the future.
- Identification – Data patterns can be used to identify the existence of an item, an event, or an activity.
- Classification – Data mining can partition the data so that different classes or categories can be identified based on combinations of parameters.
- Optimization – One eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales or profits under a given set of constraints.

Data Mining: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories:
  - object-oriented and object-relational databases
  - spatial databases
  - time-series data and temporal data
  - text databases and multimedia databases
  - heterogeneous and legacy databases
  - World Wide Web

Types of Knowledge Discovered During Data Mining.
 



Data mining addresses inductive knowledge, which discovers new rules and patterns from the supplied data. Knowledge can be represented in many forms: in an unstructured sense, it can be represented by rules; in a structured form, it may be represented in decision trees, semantic networks, or hierarchies of classes or frames. It is common to describe the knowledge discovered during data mining in five ways:


- Association rules – These rules correlate the presence of a set of items with another range of values for another set of variables.


Types of Knowledge Discovered (cont.)


 



- Classification hierarchies – The goal is to work from an existing set of events or transactions to create a hierarchy of classes.
- Patterns within time series.
- Sequential patterns – A sequence of actions or events is sought. Detection of sequential patterns is equivalent to detecting associations among events with certain temporal relationships.
- Clustering – A given population of events can be partitioned into sets of “similar” elements.


Main phases of the KDD process
- Learning the application domain: relevant prior knowledge and the goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
- Choosing the functions of data mining: summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of the discovered knowledge


2. ASSOCIATION RULES

What Is Association Rule Mining?




Association rule mining finds frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

Applications: basket data analysis, cross-marketing, catalog design, clustering, classification, etc.



Rule form: “Body ⇒ Head [support, confidence]”.

Association rule mining

Examples:
buys(x, “diapers”) ⇒ buys(x, “beers”) [0.5%, 60%]
major(x, “CS”) ∧ takes(x, “DB”) ⇒ grade(x, “A”) [1%, 75%]

Association Rule Mining Problem:
- Given: (1) a database of transactions, where (2) each transaction is a list of items (purchased by a customer in a visit).
- Find: all rules that correlate the presence of one set of items with that of another set of items.
- E.g., 98% of people who purchase tires and auto accessories also get automotive services done.

Rule Measures: Support and Confidence








Let J = {i1, i2, …, im} be a set of items. Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ J. A transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ J, B ⊂ J and A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., both A and B). This is taken to be the probability P(A ∪ B). The rule A ⇒ B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B.


Support and confidence
That is:
- support, s: the probability that a transaction contains {A ∪ B}: s = P(A ∪ B)
- confidence, c: the conditional probability that a transaction containing A also contains B: c = P(B|A)
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.


Frequent item set






A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. An itemset satisfies minimum support if its occurrence frequency is greater than or equal to the product of min_sup and the total number of transactions in D. The number of transactions required for the itemset to satisfy minimum support is referred to as the minimum support count. If an itemset satisfies minimum support, then it is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk.

Example 2.1
Transaction-ID   Items_bought
-----------------------------
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have:
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
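These figures can be checked with a few lines of Python. This is an illustrative sketch (the function names `support` and `confidence` are not from any library); the transaction sets mirror the table above.

```python
# Transactions from Example 2.1, as sets of items.
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """confidence(A => B) = support(A ∪ B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"A", "C"}, transactions))        # prints 0.5
print(confidence({"A"}, {"C"}, transactions))   # A => C: 0.666...
print(confidence({"C"}, {"A"}, transactions))   # C => A: prints 1.0
```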

Types of Association Rules
Boolean vs. quantitative associations (Based on the types of values handled)
buys(x, “SQLServer”) ∧ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”) [0.2%, 60%]
age(x, “30..39”) ∧ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]

Single-dimensional vs. multi-dimensional associations
A rule that references two or more dimensions, such as the dimensions buys, income, and age, is a multi-dimensional association rule.

Single-level vs. multiple-level analysis
Some methods for association rule mining can find rules at different levels of abstraction. For example, suppose that a set of association rules mined includes the following rules:
age(x, “30..39”) ⇒ buys(x, “laptop computer”)
age(x, “30..39”) ⇒ buys(x, “computer”)
in which “computer” is a higher-level abstraction of “laptop computer”.

How to mine association rules from large databases?


Association rule mining is a two-step process:
1. Find all frequent itemsets (the sets of items that have minimum support).
   - A subset of a frequent itemset must also be a frequent itemset (the Apriori principle); i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
   - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
2. Generate strong association rules from the frequent itemsets.



The overall performance of mining association rules is determined by the first step.

The Apriori Algorithm








Apriori is an important algorithm for mining frequent itemsets for Boolean association rules. The Apriori algorithm employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found; this set is denoted L1. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database. To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.

Apriori property
 



Apriori property: All nonempty subsets of a frequent itemset must also be frequent. The Apriori property is based on the following observation. By definition, if an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent, that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset, I ∪ A, cannot occur more frequently than I. Therefore, I ∪ A is not frequent either, i.e., P(I ∪ A) < min_sup. This property belongs to a special category of properties called anti-monotone, in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well.


Finding Lk using Lk-1.


A two-step process is used in finding Lk using Lk-1:
- Join step: Ck is generated by joining Lk-1 with itself.
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.


Pseudo code
Ck: the set of candidate itemsets of size k
Lk: the set of frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = apriori_gen(Lk, min_sup);
    for each transaction t in the database do    // scan D for counts
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

procedure apriori_gen(Lk: frequent k-itemsets, min_sup: minimum support threshold)
(1)  for each itemset l1 ∈ Lk
(2)    for each itemset l2 ∈ Lk
(3)      if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k-1] = l2[k-1]) ∧ (l1[k] < l2[k]) then {
(4)        c = l1 ⋈ l2;
(5)        if some k-subset s of c ∉ Lk then
(6)          delete c;    // prune step: remove unfruitful candidate
(7)        else add c to Ck+1;
(8)      }
(9)  return Ck+1;
(10) end procedure
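The pseudocode above can be sketched in Python as follows. This is an illustrative, unoptimized implementation: itemsets are represented as frozensets, minimum support is taken as an absolute count, and the join step merges any two frequent k-itemsets whose union has k+1 items rather than sorting items lexicographically.

```python
from itertools import combinations

def apriori(transactions, min_sup_count):
    """Level-wise frequent-itemset mining; returns {itemset: support count}."""
    # L1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup_count}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # Join step: merge pairs of frequent k-itemsets into (k+1)-candidates
        candidates = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1:
                    # Prune step: every k-subset of the candidate must be frequent
                    if all(frozenset(s) in Lk for s in combinations(u, k)):
                        candidates.add(u)
        # One full database scan to count the surviving candidates
        ccounts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    ccounts[c] += 1
        Lk = {c for c, n in ccounts.items() if n >= min_sup_count}
        frequent.update((c, ccounts[c]) for c in Lk)
        k += 1
    return frequent

# The nine transactions of Example 2.2, with minimum support count 2
db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
      {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
      {"I1", "I2", "I3"}]
freq = apriori(db, 2)
```

Running this on the Example 2.2 database reproduces the L1 through L3 tables that follow: the only frequent 3-itemsets are {I1, I2, I3} and {I1, I2, I5}, and no 4-itemset survives the prune step.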


Example 2.2:
TID    List of item_IDs
-----------------------
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3


Assume that the minimum transaction support count required is 2 (i.e., min_sup = 2/9 ≈ 22%).

C1:
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

L1:
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

C2:
Itemset    Sup. count
{I1, I2}   4
{I1, I3}   4
{I1, I4}   1
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
{I3, I4}   0
{I3, I5}   1
{I4, I5}   0

L2:
Itemset    Sup. count
{I1, I2}   4
{I1, I3}   4
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2


C3:
Itemset         Sup. count
{I1, I2, I3}    2
{I1, I2, I5}    2
{I1, I3, I5}    pruned
{I1, I2, I4}    pruned
{I1, I4, I5}    pruned
{I2, I3, I4}    pruned
{I2, I4, I5}    pruned

L3:
Itemset         Sup. count
{I1, I2, I3}    2
{I1, I2, I5}    2

C4 = {{I1, I2, I3, I5}}; this candidate is pruned (its subset {I1, I3, I5} is not in L3), so L4 = ∅.


Generating Association Rules from Frequent Itemsets




Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them. This can be done using the following equation for confidence, where the conditional probability is expressed in terms of itemset support count:

confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)

where support_count(X) is the number of transactions containing the itemset X.



Based on this equation, association rules can be generated as follows:
- For each frequent itemset l, generate all nonempty subsets of l.
- For every nonempty subset s of l, output the rule “s ⇒ (l – s)” if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.



Since the rules are generated from frequent itemsets, each one automatically satisfies minimum support.

Example 2.3. From Example 2.2, suppose the data contain the frequent itemset l = {I1, I2, I5}. The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2} and {I5}. The resulting association rules are as shown below:

I1 ∧ I2 ⇒ I5      confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2      confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1      confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5      confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5      confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2      confidence = 2/2 = 100%

If the minimum confidence threshold is, say, 70%, then only the second, third and last rules above are output.
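The rule-generation step can be sketched as follows. This is illustrative code: it assumes a dict mapping each frequent itemset to its support count, which any frequent-itemset miner could supply; the support counts below are those of Example 2.2.

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Yield (antecedent, consequent, confidence) triples for strong rules.

    `frequent` maps frozenset itemsets to their support counts."""
    rules = []
    for l, l_count in frequent.items():
        if len(l) < 2:
            continue
        # every nonempty proper subset s of l
        for r in range(1, len(l)):
            for s in combinations(l, r):
                s = frozenset(s)
                conf = l_count / frequent[s]   # support_count(l) / support_count(s)
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules

# Support counts from Example 2.2
freq = {
    frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I5"]): 2,
    frozenset(["I1", "I2"]): 4, frozenset(["I1", "I5"]): 2,
    frozenset(["I2", "I5"]): 2, frozenset(["I1", "I2", "I5"]): 2,
}
strong = generate_rules(freq, min_conf=0.7)
```

Besides the three rules from l = {I1, I2, I5} listed in Example 2.3, `strong` also contains the rules generated from the 2-itemsets themselves (e.g., I5 ⇒ I1 with confidence 100%), since the procedure is applied to every frequent itemset of size two or more.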

3. CLASSIFICATION






Classification is the process of learning a model that describes different classes of data. The classes are predetermined. Example: in a banking application, customers who apply for a credit card may be classified as a “good risk”, a “fair risk” or a “poor risk”. Hence, this type of activity is also called supervised learning. Once the model is built, it can be used to classify new data.












The first step, learning the model, is accomplished by using a training set of data that has already been classified. Each record in the training data contains an attribute, called the class label, that indicates which class the record belongs to. The model that is produced is usually in the form of a decision tree or a set of rules. Some of the important issues with regard to the model and the algorithm that produces it include:
- the model’s ability to predict the correct class of new data,
- the computational cost associated with the algorithm, and
- the scalability of the algorithm.
Let us examine the approach where the model is in the form of a decision tree. A decision tree is simply a graphical representation of the description of each class or, in other words, a representation of the classification rules.


Example 3.1






Example 3.1: Suppose that we have a database of customers on the AllElectronics mailing list. The database describes attributes of the customers, such as their name, age, income, occupation, and credit rating. The customers can be classified as to whether or not they have purchased a computer at AllElectronics. Suppose that new customers are added to the database and that you would like to notify these customers of an upcoming computer sale. To send out promotional literature to every new customer in the database can be quite costly. A more cost-efficient method would be to target only those new customers who are likely to purchase a new computer. A classification model can be constructed and used for this purpose. Figure 2 shows a decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer.

Figure 2. A decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer. Each internal node represents a test on an attribute; each leaf node represents a class.

Algorithm for decision tree induction
Input: a set of training data records R1, R2, …, Rm and a set of attributes A1, A2, …, An
Output: a decision tree

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).


Conditions for stopping the partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
- There are no samples left.



Procedure Build_tree(Records, Attributes);
Begin
(1)  Create a node N;
(2)  If all Records belong to the same class C then
(3)    Return N as a leaf node with the class label C;
(4)  If Attributes is empty then
(5)    Return N as a leaf node with the class label C such that the majority of Records belong to it;
(6)  Select the attribute Ai (with the highest information gain) from Attributes;
(7)  Label node N with Ai;
(8)  For each known value aj of Ai do begin
(9)    Add a branch from node N for the condition Ai = aj;
(10)   Sj = subset of Records where Ai = aj;
(11)   If Sj is empty then
(12)     Add a leaf L with class label C such that the majority of Records belong to it, and return L
       else
(13)     Add the node returned by Build_tree(Sj, Attributes – {Ai});
     end
End


Attribute Selection Measure


The expected information needed to classify training data of s samples, where the class attribute has m values (a1, …, am) and si is the number of samples belonging to class label ai, is given by:

    I(s1, s2, …, sm) = - Σi=1..m  pi log2(pi)                         (1)

where pi is the probability that a random sample belongs to the class with label ai. An estimate of pi is si/s.

Consider an attribute A with values {a1, …, av} used as the test attribute for splitting in the decision tree. Attribute A partitions the samples into the subsets S1, …, Sv, where the samples in each Sj have the value aj for attribute A. Each Sj may contain samples that belong to any of the classes. The number of samples in Sj that belong to class i can be denoted sij. The expected information based on the partitioning by A is given by:

    E(A) = Σj=1..v  ((s1j + … + smj) / s) · I(s1j, …, smj)            (2)


I(s1j, …, smj) can be defined using the formulation for I(s1, …, sm) with pi replaced by pij = sij/sj. Now the information gain obtained by partitioning on attribute A is defined as:

    Gain(A) = I(s1, s2, …, sm) – E(A)






Example 3.2: Table 1 presents a training set of data tuples taken from the AllElectronics customer database. The class label attribute, buys_computer, has two distinct values; therefore there are two distinct classes (m = 2). Let class C1 correspond to yes and class C2 to no. There are 9 samples of class yes and 5 samples of class no. To compute the information gain of each attribute, we first use Equation (1) to compute the expected information needed to classify a given sample:

    I(s1, s2) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940


Table 1. Training data tuples from the AllElectronics customer database
age      income   student  credit_rating  Class
-----------------------------------------------
<=30     high     no       fair           No
<=30     high     no       excellent      No
31…40    high     no       fair           Yes
>40      medium   no       fair           Yes
>40      low      yes      fair           Yes
>40      low      yes      excellent      No
31…40    low      yes      excellent      Yes
<=30     medium   no       fair           No
<=30     low      yes      fair           Yes
>40      medium   yes      fair           Yes
<=30     medium   yes      excellent      Yes
31…40    medium   no       excellent      Yes
31…40    high     yes      fair           Yes
>40      medium   no       excellent      No

Next, we need to compute the entropy of each attribute. Let’s start with the attribute age. We need to look at the distribution of yes and no samples for each value of age and compute the expected information for each of these distributions.

For age = “<=30”:   s11 = 2, s21 = 3
    I(s11, s21) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
For age = “31…40”:  s12 = 4, s22 = 0
    I(s12, s22) = -(4/4) log2(4/4) - (0/4) log2(0/4) = 0
For age = “>40”:    s13 = 3, s23 = 2
    I(s13, s23) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971

Using Equation (2), the expected information needed to classify a given sample if the samples are partitioned according to age is:

    E(age) = (5/14)·I(s11, s21) + (4/14)·I(s12, s22) + (5/14)·I(s13, s23) = (10/14)·0.971 = 0.694




Hence, the gain in information from such a partitioning would be
Gain(age) = I(s1, s2) – E(age) = 0.940 – 0.694 = 0.246





Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute’s values. The samples are then partitioned accordingly, as shown in Figure 3.
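The calculations in this example can be reproduced with a short script. This is an illustrative sketch: the helper names `info` and `gain` are not from any library, and the rows follow Table 1.

```python
from math import log2

def info(counts):
    """Expected information I(s1, ..., sm) for a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(rows, attr, class_attr="Class"):
    """Information gain of splitting `rows` (a list of dicts) on `attr`."""
    def class_counts(subset):
        labels = [r[class_attr] for r in subset]
        return [labels.count(v) for v in set(labels)]
    total = len(rows)
    expected = 0.0
    for v in {r[attr] for r in rows}:          # one partition per attribute value
        subset = [r for r in rows if r[attr] == v]
        expected += len(subset) / total * info(class_counts(subset))
    return info(class_counts(rows)) - expected

# Table 1, the AllElectronics training data
data = [
    ("<=30", "high", "no", "fair", "No"), ("<=30", "high", "no", "excellent", "No"),
    ("31...40", "high", "no", "fair", "Yes"), (">40", "medium", "no", "fair", "Yes"),
    (">40", "low", "yes", "fair", "Yes"), (">40", "low", "yes", "excellent", "No"),
    ("31...40", "low", "yes", "excellent", "Yes"), ("<=30", "medium", "no", "fair", "No"),
    ("<=30", "low", "yes", "fair", "Yes"), (">40", "medium", "yes", "fair", "Yes"),
    ("<=30", "medium", "yes", "excellent", "Yes"), ("31...40", "medium", "no", "excellent", "Yes"),
    ("31...40", "high", "yes", "fair", "Yes"), (">40", "medium", "no", "excellent", "No"),
]
cols = ["age", "income", "student", "credit_rating", "Class"]
rows = [dict(zip(cols, r)) for r in data]

print(round(info([9, 5]), 3))        # prints 0.94
print(round(gain(rows, "age"), 3))   # prints 0.247 (0.246 in the text, which rounds I and E first)
```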


Figure 3. The samples partitioned according to age:

age = “<=30”:
income   student  credit_rating  class
high     no       fair           no
high     no       excellent      no
medium   no       fair           no
low      yes      fair           yes
medium   yes      excellent      yes

age = “31…40”:
income   student  credit_rating  class
high     no       fair           yes
low      yes      excellent      yes
medium   no       excellent      yes
high     yes      fair           yes

age = “>40”:
income   student  credit_rating  class
medium   no       fair           yes
low      yes      fair           yes
low      yes      excellent      no
medium   yes      fair           yes
medium   no       excellent      no


Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules:
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction.
- The leaf node holds the class prediction.
- Rules are easier for humans to understand.

Example:
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
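Path-per-rule extraction can be sketched with a small nested-dict tree. The dict layout is an assumption for illustration, not a standard representation; the branch classes follow the training data in Table 1.

```python
# Illustrative nested-dict encoding of the buys_computer decision tree:
# internal nodes are {"attr": ..., "branches": {value: subtree}}, leaves are labels.
tree = {
    "attr": "age",
    "branches": {
        "<=30": {"attr": "student", "branches": {"no": "no", "yes": "yes"}},
        "31...40": "yes",
        ">40": {"attr": "credit_rating", "branches": {"excellent": "no", "fair": "yes"}},
    },
}

def extract_rules(node, conditions=()):
    """Emit one IF-THEN rule per root-to-leaf path."""
    if isinstance(node, str):                     # leaf: the class prediction
        conds = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {conds} THEN buys_computer = "{node}"']
    rules = []
    for value, subtree in node["branches"].items():
        rules += extract_rules(subtree, conditions + ((node["attr"], value),))
    return rules

for rule in extract_rules(tree):
    print(rule)
```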



Neural Networks and Classification








Neural networks are a technique derived from AI research that uses generalized approximation and provides an iterative method to carry it out. ANNs use the curve-fitting approach to infer a function from a set of samples. This technique provides a “learning approach”; it is driven by a test sample that is used for the initial inference and learning. With this kind of learning method, responses to new inputs may be interpolated from the known samples. This interpolation, however, depends on the model developed by the learning method.

ANNs can be classified into two categories: supervised and unsupervised networks. Adaptive methods that attempt to reduce the output error are supervised learning methods, whereas those that develop internal representations without sample outputs are called unsupervised learning methods.

ANNs can learn from information on a specific problem. They perform well on classification tasks and are therefore useful in data mining.

Other Classification Methods
- k-nearest neighbor classifier
- case-based reasoning
- genetic algorithms
- rough set approach
- fuzzy set approaches


The k-Nearest Neighbor Algorithm








All instances (samples) correspond to points in an n-dimensional space. The nearest neighbors are defined in terms of Euclidean distance. The Euclidean distance between two points X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) is:

    d(X, Y) = √( Σi=1..n (xi – yi)² )

When given an unknown sample xq, the k-nearest neighbor classifier searches the space for the k training samples that are closest to xq. The unknown sample is assigned the most common class among its k nearest neighbors, so the algorithm has to vote to determine that most common class. When k = 1, the unknown sample is assigned the class of the training sample closest to it in the space. Once we have obtained xq’s k nearest neighbors using the distance function, it is time for the neighbors to vote in order to determine xq’s class.


Two approaches are common:
- Majority voting: all votes are equal. For each class, we count how many of the k neighbors have that class, and we return the class with the most votes.
- Inverse distance-weighted voting: closer neighbors get higher votes. While there are better-motivated methods, the simplest version is to take a neighbor’s vote to be the inverse of the square of its distance to xq:

      w = 1 / d(xq, xi)²

  We then sum the votes and return the class with the highest total.
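Both voting schemes can be sketched as follows. This is illustrative code (the training points are made up), using the standard-library `math.dist` for Euclidean distance.

```python
from math import dist              # Euclidean distance between points (Python 3.8+)
from collections import defaultdict

def knn_classify(xq, training, k, weighted=False):
    """Classify xq from `training`, a list of (point, label) pairs, using its
    k nearest neighbors with majority or inverse distance-squared voting."""
    neighbors = sorted(training, key=lambda pc: dist(xq, pc[0]))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        if weighted:
            d2 = dist(xq, point) ** 2
            votes[label] += 1.0 / d2 if d2 > 0 else float("inf")
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)

# Made-up 2-D training points
training = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
            ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]
print(knn_classify((2, 2), training, k=3))                  # prints a
print(knn_classify((7, 7), training, k=3, weighted=True))   # prints b
```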

Genetic Algorithms
- GA: based on an analogy to biological evolution.
- Each rule is represented by a string of bits.
- Example: the rule “IF A1 AND NOT A2 THEN C2” can be encoded as the bit string “100”, where the two left bits represent attributes A1 and A2, respectively, and the rightmost bit represents the class. Similarly, the rule “IF NOT A1 AND NOT A2 THEN C1” can be encoded as “001”.








- An initial population is created consisting of randomly generated rules.
- Based on the notion of survival of the fittest, a new population is formed that consists of the fittest rules and their offspring.
- The fitness of a rule is represented by its classification accuracy on a set of training examples.
- Offspring are generated by crossover and mutation.
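A minimal sketch of this bit-string encoding, rule fitness, and crossover (illustrative only: a full GA would add a selection loop and mutation over many generations, the match semantics here are exact attribute equality with no wildcards, and the training samples are made up):

```python
# Bit-string encoding from the example above: a rule is (A1 bit, A2 bit, class bit),
# where class bit 1 means C1 and 0 means C2.
def rule_matches(rule, sample):
    """True when the sample's attribute values equal the rule's attribute bits."""
    return rule[0] == sample[0] and rule[1] == sample[1]

def fitness(rule, training):
    """Classification accuracy of `rule` on the training samples it covers."""
    covered = [(s, c) for s, c in training if rule_matches(rule, s)]
    if not covered:
        return 0.0
    return sum(1 for s, c in covered if c == rule[2]) / len(covered)

def crossover(r1, r2, point=1):
    """Single-point crossover of two bit-string rules."""
    return r1[:point] + r2[point:], r2[:point] + r1[point:]

# "IF A1 AND NOT A2 THEN C2" -> (1, 0, 0); "IF NOT A1 AND NOT A2 THEN C1" -> (0, 0, 1)
training = [((1, 0), 0), ((1, 0), 0), ((0, 0), 1), ((1, 1), 1)]
print(fitness((1, 0, 0), training))    # prints 1.0
```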


4. CLUSTERING

What is Cluster Analysis?
- Cluster: a collection of data objects that are
  - similar to one another within the same cluster, and
  - dissimilar to the objects in other clusters.
- Cluster analysis: grouping a set of data objects into clusters.
- Clustering is unsupervised classification: there are no predefined classes.
- Typical applications:
  - as a stand-alone tool to get insight into the data distribution
  - as a preprocessing step for other algorithms.

General Applications of Clustering
- Pattern recognition
- Spatial data analysis:
  - create thematic maps in GIS by clustering feature spaces
  - detect spatial clusters and explain them in spatial data mining
- Image processing
- Economic science (especially market research)
- World Wide Web:
  - document classification
  - clustering Weblog data to discover groups of similar access patterns

Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
- Land use: identification of areas of similar land use in an earth observation database.
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost.
- City planning: identifying groups of houses according to their house type, value, and geographical location.
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults.


Similarity and Dissimilarity Between Objects




Distances are normally used to measure the similarity or dissimilarity between two data objects. A popular family is the Minkowski distance:

    d(i, j) = ( |xi1 – xj1|^q + |xi2 – xj2|^q + … + |xip – xjp|^q )^(1/q)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer.

If q = 1, d is the Manhattan distance:

    d(i, j) = |xi1 – xj1| + |xi2 – xj2| + … + |xip – xjp|

Euclidean distance
If q = 2, d is the Euclidean distance:

    d(i, j) = √( |xi1 – xj1|² + |xi2 – xj2|² + … + |xip – xjp|² )

Properties:
- d(i, j) ≥ 0
- d(i, i) = 0
- d(i, j) = d(j, i)
- d(i, j) ≤ d(i, k) + d(k, j)

One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.

Partitioning Algorithms: Basic Concept








- Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
- Given k, find the partition into k clusters that optimizes the chosen partitioning criterion:
  - global optimum: exhaustively enumerate all partitions;
  - heuristic methods: the k-means and k-medoids algorithms.
- k-means (MacQueen ’67): each cluster is represented by the center of the cluster.
- k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw ’87): each cluster is represented by one of the objects in the cluster.

The K-Means Clustering Method
  

Input: a database D of m records r1, r2, …, rm and a desired number of clusters k.
Output: a set of k clusters that minimizes the square-error criterion.

Given k, the k-means algorithm is implemented in four steps:
- Step 1: Randomly choose k records as the initial cluster centers.
- Step 2: Assign each record ri to the cluster such that the distance between ri and the cluster centroid (mean) is the smallest among the k clusters.
- Step 3: Recalculate the centroid (mean) of each cluster based on the records assigned to the cluster.
- Step 4: Go back to Step 2; stop when no new assignments are made.

Assoc. Prof. Dr. D. T. Anh

57

The algorithm begins by randomly choosing k records to represent the centroids (means) m1, m2, ..., mk of the clusters C1, C2, ..., Ck. All the records are placed in a given cluster based on the distance between the record and the cluster mean. If the distance between mi and record rj is the smallest among all cluster means, then record rj is placed in cluster Ci. Once all records have been initially placed in a cluster, the mean for each cluster is recomputed. Then the process repeats by examining each record again and placing it in the cluster whose mean is closest. Several iterations may be needed, but the algorithm will converge, although it may terminate at a local optimum.
The terminating condition is usually the square-error criterion, defined as:

E = sum_{i=1}^{k} sum_{p in Ci} |p - mi|^2

where E is the sum of square error for all objects in the database, p is the point in space representing a given object, and mi is the mean of cluster Ci. This criterion tries to make the resulting clusters as compact and as separate as possible.
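A direct translation of this criterion (illustrative; the representation of clusters as lists of points is my own choice):

```python
def square_error(clusters, means):
    """E = sum over clusters Ci of the squared Euclidean distance
    from each point p in Ci to the cluster mean mi."""
    return sum(
        sum((pd - md) ** 2 for pd, md in zip(p, m))
        for cluster, m in zip(clusters, means)
        for p in cluster
    )

clusters = [[(1.0, 1.0), (3.0, 1.0)], [(10.0, 10.0)]]
means = [(2.0, 1.0), (10.0, 10.0)]
# contributions: (1 + 1) from the first cluster, 0 from the second, so E = 2
E = square_error(clusters, means)
```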
Comments on the K-Means Method
Strengths:
- Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
- Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

Weaknesses:
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance.
- Unable to handle noisy data and outliers.
Example 4.1: Consider the k-means clustering algorithm working with the (2-dimensional) records in Table 2. Assume that the number of desired clusters k is 2.

RID   Age   Years of Service
----------------------------
 1    30     5
 2    50    25
 3    50    15
 4    25     5
 5    30    10
 6    55    25

Let the algorithm choose the record with RID 3 for cluster C1 and the record with RID 6 for cluster C2 as the initial cluster centroids. The first iteration:

distance(r1, C1) = sqrt((50-30)^2 + (15-5)^2) = 22.4; distance(r1, C2) = 32.0, so r1 ∈ C1.
distance(r2, C1) = 10.0 and distance(r2, C2) = 5.0, so r2 ∈ C2.
distance(r4, C1) = 26.9 and distance(r4, C2) = 36.1, so r4 ∈ C1.
distance(r5, C1) = 20.6 and distance(r5, C2) = 29.2, so r5 ∈ C1.

Now the new means (centroids) for the two clusters are computed.
The mean of a cluster Ci with n records of m dimensions is the vector:

(1/n * sum_{rj in Ci} rj1, ..., 1/n * sum_{rj in Ci} rjm)

The new mean for C1 is (33.75, 8.75) and the new mean for C2 is (52.5, 25). The second iteration: r1, r4, r5 ∈ C1 and r2, r3, r6 ∈ C2. The means for C1 and C2 are recomputed as (28.3, 6.7) and (51.7, 21.7). In the next iteration, all records stay in their previous clusters and the algorithm terminates.
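The example can be checked with a small k-means sketch. Note that the code takes RID 6 as (55, 25), the value consistent with the centroids (33.75, 8.75) and (52.5, 25) computed in the example; the function names are my own:

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(records, centroids):
    """Plain k-means: assign each record to its nearest centroid,
    recompute the means, and repeat until the assignment stabilizes."""
    assignment = None
    while True:
        new_assignment = [min(range(len(centroids)),
                              key=lambda c: euclid(r, centroids[c]))
                          for r in records]
        if new_assignment == assignment:
            return centroids, assignment
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [r for r, a in zip(records, assignment) if a == c]
            centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))

# Records (Age, Years of Service) from Table 2; seeds are RID 3 and RID 6.
records = [(30, 5), (50, 25), (50, 15), (25, 5), (30, 10), (55, 25)]
means, labels = kmeans(records, [(50, 15), (55, 25)])
# Final means are approximately (28.3, 6.7) and (51.7, 21.7)
```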
Clustering of a set of objects based on the k-means method.
Hierarchical Clustering
A hierarchical clustering method works by grouping data objects into a tree of clusters. In general, there are two types of hierarchical clustering methods:

- Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this category. They differ only in their definition of intercluster similarity.
- Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained or the distance between the two closest clusters rising above a certain threshold.
Agglomerative algorithm
Assume that we are given n data records r1, r2, ..., rn and a function D(Ci, Cj) for measuring the distance between two clusters Ci and Cj. Then an agglomerative algorithm for clustering can be described as follows:

for i = 1, ..., n do let Ci = {ri};
while there is more than one cluster left do
begin
    let Ci and Cj be the pair of clusters that minimizes the distance D(Ck, Ch) between any two clusters;
    Ci = Ci ∪ Cj;
    remove cluster Cj
end
Example
Example 4.2: Figure 4 shows the application of AGNES (Agglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, to a data set of five objects, {a, b, c, d, e}. Initially, AGNES places each object into a cluster of its own. The clusters are then merged step-by-step according to some criterion. For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters. This is a single-link approach, in that each cluster is represented by all of the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters. The cluster merging process repeats until all of the objects are eventually merged to form one cluster.
Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}
In DIANA, all of the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster. The cluster splitting process repeats until, eventually, each new cluster contains only a single object.

In general, divisive methods are more computationally expensive and tend to be less widely used than agglomerative methods.

There are a variety of methods for defining the intercluster distance D(Ck, Ch). However, local pairwise distance measures (i.e., between pairs of clusters) are especially suited to hierarchical methods, since they can be computed directly from pairwise distances of the members of each cluster. One of the earliest and most important of these is the nearest-neighbor or single-link method. It defines the distance between two clusters as the distance between the two closest points, one from each cluster:

Dsl(Ci, Cj) = min{ d(x, y) | x ∈ Ci, y ∈ Cj }

where d(x, y) is the distance between objects x and y.
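The agglomerative pseudocode, combined with the single-link distance Dsl, can be sketched in Python (illustrative; the function names and sample points are my own, and clusters are merged until k remain rather than down to a single cluster):

```python
import math

def single_link_clustering(points, k):
    """Agglomerative clustering with the single-link (nearest neighbor)
    intercluster distance, merging until k clusters remain."""
    def d(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def d_sl(ci, cj):                      # Dsl(Ci, Cj) over cross pairs
        return min(d(x, y) for x in ci for y in cj)

    clusters = [[p] for p in points]       # start: each object its own cluster
    while len(clusters) > k:
        # find the pair of clusters with minimum single-link distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: d_sl(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters[j])    # Ci = Ci U Cj
        del clusters[j]                    # remove Cj
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
result = single_link_clustering(pts, 2)
```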
Major weaknesses of hierarchical clustering methods:

1. They do not scale well: the time complexity is at least O(n^2), where n is the total number of objects. The agglomerative algorithm requires in the first iteration that we locate the closest pair of objects; this takes O(n^2) time, so in most cases the algorithm requires O(n^2) time, and frequently much more.
2. What was done previously can never be undone.
OTHER DATA MINING PROBLEMS

Discovery of Sequential Patterns
The discovery of sequential patterns is based on the concept of a sequence of itemsets. We assume that transactions are ordered by time of purchase; that ordering yields a sequence of itemsets. For example, {milk, bread, juice}, {bread, eggs}, {cookies, milk, coffee} may be such a sequence of itemsets based on three visits of the same customer to the store.

The support for a sequence S of itemsets is the percentage of sequences in the given set U of which S is a subsequence. In this example, {milk, bread, juice}, {bread, eggs} and {bread, eggs}, {cookies, milk, coffee} are considered subsequences of the sequence above.
The problem of identifying sequential patterns, then, is to find all subsequences of the given sets of sequences that have a user-defined minimum support. The sequence S1, S2, S3, ... is a predictor of the fact that a customer who buys itemset S1 is likely to buy itemset S2, then S3, and so on. This prediction is based on the frequency (support) of this sequence in the past. Various algorithms have been investigated for sequence detection.
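A minimal sketch of subsequence containment and support (the representation of a sequence as a list of sets, and the function names, are my own, for illustration):

```python
def is_subsequence(s, seq):
    """True if the itemsets of s appear in order (gaps allowed) in seq,
    each contained in a corresponding itemset of seq."""
    pos = 0
    for itemset in s:
        while pos < len(seq) and not itemset <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def support(s, sequences):
    """Percentage of sequences in the given set of which s is a subsequence."""
    return 100.0 * sum(is_subsequence(s, seq) for seq in sequences) / len(sequences)

customer_sequences = [
    [{"milk", "bread", "juice"}, {"bread", "eggs"}, {"cookies", "milk", "coffee"}],
    [{"bread", "eggs"}, {"milk"}],
]
s = [{"milk", "bread"}, {"bread", "eggs"}]
# s is a subsequence of the first customer sequence only, so support = 50.0
```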
Mining Time Series

Discovering Patterns in Time Series
A time series database consists of sequences of values or events changing with time. The values are typically measured at equal time intervals. Time series databases are popular in many applications, such as studying daily fluctuations of a stock market, traces of scientific experiments, medical treatments, and so on. A time series can be illustrated as a time-series graph which describes a point moving with the passage of time.
Time series data: stock price of IBM over time
Categories of Time-Series Movements:
- Long-term or trend movements.
- Cyclic movements or cyclic variations, e.g., business cycles.
- Seasonal movements or seasonal variations, i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years.
- Irregular or random movements.
Similarity Search in Time-Series Analysis
A normal database query finds exact matches. A similarity search finds data sequences that differ only slightly from the given query sequence.

Two categories of similarity queries:
- Whole matching: find a sequence that is similar to the query sequence.
- Subsequence matching: find all pairs of similar sequences.

Typical applications:
- Financial markets
- Transaction data analysis
- Scientific databases (e.g., power consumption analysis)
- Medical diagnosis (e.g., cardiogram analysis)
Data transformation
- For similarity analysis of time series data, Euclidean distance is typically used as the similarity measure.
- Many techniques for signal analysis require the data to be in the frequency domain. Therefore, distance-preserving transformations are often used to transform the data from the time domain to the frequency domain.
- Usually, data-independent transformations are used, where the transformation matrix is determined a priori, e.g., the discrete Fourier transform (DFT) or the discrete wavelet transform (DWT).
- The Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain.
- DFT does a good job of concentrating energy in the first few coefficients. If we keep only the first few DFT coefficients, we can compute a lower bound on the actual distance.
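Because the (unitary) DFT preserves Euclidean distance, keeping only the first few coefficients gives a lower bound on the true distance. A small illustration in plain Python (the naive O(n^2) transform and the sample sequences are my own, for demonstration only):

```python
import cmath
import math

def dft(x):
    """Naive unitary discrete Fourier transform, O(n^2).
    The 1/sqrt(n) factor makes it distance-preserving (Parseval's theorem)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                for t in range(n)) / math.sqrt(n)
            for f in range(n)]

def euclidean(a, b):
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

x = [1.0, 2.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0]    # illustrative sequences
y = [1.5, 2.5, 3.0, 3.5, 2.0, 0.5, 0.5, 1.0]

time_dist = euclidean(x, y)                      # distance in the time domain
freq_dist = euclidean(dft(x), dft(y))            # same distance in the frequency domain
lower_bound = euclidean(dft(x)[:3], dft(y)[:3])  # first 3 coefficients only
```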
Multidimensional Indexing
- Construct a multidimensional index for efficient access, using the first few Fourier coefficients.
- Use the index to retrieve the sequences that are at most a certain small distance away from the query sequence.
- Perform post-processing by computing the actual distance between sequences in the time domain and discarding any false matches.
Subsequence Matching
- Break each sequence into a set of pieces of windows with length w.
- Extract the features of the subsequence inside each window.
- Map each sequence to a "trail" in the feature space.
- Divide the trail of each sequence into "subtrails" and represent each of them with a minimum bounding rectangle. (R-trees and R*-trees have been used to store minimum bounding rectangles so as to speed up the similarity search.)
- Use a multipiece assembly algorithm to search for longer sequence matches.
Enhanced similarity search
- Allow for gaps within a sequence or differences in offsets or amplitudes.
- Normalize sequences with amplitude scaling and offset translation.
- Two subsequences are considered similar if one lies within an envelope of width ε around the other, ignoring outliers.
- Two sequences are said to be similar if they have enough non-overlapping, time-ordered pairs of similar subsequences.
- Parameters specified by a user or expert: sliding window size, width of the envelope for similarity, maximum gap, and matching fraction.
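Offset translation plus amplitude scaling together amount to z-normalization: subtract the mean, then divide by the standard deviation. A minimal sketch (assuming a sequence is a plain list of floats; the function name is mine):

```python
import math

def normalize(seq):
    """Offset translation (subtract the mean) followed by amplitude scaling
    (divide by the standard deviation), so sequences differing only in
    offset or amplitude become directly comparable."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    return [(x - mean) / std for x in seq]

a = [1.0, 2.0, 3.0, 2.0]
b = [10.0, 20.0, 30.0, 20.0]   # same shape, amplitude scaled by 10
# after normalization the two sequences coincide
na, nb = normalize(a), normalize(b)
```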
Steps for performing a similarity search
- Atomic matching: find all pairs of gap-free windows of a small length that are similar.
- Window stitching: stitch similar windows to form pairs of large similar subsequences, allowing gaps between atomic matches.
- Subsequence ordering: linearly order the subsequence matches to determine whether enough similar pieces exist.
Query Languages for Time Sequences
Time-sequence query language:
- Should be able to specify sophisticated queries such as: find all of the sequences that are similar to some sequence in class A, but not similar to any sequence in class B.
- Should be able to support various kinds of queries: range queries, all-pair queries, and nearest-neighbor queries.

Shape definition language:
- Allows users to define and query the overall shape of time sequences.
- Uses a human-readable series of sequence transitions or macros.
- Ignores the specific details. E.g., the pattern up, Up, UP can be used to describe increasing degrees of rising slopes.
- Macros: spike, valley, etc.
Prediction by Regression

What Is Prediction?
- Prediction is similar to classification:
  - First, construct a model.
  - Second, use the model to predict unknown values.
- The major method for prediction is regression:
  - Linear and multiple regression
  - Non-linear regression
- Prediction is different from classification:
  - Classification refers to predicting a categorical class label.
  - Prediction models continuous-valued functions.
Linear and Multiple Regression
Linear regression: In linear regression, data are modeled using a straight line. Linear regression is the simplest form of regression. Bivariate linear regression models a random variable Y (called a response variable) as a linear function of another random variable X (called a predictor variable), that is:

Y = a + b X

The two parameters a and b specify the line and are to be estimated from the data at hand, applying the least squares criterion to the s known values X1, X2, ..., Xs and Y1, Y2, ..., Ys:

b = sum_{i=1}^{s} (Xi - X*)(Yi - Y*) / sum_{i=1}^{s} (Xi - X*)^2
a = Y* - b X*

where X* is the average of X1, X2, ..., Xs and Y* is the average of Y1, Y2, ..., Ys.
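The least-squares formulas above translate directly into code (the helper name and sample data are mine, for illustration):

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b in Y = a + b*X,
    following the formulas in the text."""
    s = len(xs)
    x_bar = sum(xs) / s
    y_bar = sum(ys) / s
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return a, b

# Points lying exactly on Y = 2 + 3X should recover a = 2, b = 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]
a, b = fit_line(xs, ys)
```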
Multiple regression is an extension of linear regression involving more than one predictor variable. It allows the response variable Y to be modeled as a linear function of a multidimensional feature vector. An example of a multiple regression model based on two predictor attributes or variables, X1 and X2, is:

Y = b0 + b1 X1 + b2 X2

The method of least squares can be applied here to solve for b0, b1 and b2. Many nonlinear functions can be transformed into the above form.
Nonlinear Regression
If a given response variable and predictor variable have a relationship that may be modeled by a polynomial function, we can use polynomial regression. Polynomial regression can be modeled by adding polynomial terms to the basic linear model:

Y = b0 + b1 X + b2 X^2 + b3 X^3

To convert this equation to linear form, we define new variables:

X1 = X,  X2 = X^2,  X3 = X^3

The above equation can then be converted to the linear form:

Y = b0 + b1 X1 + b2 X2 + b3 X3

which is solvable by the method of least squares.
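The linearization trick can be sketched as follows: treat each power of X as its own predictor and solve the resulting multiple-regression normal equations. The function name, the use of plain Gaussian elimination, and the sample data are my own choices, for demonstration only:

```python
def polyfit_via_linearization(xs, ys, degree):
    """Fit Y = b0 + b1*X + ... + bd*X^d by treating each power of X as a
    separate predictor (X1 = X, X2 = X^2, ...) and solving the normal
    equations (A^T A) b = A^T y with Gaussian elimination."""
    # Design matrix: one row per observation, columns 1, x, x^2, ..., x^d.
    A = [[x ** p for p in range(degree + 1)] for x in xs]
    m = degree + 1
    AtA = [[sum(A[r][i] * A[r][j] for r in range(len(xs))) for j in range(m)]
           for i in range(m)]
    Aty = [sum(A[r][i] * ys[r] for r in range(len(xs))) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(AtA[r][col]))
        AtA[col], AtA[piv] = AtA[piv], AtA[col]
        Aty[col], Aty[piv] = Aty[piv], Aty[col]
        for r in range(col + 1, m):
            f = AtA[r][col] / AtA[col][col]
            for c in range(col, m):
                AtA[r][c] -= f * AtA[col][c]
            Aty[r] -= f * Aty[col]
    # Back substitution.
    b = [0.0] * m
    for i in reversed(range(m)):
        b[i] = (Aty[i] - sum(AtA[i][j] * b[j] for j in range(i + 1, m))) / AtA[i][i]
    return b

# Points generated from Y = 1 + 2X + 3X^2; the fit should recover (1, 2, 3).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
coeffs = polyfit_via_linearization(xs, ys, 2)
```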
WHY DATA MINING? — POTENTIAL APPLICATIONS
Database analysis and decision support:
- Market analysis and management: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation.
- Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis.
- Fraud detection and management.

Other applications:
- Text mining (newsgroups, email, documents) and Web analysis.
- Intelligent query answering.
Market Analysis and Management
Where are the data sources for analysis?
- Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.

Target marketing:
- Find clusters of "model" customers who share the same characteristics: interests, income level, spending habits, etc.

Determine customer purchasing patterns over time:
- e.g., the conversion of a single bank account to a joint one upon marriage.
Cross-market analysis:
- Associations/correlations between product sales.
- Prediction based on the association information.

Customer profiling:
- Data mining can tell you what types of customers buy what products (clustering or classification).

Identifying customer requirements:
- Identify the best products for different customers.
- Use prediction to find what factors will attract new customers.

Provides summary information:
- Various multidimensional summary reports.
- Statistical summary information (data central tendency and variation).
Corporate Analysis and Risk Management
Finance planning and asset evaluation:
- Cash flow analysis and prediction.
- Contingent claim analysis to evaluate assets.
- Cross-sectional and time series analysis (financial ratios, trend analysis, etc.).

Resource planning:
- Summarize and compare the resources and spending.

Competition:
- Monitor competitors and market directions.
- Group customers into classes and set a class-based pricing procedure.
- Set the pricing strategy in a highly competitive market.
Fraud Detection and Management
Applications:
- Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.

Approach:
- Use historical data to build models of fraudulent behavior, then use data mining to help identify similar instances.

Examples:
- Auto insurance: detect groups of people who stage accidents to collect on insurance.
- Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network).
- Medical insurance: detect professional patients and rings of doctors and rings of references.
Fraud Detection and Management (cont.)
Detecting inappropriate medical treatment:
- The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving AU$1M/yr).

Detecting telephone fraud:
- Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
- British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion-dollar fraud.

Retail:
- Analysts estimate that 38% of retail shrink is due to dishonest employees.
Other Applications
Sports:
- IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat.

Astronomy:
- JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.

Internet Web Surf-Aid:
- IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
Some representative data mining tools
- Acknosoft (Kate): decision trees, case-based reasoning.
- DBMiner Technology (DBMiner): OLAP analysis, associations, classification, clustering.
- IBM (Intelligent Miner): classification, association rules, predictive models.
- NCR (Management Discovery Tool): association rules.
- SAS (Enterprise Miner): decision trees, association rules, neural networks, regression, clustering.
- Silicon Graphics (MineSet): decision trees, association rules.
- Oracle (Oracle Data Mining): classification, prediction, regression, clustering, associations, feature selection, feature extraction, anomaly detection.
- Weka (http://www.cs.waikato.ac.nz/ml/weka), University of Waikato, New Zealand: written in Java; runs on Linux, Windows, and Macintosh.