# Data Mining Process and Techniques

## Chapter 5: Clustering

## Searching for groups

- Clustering is unsupervised or undirected.
- Unlike classification, clustering starts with no pre-classified data.
- It searches for groups, or clusters, of data points (records) that are similar to one another.
- Similar points may mean similar customers or products that will behave in similar ways.

## Group similar points together

- Group points into classes using some distance measure.
- Two quantities matter: the within-cluster distance and the between-cluster distance.
- Applications:
  - As a stand-alone tool to gain insight into the data distribution
  - As a preprocessing step for other algorithms

## An Illustration

(Figure: example data points grouped into clusters.)

## Examples of Clustering Applications

- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
- Insurance: identify groups of motor insurance policy holders with some interesting characteristics.
- City planning: identify groups of houses according to their house type, value, and geographical location.

## Concepts of Clustering

- Clusters
- Different ways of representing clusters:
  - Division with boundaries
  - Spheres
  - Probabilistic: each object I1, ..., In is assigned a membership probability for each cluster (e.g., 0.5, 0.2, 0.3 for object I1 over clusters 1, 2, 3)
  - Dendrograms

## Clustering

- Clustering quality:
  - Inter-cluster distance: maximized
  - Intra-cluster distance: minimized
- The quality of a clustering result depends on both the similarity measure used by the method and its application.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
- Clustering vs. classification: which one is more difficult? Why?
- There are a huge number of clustering techniques.

## Dissimilarity/Distance Measure

- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.
- Weights should be associated with different variables based on the application and the data semantics.
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

## Types of data in clustering analysis

- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types

## Interval-valued variables

- Continuous measurements on a roughly linear scale, e.g., weight, height, temperature, etc.
- Standardize the data (depending on the application):
  - Calculate the mean absolute deviation

    $$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$$

    where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \dots + x_{nf})$.

  - Calculate the standardized measurement (z-score)

    $$z_{if} = \frac{x_{if} - m_f}{s_f}$$

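As a quick illustration of the standardization step above, here is a small Python sketch; the function name and the example values are mine, not from the slides.

```python
def standardize(values):
    """Standardize one interval-scaled variable using the mean absolute deviation."""
    n = len(values)
    m = sum(values) / n                              # mean m_f
    s = sum(abs(x - m) for x in values) / n          # mean absolute deviation s_f
    return [(x - m) / s for x in values]             # z-scores z_if

# Hypothetical example: weights (kg) of five customers.
print(standardize([55.0, 60.0, 72.0, 80.0, 93.0]))
```
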
## Similarity Between Objects

- Distance: measures the similarity or dissimilarity between two data objects.
- A popular family is the Minkowski distance:

  $$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q}$$

  where $(x_{i1}, x_{i2}, \dots, x_{ip})$ and $(x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects and q is a positive integer.
- If q = 1, d is the Manhattan distance:

  $$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$

## Similarity Between Objects (cont.)

- If q = 2, d is the Euclidean distance:

  $$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$

- Properties:
  - d(i, j) ≥ 0
  - d(i, i) = 0
  - d(i, j) = d(j, i)
  - d(i, j) ≤ d(i, k) + d(k, j)
- One can also use weighted distances and many other similarity/distance measures.

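The Minkowski family is easy to compute directly. A minimal Python sketch (the function name is mine) that covers the Manhattan (q = 1) and Euclidean (q = 2) special cases:

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points x and y."""
    assert len(x) == len(y)
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

p1, p2 = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(p1, p2, q=1))   # Manhattan: 3 + 4 + 0 = 7.0
print(minkowski(p1, p2, q=2))   # Euclidean: sqrt(9 + 16) = 5.0
```
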
## Binary Variables

- A contingency table for binary data (object i vs. object j):

  |       | j = 1 | j = 0 | sum   |
  |-------|-------|-------|-------|
  | i = 1 | a     | b     | a + b |
  | i = 0 | c     | d     | c + d |
  | sum   | a + c | b + d | p     |

- Simple matching coefficient (invariant if the binary variable is symmetric):

  $$d(i, j) = \frac{b + c}{a + b + c + d}$$

- Jaccard coefficient (noninvariant if the binary variable is asymmetric):

  $$d(i, j) = \frac{b + c}{a + b + c}$$

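A small sketch of the two coefficients, deriving the contingency counts a, b, c, d from two 0/1 vectors (the function names are mine):

```python
def binary_counts(x, y):
    """Contingency counts for two equal-length binary (0/1) vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching(x, y):
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)

def jaccard(x, y):
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c)

print(simple_matching([1, 0, 1, 0], [1, 1, 0, 0]))  # 2/4 = 0.5
print(jaccard([1, 0, 1, 0], [1, 1, 0, 0]))          # 2/3 ≈ 0.67
```
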
## Dissimilarity of Binary Variables

- Example:

  | Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4 |
  |------|--------|-------|-------|--------|--------|--------|--------|
  | Jack | M      | Y     | N     | P      | N      | N      | N      |
  | Mary | F      | Y     | N     | P      | N      | P      | N      |
  | Jim  | M      | Y     | P     | N      | N      | N      | N      |

- Gender is a symmetric attribute (not used below).
- The remaining attributes are asymmetric attributes.
- Let the values Y and P be set to 1 and the value N be set to 0. Using the Jaccard coefficient:

  $$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$

  $$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$

  $$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$

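The three values above can be checked with a short self-contained sketch that encodes Y and P as 1 and N as 0 for the six asymmetric attributes:

```python
# Attribute order: Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Y/P -> 1, N -> 0)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

def jaccard_dissimilarity(x, y):
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75
```
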
## Nominal Variables

- A generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green, etc.
- Method 1: simple matching
  - m: number of matches, p: total number of variables

    $$d(i, j) = \frac{p - m}{p}$$

- Method 2: use a large number of binary variables
  - Create a new binary variable for each of the M nominal states.

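A tiny sketch of Method 1 (the attribute values here are made-up examples):

```python
def nominal_dissimilarity(x, y):
    """Simple matching dissimilarity d(i, j) = (p - m) / p for nominal vectors."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)   # number of matches
    return (p - m) / p

print(nominal_dissimilarity(["red", "small", "round"],
                            ["red", "large", "round"]))  # 1/3 ≈ 0.33
```
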
## Ordinal Variables

- An ordinal variable can be discrete or continuous.
- Order is important, e.g., rank.
- It can be treated like an interval-scaled variable (f denotes a variable):
  - Replace $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$.
  - Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

    $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

  - Compute the dissimilarity using methods for interval-scaled variables.

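A minimal sketch of the rank-to-[0, 1] mapping, assuming the ordered list of states for the variable is known (the example scale is hypothetical):

```python
def ordinal_to_interval(value, ordered_states):
    """Map an ordinal value onto [0, 1] via z = (r - 1) / (M - 1)."""
    r = ordered_states.index(value) + 1          # rank r_if in {1, ..., M_f}
    M = len(ordered_states)
    return (r - 1) / (M - 1)

grades = ["fail", "pass", "good", "excellent"]
print([ordinal_to_interval(g, grades) for g in ["pass", "excellent"]])  # [0.33..., 1.0]
```
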
## Ratio-Scaled Variables

- Ratio-scaled variable: a measurement on a nonlinear, approximately exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$, e.g., the growth of a bacteria population.
- Methods:
  - Treat them like interval-scaled variables (not a good idea! Why? The scale can be distorted.)
  - Apply a logarithmic transformation: $y_{if} = \log(x_{if})$.
  - Treat them as continuous ordinal data and then treat their ranks as interval-scaled.

## Variables of Mixed Types

- A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.
- One may use a weighted formula to combine their effects (see the sketch after this list):

  $$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

- If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and $d_{ij}^{(f)} = 1$ otherwise.
- If f is interval-based: use the normalized distance.
- If f is ordinal or ratio-scaled: compute the ranks $r_{if}$, set

  $$z_{if} = \frac{r_{if} - 1}{M_f - 1},$$

  and treat $z_{if}$ as interval-scaled.

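A rough sketch of the weighted mixed-type formula above; the type labels and function name are mine, and interval variables are assumed to be normalized to [0, 1] beforehand:

```python
def mixed_dissimilarity(x, y, types):
    """Weighted dissimilarity over variables of mixed types.

    x, y  : attribute vectors for objects i and j
    types : 'nominal', 'asym_binary', or 'interval' for each attribute
            (interval values are assumed to be normalized to [0, 1])
    """
    num, den = 0.0, 0.0
    for xi, yi, t in zip(x, y, types):
        if xi is None or yi is None:
            continue                      # delta_ij^(f) = 0: missing value
        if t == "asym_binary" and xi == 0 and yi == 0:
            continue                      # delta_ij^(f) = 0: ignore 0-0 matches
        if t in ("nominal", "asym_binary"):
            d = 0.0 if xi == yi else 1.0  # d_ij^(f) for binary/nominal
        else:
            d = abs(xi - yi)              # normalized interval distance
        num += d
        den += 1.0
    return num / den if den else 0.0

print(mixed_dissimilarity(["red", 1, 0.2], ["blue", 0, 0.5],
                          ["nominal", "asym_binary", "interval"]))
```
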
## Major Clustering Techniques

- Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
- Density-based: based on connectivity and density functions.
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given models.

## Partitioning Algorithms: Basic Concept

- Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
- Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.
  - Global optimum: exhaustively enumerate all partitions.
  - Heuristic methods: the k-means and k-medoids algorithms.
    - k-means: each cluster is represented by the center of the cluster.
    - k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster.

## The K-Means Clustering

- Given k, the k-means algorithm is as follows:
  1. Choose k cluster centers to coincide with k randomly chosen points.
  2. Assign each data point to the closest cluster center.
  3. Recompute the cluster centers using the current cluster memberships.
  4. If a convergence criterion is not met, go to 2. Typical convergence criteria are: no (or minimal) reassignment of data points to new cluster centers, or a minimal decrease in the squared error

     $$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$

     where p is a point and $m_i$ is the mean of cluster $C_i$.

## Example

- For simplicity, one-dimensional data and k = 2.
- Data: 1, 2, 5, 6, 7
- K-means (a runnable sketch follows below):
  - Randomly select 5 and 6 as the initial centroids;
  - => two clusters {1, 2, 5} and {6, 7}; mean(C1) = 8/3, mean(C2) = 6.5
  - => {1, 2}, {5, 6, 7}; mean(C1) = 1.5, mean(C2) = 6
  - => no change.
  - Aggregate dissimilarity = 0.5² + 0.5² + 1² + 1² = 2.5

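A minimal sketch of the k-means loop from the previous slide, run on the 1-D example above (pure Python, 1-D points only; the names are mine):

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain k-means for 1-D data: assign, recompute means, repeat."""
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Step 2: assign each point to the closest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        # Step 3: recompute centers from the current memberships.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:        # Step 4: convergence check
            break
        centers = new_centers
    sse = sum((p - centers[i]) ** 2
              for i, c in enumerate(clusters) for p in c)
    return clusters, centers, sse

print(kmeans_1d([1, 2, 5, 6, 7], centers=[5, 6]))
# ([[1, 2], [5, 6, 7]], [1.5, 6.0], 2.5)
```
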
## Strengths and Weaknesses of K-Means

- Strength: efficient: O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
- Comment: it often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
- Weaknesses:
  - Applicable only when a mean is defined; difficult for categorical data
  - Need to specify k, the number of clusters, in advance
  - Sensitive to noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes
  - Sensitive to the initial seeds

## Variations of the K-Means Method

- A few variants of k-means differ in:
  - Selection of the initial k seeds
  - Dissimilarity measures
  - Strategies to calculate cluster means
- Handling categorical data: k-modes
  - Replacing the means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update the modes of clusters

## k-Medoids clustering method

- The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
- Medoid: the most centrally located point in a cluster, used as a representative point of the cluster.
- (Figure: an example data set with the initial medoids marked.)
- In contrast, a centroid is not necessarily inside a cluster.

## Partitioning Around Medoids (PAM)

- PAM:
  1. Given k.
  2. Randomly pick k instances as the initial medoids.
  3. Assign each data point to the nearest medoid x.
  4. Calculate the objective function: the sum of dissimilarities of all points to their nearest medoids (a squared-error criterion).
  5. Randomly select a point y.
  6. Swap x with y if the swap reduces the objective function.
  7. Repeat steps 3-6 until no change.

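A rough, simplified sketch of the PAM loop above. It uses a fixed number of random swap trials instead of the "repeat until no change" rule, and all names are mine:

```python
import random

def pam(points, k, dist, n_trials=200, seed=0):
    """Simplified PAM: keep a random medoid swap only if it reduces the
    total dissimilarity of all points to their nearest medoids."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)                    # step 2: initial medoids

    def cost(meds):                                    # steps 3-4: objective function
        return sum(min(dist(p, m) for m in meds) for p in points)

    best = cost(medoids)
    for _ in range(n_trials):                          # steps 5-7: try random swaps
        x = rng.choice(medoids)
        y = rng.choice([p for p in points if p not in medoids])
        candidate = [y if m == x else m for m in medoids]
        c = cost(candidate)
        if c < best:                                   # keep the swap only if it helps
            medoids, best = candidate, c
    return medoids, best

data = [1.0, 2.0, 5.0, 6.0, 7.0, 100.0]                # note the outlier at 100
print(pam(data, k=2, dist=lambda a, b: abs(a - b)))
```
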
## PAM: Strengths and Weaknesses

- (Figure: a cluster with an outlier 100 units away from the other points.)
- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean (why?).
- PAM works well for small data sets but does not scale well to large data sets: O(k(n - k)²) for each change, where n is the number of data points and k is the number of clusters.

## CLARA: Clustering Large Applications

- CLARA is built into statistical analysis packages, such as S+.
- It draws multiple samples of the data set, applies PAM to each sample, and returns the best clustering as the output.
- Strength: deals with larger data sets than PAM.
- Weaknesses:
  - Efficiency depends on the sample size.
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
- There are other scale-up methods, e.g., CLARANS.

## Hierarchical Clustering

- Uses a distance matrix for clustering. This method does not require the number of clusters k as an input, but it needs a termination condition.
- Agglomerative (bottom-up, step 0 to step 4): start from the singletons a, b, c, d, e; merge a and b into {a, b} and d and e into {d, e}; then merge c with {d, e} into {c, d, e}; finally merge everything into {a, b, c, d, e}.
- Divisive (top-down, step 4 back to step 0): the reverse process, splitting {a, b, c, d, e} until each point is its own cluster.

## Agglomerative Clustering

- At the beginning, each data point forms its own cluster (also called a node).
- Merge the nodes/clusters that have the least dissimilarity.
- Go on merging.
- Eventually all nodes belong to the same cluster.
- (Figure: three scatter plots on a 0-10 grid showing the points being merged into larger and larger clusters.)

## A Dendrogram Shows How the Clusters are Merged Hierarchically

- Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

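For reference, a short sketch of agglomerative clustering and dendrogram cutting using SciPy (this assumes numpy and scipy are available; they are not mentioned on the slides): `linkage` builds the merge tree and `fcluster` cuts it at the desired number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five 2-D points forming two visually separate groups.
X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
              [8.0, 8.0], [8.5, 8.2]])

# Agglomerative clustering: repeatedly merge the closest clusters.
Z = linkage(X, method="single")

# Cut the dendrogram so that two connected components (clusters) remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g., [1 1 1 2 2]
```
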
## Divisive Clustering

- The inverse order of agglomerative clustering: start with all points in one cluster and keep splitting.
- Eventually each node forms a cluster on its own.
- (Figure: three scatter plots on a 0-10 grid showing one cluster being split into smaller and smaller clusters.)

## More on Hierarchical Methods

- Major weaknesses of agglomerative clustering methods:
  - They do not scale well: the time complexity is at least O(n²), where n is the total number of objects.
  - They can never undo what was done previously.
- Hierarchical clustering has been integrated with distance-based clustering to scale up these methods:
  - BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters.
  - CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction.

## Summary

- Cluster analysis groups objects based on their similarity and has wide applications.
- Measures of similarity can be computed for various types of data.
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc.
- Clustering can also be used for outlier detection, which is useful for fraud detection.
- What is the best clustering algorithm?

## Other Data Mining Methods

## Sequence analysis

- Market basket analysis analyzes things that happen at the same time.
- What about things that happen over time? E.g., if a customer buys a bed, he/she is likely to come back later to buy a mattress.
- Sequential analysis needs:
  - A time stamp for each data record
  - Customer identification

## Sequence analysis (cont.)

- The analysis shows which items come before, after, or at the same time as other items.
- Sequential patterns can be used for analyzing cause and effect.
- Other applications:
  - Finding cycles in association rules: some association rules hold strongly in certain periods of time, e.g., every Monday people buy items X and Y together.
  - Stock market prediction
  - Predicting possible failures in networks, etc.

## Discovering holes in data

- Holes are empty (sparse) regions in the data space that contain few or no data points. Holes may represent impossible value combinations in the application domain.
- E.g., in a disease database, we may find that certain test values and/or symptoms do not go together, or that when a certain medicine is used, some test value never goes beyond a certain range.
- Such information could lead to a significant discovery: a cure for a disease or some biological law.

## Data and pattern visualization

- Data visualization: use computer graphics to reveal the patterns in data: 2-D and 3-D scatter plots, bar charts, pie charts, line plots, animation, etc.
- Pattern visualization: use good interfaces and graphics to present the results of data mining: rule visualizers, cluster visualizers, etc.

## Scaling up data mining algorithms

- Adapt data mining algorithms to work on very large databases.
- The data reside on hard disk (too large to fit in main memory), so make fewer passes over the data.
- Quadratic algorithms are too expensive, and many data mining algorithms, especially clustering algorithms, are quadratic.
