					Chapter 5:     Clustering




     Searching for groups
   Clustering is unsupervised (undirected) learning.
   Unlike classification, clustering starts with no pre-classified data.
   It searches for groups, or clusters, of data points (records) that are
    similar to one another.
   Similar points may represent, for example, similar customers or products
    that are likely to behave in similar ways.
        Group similar points together
   Group points into classes using some
    distance measures.
       Within-cluster distance and between-cluster distance
   Applications:
       As a stand-alone tool to get insight into data
        distribution
       As a preprocessing step for other algorithms
An Illustration

[Figure omitted: scatter plot of data points grouped into clusters]
       Examples of Clustering Applications
   Marketing: Help marketers discover distinct
    groups in their customer bases, and then use this
    knowledge to develop targeted marketing
    programs
   Insurance: Identifying groups of motor insurance
    policy holders with some interesting
    characteristics.
   City-planning: Identifying groups of houses
    according to their house type, value, and
    geographical location
        Concepts of Clustering
   Clusters
   Different ways of representing clusters
       Division with boundaries
       Spheres
       Probabilistic: each instance I1, ..., In is given a membership
        probability for each cluster, e.g., I1 -> (0.5, 0.2, 0.3) over
        clusters 1, 2, 3
       Dendrograms
       ...
        Clustering
   Clustering quality
       Inter-cluster distance: maximized
       Intra-cluster distance: minimized
   The quality of a clustering result depends on both the similarity
    measure used by the method and its application.
   The quality of a clustering method is also measured by its ability to
    discover some or all of the hidden patterns.
   Clustering vs. classification
       Which one is more difficult? Why?
       There are a huge number of clustering techniques.
        Dissimilarity/Distance Measure
   Dissimilarity/Similarity metric: Similarity is
    expressed in terms of a distance function, which
    is typically metric: d (i, j)
   The definitions of distance functions are usually
    very different for interval-scaled, boolean,
    categorical, ordinal and ratio variables.
   Weights should be associated with different
    variables based on applications and data
    semantics.
   It is hard to define “similar enough” or “good
    enough”. The answer is typically highly subjective.
 Types of data in clustering analysis

    Interval-scaled variables
    Binary variables
    Nominal, ordinal, and ratio variables
    Variables of mixed types




        Interval-valued variables
   Continuous measurements on a roughly linear
    scale, e.g., weight, height, temperature, etc.
   Standardize data (depending on applications)
       Calculate the mean absolute deviation:
            s_f = (1/n) * (|x_{1f} - m_f| + |x_{2f} - m_f| + ... + |x_{nf} - m_f|)
        where
            m_f = (1/n) * (x_{1f} + x_{2f} + ... + x_{nf}).
       Calculate the standardized measurement (z-score), as in the sketch below:
            z_{if} = (x_{if} - m_f) / s_f
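A minimal Python sketch of this standardization (not from the slides; the function name and plain-list representation are illustrative assumptions):

```python
# Hypothetical sketch: z-score standardization of one interval-scaled
# variable using the mean absolute deviation defined above.

def mad_standardize(column):
    """column: list of numeric values of one variable across all objects."""
    n = len(column)
    m_f = sum(column) / n                              # mean of the variable
    s_f = sum(abs(x - m_f) for x in column) / n        # mean absolute deviation
    return [(x - m_f) / s_f for x in column]           # z-scores

# Example: heights
print(mad_standardize([160.0, 170.0, 180.0, 190.0]))   # [-1.5, -0.5, 0.5, 1.5]
```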
        Similarity Between Objects
   Distance: Measure the similarity or dissimilarity
    between two data objects
   Some popular ones include the Minkowski distance:
            d(i, j) = (|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + ... + |x_{ip} - x_{jp}|^q)^(1/q)
    where (x_{i1}, x_{i2}, ..., x_{ip}) and (x_{j1}, x_{j2}, ..., x_{jp}) are two p-
     dimensional data objects, and q is a positive integer
   If q = 1, d is the Manhattan distance:
            d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + ... + |x_{ip} - x_{jp}|
Similarity Between Objects (Cont.)
   If q = 2, d is the Euclidean distance:
            d(i, j) = sqrt(|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + ... + |x_{ip} - x_{jp}|^2)
       Properties
            d(i, j) >= 0
            d(i, i) = 0
            d(i, j) = d(j, i)
            d(i, j) <= d(i, k) + d(k, j)
   Also, one can use weighted distances and many other similarity/distance
    measures (a small sketch follows).
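A small Python sketch of these distances (an illustration, not from the slides); q = 1 and q = 2 recover the Manhattan and Euclidean cases:

```python
# Hypothetical sketch of the Minkowski family of distances.

def minkowski(x, y, q=2):
    """Distance between two p-dimensional points given as lists/tuples."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

p1, p2 = [1, 2, 3], [4, 6, 3]
print(minkowski(p1, p2, q=1))   # Manhattan: 3 + 4 + 0 = 7
print(minkowski(p1, p2, q=2))   # Euclidean: sqrt(9 + 16 + 0) = 5
```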
       Binary Variables
   A contingency table for binary data

                               Object j
                           1        0        sum
                      1    a        b        a + b
          Object i    0    c        d        c + d
                     sum   a + c    b + d    p
   Simple matching coefficient (invariant, if the binary variable is
    symmetric):
            d(i, j) = (b + c) / (a + b + c + d)
   Jaccard coefficient (noninvariant if the binary variable is asymmetric):
            d(i, j) = (b + c) / (a + b + c)
    Dissimilarity of Binary Variables
   Example
        Name      Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
        Jack      M        Y       N       P        N        N        N
        Mary      F        Y       N       P        N        P        N
        Jim       M        Y       P       N        N        N        N

       gender is a symmetric attribute (not used below)
       the remaining attributes are asymmetric attributes
       let the values Y and P be set to 1, and the value N be set to 0;
        then (see the sketch below):
            d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
            d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
            d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
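A Python sketch of both coefficients that reproduces the example above (the encoding of the records is an assumption based on the Y/P -> 1, N -> 0 mapping; gender is omitted):

```python
# Hypothetical sketch: simple matching and Jaccard dissimilarity for
# binary vectors, applied to the Jack/Mary/Jim records above.

def binary_dissimilarity(x, y, symmetric=False):
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    if symmetric:                        # simple matching coefficient
        return (b + c) / (a + b + c + d)
    return (b + c) / (a + b + c)         # Jaccard coefficient (asymmetric)

jack = [1, 0, 1, 0, 0, 0]   # Fever, Cough, Test-1, Test-2, Test-3, Test-4
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(binary_dissimilarity(jack, mary), 2))   # 0.33
print(round(binary_dissimilarity(jack, jim), 2))    # 0.67
print(round(binary_dissimilarity(jim, mary), 2))    # 0.75
```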
         Nominal Variables
   A generalization of the binary variable in that it
    can take more than 2 states, e.g., red, yellow,
    blue, green, etc
   Method 1: Simple matching
       m: # of matches, p: total # of variables
            d(i, j) = (p - m) / p
   Method 2: use a large number of binary variables
       create a new binary variable for each of the M nominal states
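A tiny sketch of Method 1 (illustrative names, not from the slides):

```python
# Hypothetical sketch of simple matching for nominal variables:
# d(i, j) = (p - m) / p, where m is the number of matching variables.

def nominal_dissimilarity(x, y):
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

print(nominal_dissimilarity(["red", "small", "round"],
                            ["red", "large", "round"]))   # (3 - 2) / 3 = 0.33...
```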
            Ordinal Variables
   An ordinal variable can be discrete or continuous
   Order is important, e.g., rank
   Can be treated like interval-scaled variables (f is a variable)
       replace x_{if} by its rank r_{if} in {1, ..., M_f}
       map the range of each variable onto [0, 1] by replacing the i-th
        object in the f-th variable by
            z_{if} = (r_{if} - 1) / (M_f - 1)
       compute the dissimilarity using methods for interval-scaled
        variables (see the sketch below)
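A tiny sketch of the rank mapping (illustrative only):

```python
# Hypothetical sketch: map a rank r_if in {1, ..., M_f} onto [0, 1], so the
# result can be fed to the interval-scaled distance measures above.

def ordinal_to_interval(rank, m_f):
    return (rank - 1) / (m_f - 1)

# A 3-state ordinal variable, e.g. bronze < silver < gold:
print([ordinal_to_interval(r, 3) for r in (1, 2, 3)])   # [0.0, 0.5, 1.0]
```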
         Ratio-Scaled Variables
   Ratio-scaled variable: a measurement on a
    nonlinear scale, approximately at an exponential
     scale, such as Ae^(Bt) or Ae^(-Bt), e.g., growth of a
     bacteria population.
   Methods:
       treat them like interval-scaled variables—not a good idea!
        (why?—the scale can be distorted)
       apply logarithmic transformation
                             y_{if} = log(x_{if})
       treat them as continuous ordinal data and then treat their
        ranks as interval-scaled
          Variables of Mixed Types
   A database may contain all six types of variables
        symmetric binary, asymmetric binary, nominal,
         ordinal, interval and ratio
   One may use a weighted formula to combine their effects:
            d(i, j) = ( sum_{f=1..p} delta_{ij}^(f) * d_{ij}^(f) ) / ( sum_{f=1..p} delta_{ij}^(f) )
     where delta_{ij}^(f) is an indicator weight that is 0 when variable f
     does not contribute (e.g., a missing value) and 1 otherwise, and
     d_{ij}^(f) is the contribution of variable f (see the sketch below):
        f is binary or nominal:
          d_{ij}^(f) = 0 if x_{if} = x_{jf}, or d_{ij}^(f) = 1 otherwise
        f is interval-based: use the normalized distance
        f is ordinal or ratio-scaled:
           compute the rank r_{if}, set z_{if} = (r_{if} - 1) / (M_f - 1),
           and treat z_{if} as interval-scaled
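A sketch of the weighted mixed-type formula under simplifying assumptions (the type/range metadata and function names are illustrative; ratio-scaled variables are assumed to have been log-transformed or rank-converted beforehand, as on the previous slide):

```python
# Hypothetical sketch of d(i, j) for mixed-type records. Each variable f
# contributes a dissimilarity d_ij^(f) in [0, 1]; delta_ij^(f) = 0 (the
# variable is skipped) when a value is missing.

def mixed_dissimilarity(x, y, types, ranges):
    """types[f] in {'nominal', 'interval', 'ordinal'};
    ranges[f] = (min, max) for interval variables, M_f for ordinal ones."""
    num = den = 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:           # delta_ij^(f) = 0: skip missing values
            continue
        if types[f] == 'nominal':
            d = 0.0 if a == b else 1.0
        elif types[f] == 'ordinal':          # a, b are ranks in {1, ..., M_f}
            d = abs(a - b) / (ranges[f] - 1)
        else:                                # interval: normalized distance
            lo, hi = ranges[f]
            d = abs(a - b) / (hi - lo)
        num += d                             # delta_ij^(f) = 1 for this variable
        den += 1.0
    return num / den if den else 0.0

# One nominal, one interval (range 0-100), one ordinal (M_f = 3) variable:
print(mixed_dissimilarity(['red', 30.0, 1], ['blue', 50.0, 3],
                          types=['nominal', 'interval', 'ordinal'],
                          ranges=[None, (0.0, 100.0), 3]))   # (1 + 0.2 + 1) / 3
```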
      Major Clustering Techniques
   Partitioning algorithms: Construct various
    partitions and then evaluate them by some
    criterion
   Hierarchy algorithms: Create a hierarchical
    decomposition of the set of data (or objects)
    using some criterion
   Density-based: based on connectivity and
    density functions
   Model-based: A model is hypothesized for each of the clusters, and the
    idea is to find the best fit of the data to the given models.
          Partitioning Algorithms: Basic Concept
   Partitioning method: Construct a partition of a
    database D of n objects into a set of k clusters
   Given k, find a partition into k clusters that optimizes the chosen
    partitioning criterion
       Global optimum: exhaustively enumerate all partitions
       Heuristic methods: the k-means and k-medoids algorithms
       k-means: each cluster is represented by the center (mean) of the
        cluster
       k-medoids or PAM (Partitioning Around Medoids): each cluster is
        represented by one of the objects in the cluster
    The K-Means Clustering
   Given k, the k-means algorithm is as follows:
   1) Choose k cluster centers to coincide with k
        randomly-chosen points
   2) Assign each data point to the closest cluster center
   3) Recompute the cluster centers using the current
        cluster memberships.
   4) If a convergence criterion is not met, go to 2).
Typical convergence criteria are: no (or minimal) reassignment of data
    points to new cluster centers, or minimal decrease in the squared error
                     E = sum_{i=1..k} sum_{p in C_i} |p - m_i|^2
    where p is a point and m_i is the mean of cluster C_i (a minimal sketch
    follows).
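A minimal 1-D Python sketch of steps 1)-4) (illustrative names; real data would be multi-dimensional). The call at the bottom reproduces the worked example on the next slide:

```python
# Hypothetical k-means sketch on 1-D data, following steps 1)-4) above.

def kmeans_1d(points, centers, max_iters=100):
    clusters = []
    for _ in range(max_iters):
        # Step 2: assign each point to the closest cluster center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # Step 3: recompute the cluster centers from current memberships
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # Step 4: stop when no center moves (no reassignment will follow)
        if new_centers == centers:
            break
        centers = new_centers
    error = sum((p - centers[i]) ** 2
                for i, c in enumerate(clusters) for p in c)
    return clusters, centers, error

print(kmeans_1d([1, 2, 5, 6, 7], centers=[5, 6]))
# -> ([[1, 2], [5, 6, 7]], [1.5, 6.0], 2.5)
```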
        Example
   For simplicity, 1 dimensional data and k=2.
   data: 1, 2, 5, 6, 7
   K-means:
        Randomly select 5 and 6 as initial centroids;
        => Two clusters {1,2,5} and {6,7}; meanC1=8/3,
         meanC2=6.5
        => {1,2}, {5,6,7}; meanC1=1.5, meanC2=6
        => no change.
        Aggregate dissimilarity = 0.5^2 + 0.5^2 + 1^2 + 1^2
         = 2.5

         Comments on K-Means
   Strength: efficient: O(tkn), where n is # data points, k is
    # clusters, and t is # iterations. Normally, k, t << n.
   Comment: Often terminates at a local optimum. The
    global optimum may be found using techniques such as:
    deterministic annealing and genetic algorithms
   Weaknesses
       Applicable only when mean is defined, difficult for categorical data
       Need to specify k, the number of clusters, in advance
       Sensitive to noisy data and outliers
       Not suitable to discover clusters with non-convex shapes
       Sensitive to initial seeds

    Variations of the K-Means Method
   A few variants of the k-means method differ in
       Selection of the initial k seeds
       Dissimilarity measures
       Strategies to calculate cluster means
   Handling categorical data: k-modes
       Replacing means of clusters with modes
       Using new dissimilarity measures to deal with
        categorical objects
       Using a frequency based method to update modes of
        clusters
        k-Medoids clustering method
   The k-means algorithm is sensitive to outliers,
        since an object with an extremely large value may
         substantially distort the distribution of the data.
   Medoid – the most centrally located point in a
    cluster, as a representative point of the cluster.
   An example
        [Figure omitted: example clusters with their initial medoids marked]
   In contrast, a centroid is not necessarily inside a
    cluster.
         Partition Around Medoids
   PAM:
    1.    Given k
    2.    Randomly pick k instances as initial medoids
    3.    Assign each data point to the nearest medoid x
    4.    Calculate the objective function
              the sum of dissimilarities of all points to their
               nearest medoids. (squared-error criterion)
    5.    Randomly select a non-medoid point y
    6.    Swap x with y if the swap reduces the objective
          function
    7.    Repeat (3-6) until no change (see the sketch below)
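A simplified Python sketch of this loop (illustrative; the random-swap search here is a simplification of step 7 — classical PAM examines all medoid/non-medoid swaps each pass):

```python
import random

# Hypothetical PAM sketch: k medoids are improved by randomly proposed swaps
# that reduce the total dissimilarity of points to their nearest medoid.

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def total_cost(points, medoids):
    # Objective: sum of dissimilarities of all points to their nearest medoid
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, max_tries=1000, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)                   # step 2: initial medoids
    best = total_cost(points, medoids)
    for _ in range(max_tries):                        # steps 3-6, repeated
        x = rng.choice(medoids)                       # a current medoid
        y = rng.choice([p for p in points if p not in medoids])
        candidate = [y if m == x else m for m in medoids]
        cost = total_cost(points, candidate)
        if cost < best:                               # keep the swap if it helps
            medoids, best = candidate, cost
    return medoids, best

data = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9), (25, 25)]  # last point: an outlier
print(pam(data, k=2))
```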
         Comments on PAM
                                               [Figure omitted: a cluster with one outlier 100 units away]
   PAM is more robust than k-means in the
    presence of noise and outliers, because a
    medoid is less influenced by outliers or
    other extreme values than a mean
    (why?)
   PAM works well for small data sets but
    does not scale well to large data sets:
       O(k(n - k)^2) per iteration,
    where n is the # of data points and k is the # of clusters
CLARA: Clustering Large Applications
   CLARA is built into statistical analysis packages, such
    as S+
   It draws multiple samples of the data set, applies
    PAM on each sample, and gives the best
    clustering as the output
   Strength: deals with larger data sets than PAM
   Weakness:
       Efficiency depends on the sample size
       A good clustering based on samples will not
        necessarily represent a good clustering of the whole
        data set if the sample is biased
   There are other scale-up methods, e.g., CLARANS (a CLARA-style sketch
    follows)
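A CLARA-style sketch built on the pam() and total_cost() functions sketched after the PAM slide (sample sizes and counts are illustrative assumptions):

```python
import random

# Hypothetical CLARA sketch: run PAM on several random samples and keep the
# medoid set that scores best on the FULL data set.

def clara(points, k, num_samples=5, sample_size=40, seed=0):
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids, _ = pam(sample, k)               # cluster the sample only
        cost = total_cost(points, medoids)        # ...but score on all points
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```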
       Hierarchical Clustering
   Hierarchical clustering uses a distance matrix. It does not require the
    number of clusters k as an input, but it needs a termination condition

        Agglomerative (Step 0 -> Step 4):
            a, b, c, d, e  ->  {a,b}, c, d, e  ->  {a,b}, c, {d,e}
            ->  {a,b}, {c,d,e}  ->  {a,b,c,d,e}
        Divisive (Step 0 -> Step 4): the same splits in the reverse order,
            starting from {a,b,c,d,e}
                     Agglomerative Clustering
         At the beginning, each data point forms its own cluster
         (also called a node).
         Merge the nodes/clusters that have the least
         dissimilarity.
         Keep merging.
         Eventually all nodes belong to the same cluster (a sketch follows
         the figure placeholder).
[Figure omitted: three scatter plots (axes 0-10) showing the points being merged into progressively larger clusters]
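A minimal agglomerative sketch using single linkage (one possible choice of cluster-to-cluster dissimilarity; names and data are illustrative):

```python
# Hypothetical agglomerative sketch: start with one cluster per point and
# repeatedly merge the pair of clusters with the least (single-link)
# dissimilarity until k clusters remain.

def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(c1, c2):
    return min(euclid(p, q) for p in c1 for q in c2)

def agglomerative(points, k=1):
    clusters = [[p] for p in points]          # every point starts as a cluster
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(agglomerative(data, k=2))   # the two well-separated groups
```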
   A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose the data objects into several levels of nested partitionings
(a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at
the desired level; each connected component then forms a cluster (as
sketched below).
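If SciPy is available, the same idea can be expressed with its hierarchical-clustering routines; the sketch below (an assumption about the environment, not part of the slides) builds the dendrogram and cuts it into two flat clusters:

```python
# Hypothetical sketch using SciPy: build the merge tree and cut it.
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
Z = linkage(data, method='single')                 # agglomerative merges

labels = fcluster(Z, t=2, criterion='maxclust')    # cut: keep 2 clusters
print(labels)                                      # e.g. [1 1 1 2 2 2]

# dendrogram(Z) draws the tree itself (needs matplotlib):
# import matplotlib.pyplot as plt; dendrogram(Z); plt.show()
```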
                      Divisive Clustering

            Inverse order of agglomerative clustering
            Eventually each node forms a cluster on its own

[Figure omitted: three scatter plots (axes 0-10) showing one cluster being split into progressively smaller clusters]
    More on Hierarchical Methods
   Major weaknesses of agglomerative clustering
    methods
       do not scale well: time complexity is at least O(n^2), where
        n is the total number of objects
       can never undo what was done previously
   Integration of hierarchical and distance-based
    clustering to scale up these clustering methods
       BIRCH (1996): uses a CF-tree and incrementally adjusts
        the quality of sub-clusters
       CURE (1998): selects well-scattered points from the
        cluster and then shrinks them towards the center of the
        cluster by a specified fraction
        Summary
   Cluster analysis groups objects based on their
    similarity and has wide applications
   Measure of similarity can be computed for various
    types of data
   Clustering algorithms can be categorized into
    partitioning methods, hierarchical methods,
    density-based methods, etc.
   Clustering can also be used for outlier detection,
    which is useful for fraud detection
   What is the best clustering algorithm?
Other Data Mining Methods




     Sequence analysis
   Market basket analysis analyzes things that
    happen at the same time.
   What about things that happen over time?
    E.g., if a customer buys a bed, he/she is likely
      to come back later to buy a mattress.
   Sequential analysis needs
       a time stamp for each data record
       customer identification
        Sequence analysis (cont.)

   The analysis shows which items come before, after,
    or at the same time as other items.
   Sequential patterns can be used for analyzing
    cause and effect.
Other applications
   Finding cycles in association rules
       Some association rules hold strongly in certain periods
        of time
        E.g., every Monday people buy items X and Y together
   Stock market prediction
   Predicting possible failures in networks, etc.
     Discovering holes in data
   Holes are empty (sparse) regions in the data
    space that contain few or no data points. Holes
    may represent impossible value combinations in
    the application domain.
   E.g., in a disease database, we may find that
    certain test values and/or symptoms do not go
    together, or that when a certain medicine is used,
    some test values never go beyond a certain range.
   Such information could lead to a significant
    discovery: a cure for a disease or some biological
    law.
    Data and pattern visualization
   Data visualization: Use computer graphics
    effects to reveal the patterns in data:
    2-D and 3-D scatter plots, bar charts, pie charts,
      line plots, animation, etc.
   Pattern visualization: Use good interfaces
    and graphics to present the results of
    data mining:
     rule visualizers, cluster visualizers, etc.
     Scaling up data mining algorithms
   Adapt data mining algorithms to work on
    very large databases.
       Data reside on hard disk (too large to fit in
        main memory)
       Make fewer passes over the data
   Quadratic algorithms are too expensive
       Many data mining algorithms are quadratic,
        especially clustering algorithms.

				