Data Mining: Concepts and Techniques

Clustering




                Chapter 8. Cluster Analysis

         What is Cluster Analysis?
         Types of Data in Cluster Analysis
         A Categorization of Major Clustering Methods
         Partitioning Methods
         Hierarchical Methods
         Density-Based Methods
         Grid-Based Methods
         Outlier Analysis
         Summary


       What is Cluster Analysis?
   Cluster: a collection of data objects
      Similar to one another within the same cluster

      Dissimilar to the objects in other clusters

   Cluster analysis
      Grouping a set of data objects into clusters

   Clustering is unsupervised classification: no
    predefined classes
   Typical applications
      As a stand-alone tool to get insight into data
       distribution
      As a preprocessing step for other algorithms
                General Applications of Clustering

      Pattern Recognition
      Spatial Data Analysis
         create thematic maps in GIS by clustering feature
          spaces
         detect spatial clusters and explain them in spatial data
          mining
      Image Processing
      Economic Science (especially market research)
      WWW
         Document classification

         Cluster Weblog data to discover groups of similar
          access patterns

                Examples of Clustering Applications
    -  Marketing: Help marketers discover distinct groups in their
       customer bases, and then use this knowledge to develop
       targeted marketing programs
    -  Land use: Identification of areas of similar land use in an
       earth observation database
    -  Insurance: Identifying groups of motor insurance policy
       holders with a high average claim cost
    -  City-planning: Identifying groups of houses according to
       their house type, value, and geographical location
    -  Earthquake studies: Observed earthquake epicenters should
       be clustered along continental faults
                What Is Good Clustering?

       A good clustering method will produce high quality
        clusters with
               high intra-class similarity
               low inter-class similarity
       The quality of a clustering result depends on both the
        similarity measure used by the method and its
        implementation.
       The quality of a clustering method is also measured by
        its ability to discover some or all of the hidden patterns.
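To make the first two criteria concrete, here is a minimal sketch (illustrative Python, not part of the original slides; the helper names are made up) that scores a labelled clustering by its average intra-cluster and inter-cluster distances; a good clustering keeps the first value small and the second large.

    # Illustrative quality check: average intra-cluster vs. inter-cluster distance.
    from itertools import combinations
    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def intra_inter_distances(points, labels):
        """Return (average intra-cluster distance, average inter-cluster distance)."""
        intra, inter = [], []
        for (i, p), (j, q) in combinations(enumerate(points), 2):
            (intra if labels[i] == labels[j] else inter).append(euclidean(p, q))
        return sum(intra) / len(intra), sum(inter) / len(inter)

    points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
    labels = [0, 0, 0, 1, 1, 1]
    print(intra_inter_distances(points, labels))  # small intra, large inter: good clustering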

                Requirements of Clustering in Data
                Mining
           Scalability
           Ability to deal with different types of attributes
           Discovery of clusters with arbitrary shape
           Minimal requirements for domain knowledge to
            determine input parameters
           Able to deal with noise and outliers
           Insensitive to order of input records
           High dimensionality
           Incorporation of user-specified constraints
           Interpretability and usability

                Data Structures

    -  Data matrix (two modes): n objects described by p variables

           | x_11  ...  x_1f  ...  x_1p |
           | ...   ...  ...   ...  ...  |
           | x_i1  ...  x_if  ...  x_ip |
           | ...   ...  ...   ...  ...  |
           | x_n1  ...  x_nf  ...  x_np |

    -  Dissimilarity matrix (one mode): pairwise distances d(i, j)

           | 0                                |
           | d(2,1)   0                       |
           | d(3,1)   d(3,2)   0              |
           |   :        :      :              |
           | d(n,1)   d(n,2)   ...   ...   0  |
                Measure the Quality of Clustering

     Dissimilarity/Similarity metric: Similarity is expressed in
      terms of a distance function, which is typically metric:
            d(i, j)
     There is a separate “quality” function that measures the
      “goodness” of a cluster.
     The definitions of distance functions are usually very
      different for interval-scaled, boolean, categorical, ordinal
      and ratio variables.
     Weights should be associated with different variables
      based on applications and data semantics.
     It is hard to define “similar enough” or “good enough”
          the answer is typically highly subjective.
                Types of data in cluster analysis



           Interval-scaled variables
           Binary variables
           Nominal, ordinal, and ratio variables
           Variables of mixed types




                Interval-valued variables

    Standardize data
       -  Calculate the mean absolute deviation:

              s_f = (1/n) (|x_{1f} - m_f| + |x_{2f} - m_f| + ... + |x_{nf} - m_f|)

          where  m_f = (1/n) (x_{1f} + x_{2f} + ... + x_{nf}).

       -  Calculate the standardized measurement (z-score):

              z_{if} = (x_{if} - m_f) / s_f

    Using the mean absolute deviation is more robust than using the
    standard deviation
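As a quick sketch of this standardization for a single variable f (illustrative Python, with an assumed helper name):

    # Standardize one variable f using the mean absolute deviation.
    def standardize(values):
        n = len(values)
        m_f = sum(values) / n                        # mean m_f of variable f
        s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation s_f
        return [(x - m_f) / s_f for x in values]     # z-scores z_if

    print(standardize([20.0, 22.0, 25.0, 30.0, 100.0]))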

                Similarity and Dissimilarity Between
                Objects

    -  Distances are normally used to measure the similarity or
       dissimilarity between two data objects
    -  Some popular ones include the Minkowski distance:

           d(i, j) = ( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + ... + |x_{ip} - x_{jp}|^q )^{1/q}

       where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are
       two p-dimensional data objects, and q is a positive integer
    -  If q = 1, d is the Manhattan distance:

           d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + ... + |x_{ip} - x_{jp}|

                Similarity and Dissimilarity Between
                Objects (Cont.)

    -  If q = 2, d is the Euclidean distance:

           d(i, j) = sqrt( |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + ... + |x_{ip} - x_{jp}|^2 )

       Properties
          -  d(i, j) >= 0
          -  d(i, i) = 0
          -  d(i, j) = d(j, i)
          -  d(i, j) <= d(i, k) + d(k, j)
    -  Also one can use weighted distance, parametric Pearson
       product moment correlation, or other dissimilarity
       measures.
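The Minkowski family can be computed with a single helper; the sketch below (an assumed example in plain Python, not from the slides) recovers the Manhattan and Euclidean distances as the q = 1 and q = 2 cases.

    # Minkowski distance; q = 1 gives Manhattan, q = 2 gives Euclidean.
    def minkowski(i, j, q):
        return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

    x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
    print(minkowski(x, y, 1))   # Manhattan distance: 7.0
    print(minkowski(x, y, 2))   # Euclidean distance: 5.0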

                Binary Variables
    -  A contingency table for binary data:

                              Object j
                            1      0     sum
            Object i   1    a      b     a+b
                       0    c      d     c+d
                      sum  a+c    b+d     p

    -  Simple matching coefficient (if the binary variable is symmetric):

           d(i, j) = (b + c) / (a + b + c + d)

    -  Jaccard coefficient (if the binary variable is asymmetric):

           d(i, j) = (b + c) / (a + b + c)
                Dissimilarity Between Binary
                Variables: Example
                Name   Gender   Fever   Cough      Test-1      Test-2     Test-3   Test-4
                Jack   M        Y       N          P           N          N        N
                Mary   F        Y       N          P           N          P        N
                Jim    M        Y       P          N           N          N        N
    -  Gender is a symmetric attribute; the remaining attributes are
       asymmetric
    -  Let the values Y and P be set to 1, and the value N be set to 0
    -  Considering only the asymmetric attributes:

           d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
           d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
           d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
                Nominal Variables

    -  A generalization of the binary variable in that it can take
       more than 2 states, e.g., red, yellow, blue, green
    -  Method 1: Simple matching
          m: # of matches, p: total # of variables

              d(i, j) = (p - m) / p

    -  Method 2: use a large number of binary variables
          creating a new binary variable for each of the M
          nominal states
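A one-function sketch of Method 1 (illustrative Python with made-up example values):

    # Simple matching dissimilarity for nominal variables: d(i, j) = (p - m) / p.
    def nominal_dissimilarity(i, j):
        p = len(i)                                   # total number of variables
        m = sum(1 for a, b in zip(i, j) if a == b)   # number of matching states
        return (p - m) / p

    print(nominal_dissimilarity(["red", "small", "round"], ["red", "large", "round"]))  # 0.333...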

                Ordinal Variables
    -  An ordinal variable can be discrete or continuous
    -  Order is important, e.g., rank
    -  Can be treated like interval-scaled variables:
          replace x_{if} by its rank  r_{if} in {1, ..., M_f}
          map the range of each variable onto [0, 1] by replacing the
          i-th object in the f-th variable by

              z_{if} = (r_{if} - 1) / (M_f - 1)

          compute the dissimilarity using methods for interval-
          scaled variables
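A small sketch of this mapping (illustrative Python; it assumes the ordered list of states is known):

    # Map an ordinal value onto [0, 1] via z_if = (r_if - 1) / (M_f - 1).
    def ordinal_to_interval(value, ordered_states):
        r = ordered_states.index(value) + 1   # rank r_if in {1, ..., M_f}
        M = len(ordered_states)
        return (r - 1) / (M - 1)

    states = ["freshman", "sophomore", "junior", "senior"]
    print([ordinal_to_interval(s, states) for s in states])  # [0.0, 0.333..., 0.666..., 1.0]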


                Ratio-Scaled Variables

    -  Ratio-scaled variable: a positive measurement on a
       nonlinear scale, approximately at an exponential scale,
       such as Ae^{Bt} or Ae^{-Bt}
    -  Methods:
          treat them like interval-scaled variables
          apply a logarithmic transformation:  y_{if} = log(x_{if})
          treat them as continuous ordinal data and treat their
          ranks as interval-scaled


                Variables of Mixed Types
    -  A database may contain all six types of variables:
       symmetric binary, asymmetric binary, nominal,
       ordinal, interval and ratio.
    -  One may use a weighted formula to combine their effects:

           d(i, j) = ( Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) ) / ( Σ_{f=1}^{p} δ_ij^(f) )

       where δ_ij^(f) indicates whether variable f contributes to the
       comparison of i and j (e.g., it is 0 when x_{if} or x_{jf} is missing)
          if f is binary or nominal:
             d_ij^(f) = 0 if x_{if} = x_{jf}, and d_ij^(f) = 1 otherwise
          if f is interval-based: use the normalized distance
          if f is ordinal or ratio-scaled:
             compute the ranks r_{if} and  z_{if} = (r_{if} - 1) / (M_f - 1),
             and treat z_{if} as interval-scaled
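The combined measure can be sketched as follows (illustrative Python; it assumes each per-variable dissimilarity d_ij^(f) has already been computed and scaled to [0, 1]):

    # Delta-weighted average of per-variable dissimilarities for mixed-type objects.
    def mixed_dissimilarity(d_per_variable, delta):
        """d_per_variable: d_ij^(f) values in [0, 1]; delta: indicator weights delta_ij^(f)."""
        num = sum(w * d for w, d in zip(delta, d_per_variable))
        den = sum(delta)
        return num / den if den else 0.0

    # e.g. a nominal mismatch (1.0), a normalized interval distance (0.4), a missing value (delta = 0)
    print(mixed_dissimilarity([1.0, 0.4, 0.9], [1, 1, 0]))  # 0.7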




                Major Clustering Approaches

    -  Partitioning algorithms: Construct various partitions and
       then evaluate them by some criterion
    -  Hierarchy algorithms: Create a hierarchical decomposition
       of the set of data (or objects) using some criterion
    -  Density-based: based on connectivity and density functions
    -  Grid-based: based on a multiple-level granularity structure
    -  Model-based: A model is hypothesized for each of the
       clusters, and the idea is to find the best fit of the data to
       the given model

                Partitioning Algorithms: Basic Concept

     Partitioning method: Construct a partition of a database D
      of n objects into a set of k clusters
     Given k, find a partition of k clusters that optimizes the
      chosen partitioning criterion
           Global optimality: exhaustively enumerate all partitions
           Heuristic methods: k-means and k-medoids algorithms
           k-means (MacQueen’67): Each cluster is represented
            by the center of the cluster
           k-medoids or PAM (Partition around medoids)
            (Kaufman & Rousseeuw’87): Each cluster is
            represented by one of the objects in the cluster
                The K-Means Clustering Method

    -  Given k, the k-means algorithm is implemented in 4
       steps (see the sketch below):
          1. Partition the objects into k nonempty subsets
          2. Compute seed points as the centroids of the
             clusters of the current partition. The centroid is
             the center (mean point) of the cluster.
          3. Assign each object to the cluster with the nearest
             seed point.
          4. Go back to Step 2; stop when no new assignments
             are made.
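A compact sketch of these four steps in plain Python (illustrative only, not an optimized or canonical implementation):

    # Minimal k-means: assign objects to the nearest centroid, recompute the
    # centroids as cluster means, and repeat until nothing changes.
    import random

    def kmeans(points, k, max_iter=100):
        centroids = random.sample(points, k)          # pick k initial seed points
        clusters = [[] for _ in range(k)]
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for p in points:                          # assign each object to the nearest centroid
                idx = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
                clusters[idx].append(p)
            new_centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
                             for i, c in enumerate(clusters)]   # recompute cluster means
            if new_centroids == centroids:            # stop when the centroids no longer move
                break
            centroids = new_centroids
        return centroids, clusters

    pts = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
    print(kmeans(pts, k=2)[0])   # two centroids, one near each group of points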


                The K-Means Clustering Method
      Example
       (Figure: four 2-D scatter plots of the same point set, showing successive
        k-means iterations: objects are assigned to the nearest centroid, the
        centroids are recomputed, and the process repeats until the assignments
        no longer change.)
                Comments on the K-Means Method

   Strength
         Relatively efficient: O(tkn), where n is # objects, k is #
          clusters, and t is # iterations. Normally, k, t << n.
         Often terminates at a local optimum. The global optimum
          may be found using techniques such as: deterministic
          annealing and genetic algorithms
   Weakness
     Applicable only when mean is defined, then what about

      categorical data?
     Need to specify k, the number of clusters, in advance

     Unable to handle noisy data and outliers

     Not suitable to discover clusters with non-convex shapes

                Variations of the K-Means Method
     A few variants of the k-means which differ in
        Selection of the initial k means

        Dissimilarity calculations

        Strategies to calculate cluster means

     Handling categorical data: k-modes
        Replacing means of clusters with modes

        Using new dissimilarity measures to deal with

         categorical objects
        Using a frequency-based method to update modes of

         clusters
        A mixture of categorical and numerical data: k-

         prototype method
                Hierarchical Clustering
    -  Uses the distance matrix as the clustering criterion. This method
       does not require the number of clusters k as an input,
       but needs a termination condition

    (Figure: agglomerative clustering (AGNES) works bottom-up over steps 0-4,
     merging objects a, b, c, d, e into {a, b}, {d, e}, {c, d, e} and finally
     {a, b, c, d, e}; divisive clustering (DIANA) runs the same steps top-down,
     splitting the full set back into single objects.)
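For a concrete AGNES-style run, SciPy's hierarchical-clustering helpers can be used as in the sketch below (an illustrative library choice, not one prescribed by the slides):

    # Bottom-up (agglomerative) clustering of a few 2-D points with SciPy.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
    Z = linkage(X, method='average')                 # record of successive merges, closest pairs first
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut the hierarchy into (at most) 3 clusters
    print(labels)                                    # e.g. [1 1 2 2 3]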
More on Hierarchical Clustering Methods

    Major weakness of agglomerative clustering methods
       do not scale well: time complexity of at least O(n^2),
        where n is the total number of objects
       can never undo what was done previously

    Integration of hierarchical with distance-based clustering
       BIRCH (1996): uses CF-tree and incrementally adjusts

        the quality of sub-clusters
       CURE (1998): selects well-scattered points from the

        cluster and then shrinks them towards the center of the
        cluster by a specified fraction
       CHAMELEON (1999): hierarchical clustering using

        dynamic modeling
            Density-Based Clustering Methods
         Clustering based on density (local cluster criterion),
          such as density-connected points
         Major features:
            Discover clusters of arbitrary shape

            Handle noise

            One scan

            Need density parameters as termination condition

         Several interesting studies:
            DBSCAN: Ester, et al. (KDD’96)

            OPTICS: Ankerst, et al (SIGMOD’99).

            DENCLUE: Hinneburg & D. Keim (KDD’98)

            CLIQUE: Agrawal, et al. (SIGMOD’98)

                Density Concepts

    -  Core object (CO): an object with at least MinPts objects
       within its Eps-neighborhood (a neighborhood of radius Eps)
    -  Directly density-reachable (DDR): x is a CO and y lies in x's
       Eps-neighborhood
    -  Density-reachable: there exists a chain of DDR objects
       from x to y
    -  Density-based cluster: a set of density-connected
       objects that is maximal w.r.t. density-reachability
            Density-Based Clustering: Background
    -  Two parameters:
          Eps: Maximum radius of the neighbourhood
          MinPts: Minimum number of points in an Eps-
          neighbourhood of that point
    -  N_Eps(p) = {q belongs to D | dist(p, q) <= Eps}
    -  Directly density-reachable: A point p is directly density-
       reachable from a point q wrt. Eps, MinPts if
          1) p belongs to N_Eps(q)
          2) core point condition:  |N_Eps(q)| >= MinPts
    (Figure: p lies in the Eps-neighbourhood of the core point q; MinPts = 5, Eps = 1 cm.)
    Density-Based Clustering: Background (II)

    -  Density-reachable:
          A point p is density-reachable from a point q wrt. Eps,
          MinPts if there is a chain of points p_1, ..., p_n with
          p_1 = q and p_n = p such that p_{i+1} is directly
          density-reachable from p_i
    -  Density-connected:
          A point p is density-connected to a point q wrt. Eps,
          MinPts if there is a point o such that both p and q
          are density-reachable from o wrt. Eps and MinPts
    (Figures: a chain q, p_1, ..., p illustrating density-reachability, and a
     point o from which both p and q are density-reachable, illustrating
     density-connectivity.)
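These definitions translate almost directly into code; the sketch below is a simplified DBSCAN-style routine (illustrative only, not the published algorithm verbatim) built from a neighbourhood query and the core-point test.

    # Simplified DBSCAN-style clustering from the definitions above.
    import math

    def region_query(points, p, eps):
        """Indices of points in the Eps-neighbourhood of points[p]."""
        return [q for q in range(len(points)) if math.dist(points[p], points[q]) <= eps]

    def dbscan(points, eps, min_pts):
        labels = [None] * len(points)          # None = unvisited, -1 = noise, >= 0 = cluster id
        cluster_id = 0
        for p in range(len(points)):
            if labels[p] is not None:
                continue
            neighbors = region_query(points, p, eps)
            if len(neighbors) < min_pts:       # not a core point
                labels[p] = -1
                continue
            labels[p] = cluster_id             # start a new cluster from this core point
            queue = list(neighbors)
            while queue:                       # expand via directly density-reachable points
                q = queue.pop()
                if labels[q] == -1:
                    labels[q] = cluster_id     # former noise point becomes a border point
                if labels[q] is not None:
                    continue
                labels[q] = cluster_id
                q_neighbors = region_query(points, q, eps)
                if len(q_neighbors) >= min_pts:
                    queue.extend(q_neighbors)  # q is itself a core point
            cluster_id += 1
        return labels

    pts = [(1, 1), (1.2, 1.1), (1.1, 0.9), (5, 5), (5.1, 5.2), (4.9, 5.1), (9, 9)]
    print(dbscan(pts, eps=0.5, min_pts=3))     # two clusters plus one noise point (-1)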
            What Is Outlier Discovery?
    -  What are outliers?
          Objects that are considerably dissimilar from the
          remainder of the data
          Example: in sports, Michael Jordan, Wayne Gretzky, ...
         Problem
            Find top n outlier points

         Applications:
            Credit card fraud detection

            Telecom fraud detection

            Customer segmentation

            Medical analysis

   Outlier Discovery:
   Statistical Approaches

    -  Assume a model of the underlying distribution that generates
       the data set (e.g., a normal distribution)
    -  Use discordancy tests, which depend on
          the data distribution
          the distribution parameters (e.g., mean, variance)
          the number of expected outliers
    -  Drawbacks
          most tests are for a single attribute
          in many cases, the data distribution may not be known
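As a toy example of a discordancy test under an assumed normal model (a simple 3-sigma rule, which is only one of many possible tests; illustrative Python):

    # Flag values more than 3 standard deviations from the mean (assumed normal model).
    import statistics

    def three_sigma_outliers(values):
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)
        return [x for x in values if abs(x - mu) > 3 * sigma]

    data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
            12, 9, 10, 11, 10, 9, 12, 11, 10, 9, 95]
    print(three_sigma_outliers(data))   # [95]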
         Outlier Discovery: Distance-
         Based Approach

   Introduced to counter the main limitations imposed by
    statistical methods
      We need multi-dimensional analysis without knowing

       data distribution.
   Distance-based outlier: A DB(p, D)-outlier is an object O
    in a dataset T such that at least a fraction p of the
    objects in T lies at a distance greater than D from O
   Algorithms for mining distance-based outliers
      Index-based algorithm

      Nested-loop algorithm

      Cell-based algorithm
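A direct, nested-loop style reading of the DB(p, D) definition, shown as a hedged sketch rather than any of the optimized algorithms listed above:

    # Naive O(n^2) check of the DB(p, D)-outlier definition.
    import math

    def db_outliers(objects, p, D):
        """Objects O such that at least a fraction p of the other objects lie farther than D from O."""
        outliers = []
        for i, o in enumerate(objects):
            far = sum(1 for j, x in enumerate(objects) if j != i and math.dist(o, x) > D)
            if far >= p * (len(objects) - 1):
                outliers.append(o)
        return outliers

    pts = [(1.0, 1.0), (1.1, 1.2), (0.9, 1.0), (1.2, 0.8), (10.0, 10.0)]
    print(db_outliers(pts, p=0.9, D=3.0))   # [(10.0, 10.0)]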
                Summary

     Cluster analysis groups objects based on their similarity
      and has wide applications
     Measure of similarity can be computed for various types
      of data
     Clustering algorithms can be categorized into partitioning
      methods, hierarchical methods, density-based methods,
      grid-based methods, and model-based methods
     Outlier detection and analysis are very useful for fraud
      detection, etc. and can be performed by statistical,
      distance-based or deviation-based approaches
     There are still lots of research issues on cluster analysis,
      such as constraint-based clustering

				