Top-Down Subspace Clustering
Subspace Clustering for High Dimensional Data: A Review
            Lance Parsons, Ehtesham Haque, Huan Liu
           Department of Computer Science and Engineering
                    Arizona State University

                 Presented by: Muna Al-Razgan
Outline

   Top-down approach
   FINDIT
   Delta Clusters
   Empirical comparison
   Conclusion




Iterative Top-Down Subspace Search Methods

   Top-down algorithm:
      Finds an initial approximation of the clusters in the full feature space with
       equally weighted dimensions.
      Each dimension is then assigned a weight for each cluster.
      The updated weights are used in the next iteration to regenerate the clusters
       (a minimal sketch of this loop follows below).
   Requirements:
      Multiple iterations of an expensive clustering algorithm.
      Parameter tuning to find meaningful results.
      Input parameters: the number of clusters and the size of the subspaces.
      Most algorithms use sampling, which adds yet another input parameter.



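The following is a minimal sketch of this iterative loop, assuming a k-means-style assignment; the function name, the variance-based weighting, and the parameter l (dimensions kept per cluster) are illustrative assumptions, not the scheme of any particular algorithm:

    import numpy as np

    def top_down_subspace(X, k, n_iter=5, l=2, seed=0):
        """Sketch: cluster in full space with equal weights, then reweight
        dimensions per cluster and re-run the assignment."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        n, d = X.shape
        centers = X[rng.choice(n, size=k, replace=False)].copy()
        weights = np.ones((k, d))                    # equal weights at first
        for _ in range(n_iter):
            # assign each point to the nearest center under weighted distance
            dist = (((X[:, None, :] - centers) ** 2) * weights).sum(axis=2)
            labels = dist.argmin(axis=1)
            for c in range(k):
                members = X[labels == c]
                if len(members) == 0:
                    continue
                centers[c] = members.mean(axis=0)
                # weight low-spread dimensions more, keep only the l best
                w = 1.0 / (members.var(axis=0) + 1e-9)
                top = np.argsort(w)[-l:]
                weights[c] = 0.0
                weights[c, top] = w[top] / w[top].sum()
        return labels, weights

PROCLUS-style algorithms use sampled medoids rather than means; this sketch only illustrates the iterate-and-reweight structure described above.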
FINDIT: a Fast and Intelligent Subspace Clustering
Algorithm using Dimension Voting

   Basic idea: determine the correlated dimensions for each cluster based on a
    dimension-oriented distance and a dimension voting policy.
   Dimension-oriented distance (dod): measures the similarity between two points
    by counting the number of dimensions in which the two points' value difference
    is smaller than a specific threshold ε.
      Example: A = (4, 4, -4) and B = (6, 0, 0). With ε = 4, A is within ε of the
       origin O in all three dimensions while B is within ε in only two, so A is
       closer to O than B (see the sketch below).
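A minimal sketch of dod, here counting the dimensions in which two points differ by more than ε (so 0 means close in every dimension); the function name is illustrative:

    import numpy as np

    def dod(p, q, eps):
        """Dimension-oriented distance: number of dimensions in which
        p and q differ by more than eps."""
        return int((np.abs(np.asarray(p, float) - np.asarray(q, float)) > eps).sum())

    O, A, B = (0, 0, 0), (4, 4, -4), (6, 0, 0)
    print(dod(O, A, eps=4))   # 0: A is close to O in all three dimensions
    print(dod(O, B, eps=4))   # 1: B differs from O by more than eps in one dimension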
Estimating Correlated Dimensions based on Nearest
Neighbors

(Figure: voting results of the nearest neighbors of two points p and q.)

   Setup: a dataset of 1,000 points in 10-dimensional space, value range [0, 100],
    ε = 15, decision threshold = 7 votes (how such a threshold is chosen is
    explained under Property 1 below).
   Selected dimensions for point p: {2, 4, 6, 7}, so p belongs to cluster E.
   Selected dimensions for point q: {1, 3, 5, 6, 8, 9, 10}, so q belongs to cluster B.
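A minimal sketch of this nearest-neighbor dimension voting, assuming the V nearest neighbors of a point have already been found; names are illustrative:

    import numpy as np

    def select_dimensions(point, neighbors, eps, threshold):
        """Each nearest neighbor votes 'yes' on a dimension if it lies
        within eps of `point` there; dimensions with at least `threshold`
        votes are selected as correlated."""
        diffs = np.abs(np.asarray(neighbors, float) - np.asarray(point, float))
        votes = (diffs <= eps).sum(axis=0)                   # yes-votes per dimension
        return set(np.flatnonzero(votes >= threshold) + 1)   # 1-based, as on the slide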
FINDIT Overview

   Input:
      Dataset
      Cminsize = the minimum size of a cluster
      Dmindist = the minimum distance between two resultant clusters
   Clustering process:
      Sampling phase,
      Cluster forming phase,
      Data assigning phase.




Clustering Process

(Figure: overview of FINDIT's three phases: sampling, cluster forming, and data assigning.)
1- Sampling Phase
   Two samples, S and M (the medoid candidates), are generated by a random
    sampling method (|M| < |S|).
   The minimum size of S is obtained from the Chernoff bound

     |S| >= ξρ + ρ·ln(1/δ) + ρ·√( (ln(1/δ))² + 2ξ·ln(1/δ) )

     where N = dataset size, ρ = N / Cminsize, ξ = Sminsize (the minimum number of
     sample points required from the smallest cluster), and δ = the allowed failure
     probability.
   Example: N = 100,000, Cminsize = 5,000, ξ = 30, δ = 0.01.
      Therefore |S| is approximated to 1037: the sample contains at least 30 points
       from the smallest cluster with 99% probability.
   The minimum size of M is obtained the same way except ξ = 1, since M only needs
    to contain at least one point (one medoid candidate) from each cluster.
      The Chernoff bound for M is then approximated to 224.
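A minimal sketch of this bound as a function; the rounding differs slightly from the slide's figures:

    import math

    def min_sample_size(n, c_min_size, xi, delta):
        """Chernoff-bound sample size: with probability >= 1 - delta the
        sample holds at least xi points from every cluster of size
        >= c_min_size."""
        rho = n / c_min_size
        ln = math.log(1.0 / delta)
        return math.ceil(xi * rho + rho * ln
                         + rho * math.sqrt(ln ** 2 + 2 * xi * ln))

    print(min_sample_size(100_000, 5_000, 30, 0.01))   # 1038 (slide: ~1037)
    print(min_sample_size(100_000, 5_000, 1, 0.01))    # 223  (slide: ~224)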
2- Cluster Forming Phase
   Step 1 - Dimension Voting:
      Determines the correlated dimensions (KD) of the original cluster of every
       medoid in M.
      To find KD, the V nearest neighbors in S of each medoid in M vote on each
       dimension (as in the sketch above). Medoids that are near each other in dod
       are grouped together and named a medoid cluster.
         A medoid cluster (MCε) is a sort of wireframe that simulates the original
          cluster's size and correlated dimensions.
      The process is iterated several times with increasing ε:
         25 different values in [1/100 · valuerange, 25/100 · valuerange].
      A different MCε is generated for each iteration's ε.
         An evaluation criterion is used to choose the best ε and its
          corresponding MCε.
Cluster Forming Phase (cont.)

   Property 1 (the threshold decision: how many "yes" votes a dimension needs in
    order to be chosen as correlated):
      The number of "yes" votes on any non-correlated dimension follows the
       binomial distribution B(V, p) with p = 2ε / valuerange, provided that all
       voters belong to the same original cluster as the given medoid m.

   Example: V = 20, ε = 10, normalized valuerange = [0, 100], so p = 0.2.
      The threshold T that makes P(X >= T) ≈ 0 is 12, found using SPSS or a
       cumulative binomial distribution table (a sketch follows below).

   Property 2: the maximum number of reliable voters is Sminsize.
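A minimal sketch of that threshold computation; alpha stands in for the slide's "P(X >= T) ≈ 0", and its default value here is an assumption:

    from math import comb

    def vote_threshold(V, eps, value_range, alpha=2e-4):
        """Smallest T with P(X >= T) <= alpha for X ~ B(V, p), where
        p = 2*eps/value_range: chance 'yes' votes on an uncorrelated
        dimension almost never reach T."""
        p = 2 * eps / value_range
        tail = 1.0                                          # P(X >= 0)
        for t in range(V + 1):
            if tail <= alpha:
                return t
            tail -= comb(V, t) * p**t * (1 - p)**(V - t)    # now P(X >= t+1)
        return V + 1

    print(vote_threshold(20, 10, 100))   # 12, matching the slide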
Cluster Forming Phase (cont.)

   Step 2 - Cluster Simulating:
      Assign each point p from S to the nearest medoid m, based on the member
       assignment condition dodε(m, p) = 0.
      Verify whether there is really a cluster in the subspace of KDm.
      If more than one medoid satisfies the condition, p is assigned to the medoid
       with the largest number of key dimensions (see the sketch after this slide).
      Example:

        D = 5 (dimensions)       |M| = 5    ε = 2
        p1 = (5, 6, 6, 5, 8)
        p2 = (3, 2, 9, 9, 7)



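A minimal sketch of this assignment rule; the medoids and key-dimension sets below are hypothetical, invented only to exercise the condition:

    def assign_point(p, medoids, key_dims, eps):
        """Return the index of a medoid m with dod(m, p) = 0 over its key
        dimensions KDm, preferring the medoid with the most key dimensions;
        None means p is left for the outlier cluster."""
        best = None
        for i, m in enumerate(medoids):
            close = all(abs(p[d] - m[d]) <= eps for d in key_dims[i])
            if close and (best is None or len(key_dims[i]) > len(key_dims[best])):
                best = i
        return best

    medoids  = [(5, 5, 5, 5, 0), (3, 3, 9, 9, 9)]   # hypothetical medoids
    key_dims = [{0, 1, 2, 3}, {2, 3}]               # hypothetical KDs (0-based)
    print(assign_point((5, 6, 6, 5, 8), medoids, key_dims, eps=2))   # 0
    print(assign_point((3, 2, 9, 9, 7), medoids, key_dims, eps=2))   # 1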
3 – Data Assigning Phase

   All points in the original dataset are assigned either to the best medoid
    cluster or to the outlier cluster.

   Why is FINDIT fast?
      Because of the fast, sampling-based dimension voting process:
         FINDIT's dimension selection requires no iteration and no dimension
          partitioning.
         It does not suffer significantly from increased dimensionality or
          dataset size.
         In the original paper's experiments, FINDIT outperforms PROCLUS by more
          than 5 times when the dataset size is as large as 5,000,000.


δ-Clusters: Capturing Subspace Correlation in a Large Data Set

   A δ-cluster captures the coherence exhibited by a subset of objects on a
    subset of attributes.
   Example:
      d1 = (1, 5, 23, 12, 20)
      d2 = (11, 15, 33, 22, 30)
      d3 = (111, 115, 133, 122, 130)
      (Are these points close to each other?)
      d2 - d1 = (10, 10, 10, 10, 10)
      d3 - d2 = (100, 100, 100, 100, 100)
      They are far apart in Euclidean distance, yet each pair differs by a
       constant shift across all attributes, so they are coherent.



Measure coherence among objects

   One choice is the Pearson correlation:

     R = Σ (o1 - mean(o1)) · (o2 - mean(o2)) / √( Σ (o1 - mean(o1))² · Σ (o2 - mean(o2))² )

   Example:
      There are six movies (the first three are action, the last three are family).
      V1 = (8, 7, 9, 2, 2, 3)          V2 = (2, 1, 3, 8, 8, 9)
      Cluster 1 = the first three movies; cluster 2 = the last three movies.
      Over all six movies the Pearson correlation of V1 and V2 is strongly
       negative, yet inside each cluster it equals 1 (a quick check follows below).
      The δ-cluster model therefore introduces the concept of a base.

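A quick check of that claim:

    import numpy as np

    def pearson(a, b):
        return float(np.corrcoef(np.asarray(a, float), np.asarray(b, float))[0, 1])

    v1 = [8, 7, 9, 2, 2, 3]   # likes action, dislikes family
    v2 = [2, 1, 3, 8, 8, 9]   # likes family, dislikes action

    print(round(pearson(v1, v2), 2))           # -0.91 over all six movies
    print(round(pearson(v1[:3], v2[:3]), 2))   # 1.0 on the action movies
    print(round(pearson(v1[3:], v2[3:]), 2))   # 1.0 on the family movies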
The Model of the δ-Cluster

   A δ-cluster is a submatrix that exhibits some coherent tendency among its
    objects (rows) and attributes (columns).
   δ-clusters allow a limited number of missing values, controlled by an
    occupancy threshold α <= 1.
   Example: if α = 0.6,
      Figure (a) is not a δ-cluster;
      Figure (b) is a δ-cluster.

   Volume = the number of non-missing entries in the submatrix.



Example: a δ-cluster defined by I = {VPS8, EFB1, CYS3} and J = {CH1I, CH1D, CH2B};
its volume = 3 × 3 = 9.
Example for the base

   d_VPS8,J = (401 + 120 + 298) / 3 = 273
   d_I,CH1I = (401 + 318 + 322) / 3 = 347
   Similarly, d_EFB1,J = 190 and d_CYS3,J = 194;
    d_I,CH1D = 66 and d_I,CH2B = 244.
   d_I,J = (401 + 120 + 298 + 318 + 37 + 215 + 322 + 41 + 219) / 9 = 219
   A perfect δ-cluster satisfies d_ij = d_iJ + d_Ij - d_IJ:
      d_VPS8,CH1I = d_VPS8,J + d_I,CH1I - d_I,J
                  = 273 + 347 - 219
                  = 401 (the cell's actual value)
   Definition 3.4: the residue of an entry d_ij in a δ-cluster is
      r_ij = d_ij - d_iJ - d_Ij + d_IJ if d_ij is specified; otherwise r_ij = 0.
      r_VPS8,CH1I = 401 - 273 - 347 + 219 = 0, because this submatrix is a perfect
       δ-cluster: every residue in it is 0.
      The lower the residue, the stronger the coherence.
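A minimal check of the residues over the example submatrix (rows and columns ordered as on the slides):

    import numpy as np

    # Rows: VPS8, EFB1, CYS3; columns: CH1I, CH1D, CH2B.
    d = np.array([[401.0, 120.0, 298.0],
                  [318.0,  37.0, 215.0],
                  [322.0,  41.0, 219.0]])

    row_means = d.mean(axis=1, keepdims=True)   # d_iJ: 273, 190, 194
    col_means = d.mean(axis=0, keepdims=True)   # d_Ij: 347, 66, 244
    overall   = d.mean()                        # d_IJ: 219

    print(d - row_means - col_means + overall)  # all zeros: perfect delta-cluster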
Flexible Overlapped Clustering (FLOC)
   Action(x, c): a change of the membership of row (or column) x with respect to
    cluster c.
      There are k (the number of clusters) actions associated with each row (or
       column).
      The action that brings the most improvement needs to be identified.

   Gain: the reduction of c's residue incurred by performing Action(x, c).
      For each row and column, find the action with the highest gain (a sketch
       follows after the example below).
Cluster 1 residue = 1/4;   cluster 2 residue = 2/3.
Actions involving column 3: add col-3 to cluster 1, or remove col-3 from cluster 2.
Cluster 1's residue after adding col-3 = 1/3.
Gain for cluster 1 = 1/4 - 1/3 = -1/12.
Gain for cluster 2 = -1/3.
The best action for column 3 is the first one (add to cluster 1), since -1/12 > -1/3.
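A minimal sketch of evaluating such actions, scoring a cluster by the mean absolute residue of its submatrix; the exact residue score FLOC optimizes may differ, and the names here are illustrative:

    import numpy as np

    def avg_residue(d, rows, cols):
        """Mean absolute residue of the submatrix d[rows, cols]."""
        sub = d[np.ix_(rows, cols)]
        r = (sub - sub.mean(axis=1, keepdims=True)
                 - sub.mean(axis=0, keepdims=True) + sub.mean())
        return float(np.abs(r).mean())

    def gain(d, rows, cols, col):
        """Gain of toggling column `col` in cluster (rows, cols): the
        residue reduction; negative means the cluster gets worse."""
        new_cols = [c for c in cols if c != col] if col in cols else list(cols) + [col]
        return avg_residue(d, rows, cols) - avg_residue(d, rows, new_cols)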
Empirical Comparison

   Bottom-up approach: MAFIA
   Top-down approach: FINDIT

   Assumptions:
      The bottom-up approach should perform well, since it searches
       low-dimensional subspaces first.
      The top-down sampling scheme should scale well to large datasets.




MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets (Review)

   MAFIA is an extension of CLIQUE that uses an adaptive grid, based on the
    distribution of the data, to improve efficiency and cluster quality.

   It introduces parallelism to improve scalability.
   It creates a histogram to determine the minimum number of bins for each
    dimension, then combines adjacent cells of similar density to form larger
    cells (a sketch follows below).
   With this technique, each dimension is partitioned according to the data
    distribution.
      This gives more accurate cluster boundaries in each dimension.


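A minimal sketch of this adaptive binning for a single dimension; the merge criterion and the ratio parameter are assumptions standing in for MAFIA's actual rule:

    import numpy as np

    def adaptive_cells(values, n_fine=100, ratio=1.5):
        """Build a fine histogram, then merge adjacent bins whose counts
        are within `ratio` of each other, yielding fewer, variable-width
        cells that follow the data distribution."""
        counts, edges = np.histogram(values, bins=n_fine)
        cell_edges = [edges[0]]
        for i in range(1, n_fine):
            hi = max(counts[i], counts[i - 1])
            lo = min(counts[i], counts[i - 1])
            if hi > ratio * max(lo, 1):      # density jump: start a new cell
                cell_edges.append(edges[i])
        cell_edges.append(edges[-1])
        return np.asarray(cell_edges)

    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(0, 1, 1000), rng.uniform(5, 15, 200)])
    print(len(adaptive_cells(data)) - 1)     # fewer cells than the 100 fine bins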
MAFIA: Adaptive Grids

(Figure: uniform vs. adaptive grid bins over a one-dimensional data distribution.)

   Figure (a) illustrates the uniform grid used in CLIQUE.
      The grid is not based on the data distribution.
      It generates many more candidate dense units.
   Figure (b) illustrates the adaptive grid used in MAFIA.
      The grid follows the data distribution.
      Very few candidate dense units are generated in the grid.
Scalability

(Figure slides: scalability results; one experiment fixes the number of instances at 100,000.)
Subspace Detection & Accuracy

(Figure slides: subspace detection and accuracy results.)
Conclusion

   Top-down approach:

      Uses multiple iterations of selection and clustering.
      Uses sampling techniques.
      The uncovered clusters often have a hyper-spherical shape.
      The clusters form non-overlapping partitions of the dataset.
      Many algorithms require input parameters such as the number of clusters and
       the size of the subspaces.




Conclusion (Cont.)

   Bottom-up approach:
      Selects subspaces first, then evaluates the instances in them.
      Adds one dimension at a time (able to find clusters in small subspaces).
      Finds clusters of various shapes and sizes.
      Main input parameter: a density threshold.

   Finding meaningful and useful clusters depends on selecting an appropriate
    technique and properly tuning the algorithm via its input parameters.
   Domain knowledge of the dataset is a plus for subspace clustering.



References

   K.-G. Woo and J.-H. Lee. FINDIT: A Fast and Intelligent Subspace Clustering
    Algorithm Using Dimension Voting. PhD thesis, Korea Advanced Institute of
    Science and Technology, Taejon, Korea, 2002.

   J. Yang, W. Wang, H. Wang, and P. Yu. δ-Clusters: Capturing Subspace
    Correlation in a Large Data Set. In Proceedings of the 18th International
    Conference on Data Engineering (ICDE), pages 517-528, 2002.

   L. Parsons, E. Haque, and H. Liu. Subspace Clustering for High Dimensional
    Data: A Review. ACM SIGKDD Explorations Newsletter, 6(1):90-105, 2004.


