Top-Down Subspace Clustering
Subspace Clustering for High Dimensional Data: A Review
Lance Parsons , Ehtesham Haque, Huan Liu
Department of Computer Science and Engineering
Arizona State University

Presented by: Muna Al-Razgan
Outline

- Top-down approach
- FINDIT
- δ-clusters
- Empirical comparison
- Conclusion

Iterative Top-Down Subspace Search Methods

Top-down algorithm:
- Finds an initial approximation of the clusters in the full feature space with equally weighted dimensions.
- Each dimension is then assigned a weight for each cluster.
- The updated weights are used in the next iteration to regenerate the clusters.

Requirements:
- Multiple iterations of an expensive clustering algorithm.
- Parameter tuning to find meaningful results.
- Input parameters: the number of clusters and the size of the subspaces.
- Most algorithms also use sampling, which adds another input parameter.

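As a rough sketch, the iterative scheme above can be written as follows. The k-means-style full-space clustering and the inverse-variance weighting rule are illustrative assumptions; each concrete algorithm (e.g. PROCLUS, FINDIT) uses its own clustering and weighting steps.

```python
import random

def top_down_subspace(points, k, n_iters=5):
    """Generic top-down subspace clustering loop (sketch).

    Starts from equally weighted dimensions, clusters in the full space,
    then re-derives per-cluster dimension weights for the next iteration."""
    dim = len(points[0])
    weights = [[1.0] * dim for _ in range(k)]          # equal weights at first
    centers = [list(c) for c in random.sample(points, k)]
    for _ in range(n_iters):
        # Assign each point to the nearest center under weighted distance
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sum(
                weights[c][d] * (p[d] - centers[c][d]) ** 2 for d in range(dim)))
            clusters[i].append(p)
        # Update centers; give high weight to low-variance dimensions
        for i, cl in enumerate(clusters):
            if not cl:
                continue
            centers[i] = [sum(p[d] for p in cl) / len(cl) for d in range(dim)]
            var = [sum((p[d] - centers[i][d]) ** 2 for p in cl) / len(cl)
                   for d in range(dim)]
            weights[i] = [1.0 / (1.0 + v) for v in var]
    return clusters, weights
```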
FINDIT: A Fast and Intelligent Subspace Clustering Algorithm Using Dimension Voting

Basic idea: determine the correlated dimensions of each cluster based on a dimension-oriented distance and a dimension voting policy.

Dimension-Oriented Distance (dod): measures the similarity between two points by counting the number of dimensions in which the two points' value difference is smaller than a specified threshold ε.

Example: with A = (4, 4, −4), B = (6, 0, 0), and the origin O, point A is closer to O than point B (for a threshold such as ε = 5, A lies within ε of O in all three dimensions, B in only two).
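A minimal sketch of dod for the example above, treating the count of non-matching dimensions as the distance (the paper's full definition additionally restricts the count to a medoid's key dimensions):

```python
def dod(p, q, eps):
    """Dimension-oriented distance (sketch): the number of dimensions in
    which p and q differ by at least eps. Smaller means more similar."""
    return sum(1 for a, b in zip(p, q) if abs(a - b) >= eps)

O, A, B = (0, 0, 0), (4, 4, -4), (6, 0, 0)
print(dod(A, O, eps=5))  # 0 -> A agrees with O within eps in every dimension
print(dod(B, O, eps=5))  # 1 -> B strays in the first dimension, so A is closer
```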
Estimating Correlated Dimensions based on Nearest
Neighbors

Dataset: 1,000 points in 10-dimensional space, value range [0, 100], ε = 15, decision threshold = 7 votes.

The selected dimensions for point p are {2, 4, 6, 7} (cluster E); the selected dimensions for point q are {1, 3, 5, 6, 8, 9, 10} (cluster B).
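A sketch of the voting step with hypothetical data (the neighbor values below are made up for illustration; only the ε and the 7-vote decision threshold come from the slide):

```python
def vote_dimensions(point, neighbors, eps, threshold):
    """FINDIT-style dimension voting (sketch): each nearest neighbor casts a
    'yes' vote on every dimension where it lies within eps of the point;
    dimensions with at least `threshold` votes are selected as correlated."""
    dims = range(len(point))
    votes = [sum(1 for n in neighbors if abs(n[d] - point[d]) <= eps)
             for d in dims]
    return [d for d in dims if votes[d] >= threshold]

# Hypothetical 2-D illustration: neighbors agree with p only in dimension 0
p = (50, 50)
neighbors = [(52, 10), (48, 90), (55, 25), (47, 75),
             (51, 5), (49, 95), (53, 30), (46, 70)]
print(vote_dimensions(p, neighbors, eps=15, threshold=7))  # [0]
```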
FINDIT Overview

Input:
- the dataset
- Cminsize: the minimum size of a cluster
- Dmindist: the minimum distance between two resultant clusters

Clustering process:
1. Sampling phase
2. Cluster forming phase
3. Data assigning phase

Clustering Process

[Figure: flow of the three phases — sampling, cluster forming, data assigning]
1- Sampling Phase

- Sample sets S and M (medoids) are generated by random sampling (|M| < |S|).
- The minimum size of S is obtained from a Chernoff bound:

  |S| ≥ ξk + k·ln(1/δ) + k·√( (ln(1/δ))² + 2ξ·ln(1/δ) )

  where N is the dataset size, k = N / Cminsize, ξ = Sminsize is the minimum number of points the smallest cluster must contribute to the sample, and δ is the allowed failure probability.
- Example: N = 100,000, Cminsize = 5,000 (so k = 20), ξ = 30, δ = 0.01 gives |S| ≈ 1037: with this sample size the smallest cluster contributes at least 30 points with 99% probability.
- The minimum size of M is obtained the same way but with ξ = 1 (a single sampled point per cluster is enough to serve as a medoid candidate), giving |M| ≈ 224.
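The bound above can be evaluated directly; the slide's |S| ≈ 1037 comes out exactly, while the ξ = 1 case lands slightly below the quoted 224:

```python
import math

def chernoff_sample_size(N, c_minsize, xi, delta):
    """Minimum sample size so that the smallest cluster (size >= c_minsize)
    contributes at least xi points with probability 1 - delta.
    Standard Chernoff-bound sampling formula, as used by FINDIT."""
    k = N / c_minsize
    ln_d = math.log(1.0 / delta)
    return xi * k + k * ln_d + k * math.sqrt(ln_d ** 2 + 2 * xi * ln_d)

# Slide example: N = 100,000, Cminsize = 5,000, xi = 30, delta = 0.01
s = chernoff_sample_size(100_000, 5_000, 30, 0.01)
print(round(s))  # 1037, matching |S| on the slide

# For M, xi = 1 (one medoid candidate per cluster suffices)
m = chernoff_sample_size(100_000, 5_000, 1, 0.01)
print(round(m))  # ~222 (the slide reports 224, presumably a rounding difference)
```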

2- Cluster Forming Phase

Step 1: Dimension Voting
- Determine each medoid's correlated (key) dimensions KD.
- To find KD, apply a sequential search over the V nearest neighbors (drawn from S) of each medoid in M. Medoids that are near each other in dod are grouped together into a medoid cluster.
- Medoid cluster (MCε): a sort of wireframe that simulates an original cluster's size and correlated dimensions.
- Iterate several times with increasing ε: 25 different values in [1/100 · value range, 25/100 · value range]. A different MCε is generated for each ε, and an evaluation criterion chooses the best ε and its corresponding MCε.

Cluster Forming Phase (cont.)

Property 1 (threshold deciding how many "yes" votes a dimension needs in order to count as correlated): the number of "yes" votes on any non-correlated dimension follows the binomial distribution B(V, p) with p = 2ε / value range, provided that all voters belong to the same original cluster as the given medoid m.

Example: V = 20, ε = 10, normalized value range [0, 100], so p = 0.2. The threshold T that drives P(X ≥ T) to ≈ 0% is 12, obtained from SPSS or a cumulative binomial distribution table.

Property 2: the maximum number of reliable voters is Sminsize.

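Property 1's threshold can be reproduced with a short computation; the exact numeric cutoff standing in for "P(X ≥ T) ≈ 0%" is an assumption, since the slide reads the value from SPSS or a binomial table:

```python
from math import comb

def binom_tail(n, p, t):
    """P(X >= t) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(t, n + 1))

V = 20
p = 2 * 10 / 100   # p = 2*eps / value range = 0.2
cutoff = 2e-4      # assumed numeric stand-in for "P(X >= T) ~ 0%"
T = next(t for t in range(V + 1) if binom_tail(V, p, t) < cutoff)
print(T)           # 12, matching the threshold on the slide
```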
Cluster Forming Phase (cont.)

Step 2: Cluster Simulating
- Assign each point p from S to the nearest medoid m, using the member-assignment condition dodε(m → p) = 0.
- Verify whether there really is a cluster in the subspace KDm.
- If several medoids qualify, p is assigned to the medoid with the largest set of key dimensions.
- Example: D = 5 dimensions, |M| = 5, ε = 2, p1 = (5, 6, 6, 5, 8), p2 = (3, 2, 9, 9, 7).
3 – Data Assigning Phase

- All points in the original dataset are assigned either to the best medoid cluster or to the outlier cluster.

Why is FINDIT fast?
- The dimension voting process operates on a sample.
- FINDIT's dimension selection requires no iteration and no dimension partitioning.
- It does not suffer significantly from increased dimensionality or dataset size.
- In the original paper's experiments, FINDIT outperforms PROCLUS by more than 5 times when the dataset size is as large as 5,000,000 points.

δ-Clusters: Capturing Subspace Correlation in a Large Data Set

δ-cluster: captures the coherence exhibited by a subset of objects on a subset of attributes.

Example:
d1 = (1, 5, 23, 12, 20)
d2 = (11, 15, 33, 22, 30)
d3 = (111, 115, 133, 122, 130)
Are these points close to each other? Not in absolute distance, but their differences are constant across all attributes:
d2 − d1 = (10, 10, 10, 10, 10)
d3 − d2 = (100, 100, 100, 100, 100)
so all three objects follow the same pattern.

Measure coherence among objects

One choice is the Pearson correlation:

r(o1, o2) = Σi (o1i − ō1)(o2i − ō2) / √( Σi (o1i − ō1)² · Σi (o2i − ō2)² )

Example: there are six movies (the first three action, the last three family):
V1 = (8, 7, 9, 2, 2, 3)    V2 = (2, 1, 3, 8, 8, 9)
Cluster 1 = the first three movies, Cluster 2 = the last three movies.
Over all six movies the Pearson correlation is strongly negative, even though V1 and V2 agree perfectly within each cluster; the δ-cluster model addresses this by introducing the base concept.

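The claim can be checked directly with a plain-Python Pearson correlation; the perfect within-cluster correlations show why a subspace-level notion of coherence is needed:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

V1 = (8, 7, 9, 2, 2, 3)
V2 = (2, 1, 3, 8, 8, 9)
print(round(pearson(V1, V2), 2))          # -0.91: strongly negative overall
print(round(pearson(V1[:3], V2[:3]), 2))  # 1.0 within the action cluster
print(round(pearson(V1[3:], V2[3:]), 2))  # 1.0 within the family cluster
```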
The Model of the δ-Cluster

δ-cluster: a submatrix that exhibits some coherent tendency among its objects (rows) and attributes (columns).

δ-clusters allow a limited number of missing values, controlled by an occupancy threshold α ≤ 1.

Example: if α = 0.6, Figure (a) is not a δ-cluster but Figure (b) is.

Volume: the number of non-missing entries in the submatrix.

A δ-cluster defined by I = {VPS8, EFB1, CYS3} and J = {CH1I, CH1D, CH2B}; volume = 3 × 3 = 9.
Example for the base

d(VPS8, J) = (401 + 120 + 298) / 3 = 273
d(I, CH1I) = (401 + 318 + 322) / 3 = 347
Similarly, d(EFB1, J) = 190, d(CYS3, J) = 194, d(I, CH1D) = 66, and d(I, CH2B) = 244.
d(I, J) = (401 + 120 + 298 + 318 + 37 + 215 + 322 + 41 + 219) / 9 = 219

A perfect δ-cluster satisfies d(i, j) = d(i, J) + d(I, j) − d(I, J). For the cell d(VPS8, CH1I) = 401:
d(VPS8, CH1I) = d(VPS8, J) + d(I, CH1I) − d(I, J) = 273 + 347 − 219 = 401, the same value.

Definition 3.4: the residue of an entry d(i, j) in a δ-cluster is r(i, j) = d(i, j) − d(i, J) − d(I, j) + d(I, J) if d(i, j) is specified, and r(i, j) = 0 otherwise.

r(VPS8, CH1I) = 401 − 273 − 347 + 219 = 0: the residue vanishes because this submatrix is a perfect δ-cluster.

The lower the residue, the stronger the correlation.

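The residues of the example submatrix can be computed directly; the matrix values below are the nine cell values quoted in the d(I, J) sum above:

```python
# Residue computation for the slide's 3x3 delta-cluster:
# rows I = (VPS8, EFB1, CYS3), columns J = (CH1I, CH1D, CH2B)
D = [[401, 120, 298],
     [318,  37, 215],
     [322,  41, 219]]

n_rows, n_cols = len(D), len(D[0])
row_mean = [sum(r) / n_cols for r in D]                            # d(i, J)
col_mean = [sum(r[j] for r in D) / n_rows for j in range(n_cols)]  # d(I, j)
all_mean = sum(map(sum, D)) / (n_rows * n_cols)                    # d(I, J)

residue = [[D[i][j] - row_mean[i] - col_mean[j] + all_mean
            for j in range(n_cols)] for i in range(n_rows)]
print(residue[0][0])  # 0.0: 401 - 273 - 347 + 219 (perfect delta-cluster)
```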
Flexible Overlapped Clustering (FLOC)

Action(x, c): a change of the membership of row or column x with respect to cluster c.
- There are k actions (one per cluster) associated with each row (or column).
- The action that brings the most improvement must be identified.

Gain: the reduction of c's residue obtained by performing Action(x, c); FLOC repeatedly finds and performs the action with the highest gain.

Example:
Cluster 1 residue = 1/4; Cluster 2 residue = 2/3.
Candidate actions for column 3: (a) add col-3 to cluster 1, or (b) remove col-3 from cluster 2.
Cluster 1's residue after adding col-3 = 1/3, so gain(a) = 1/4 − 1/3 = −1/12.
gain(b) = −1/3.
The best action for column 3 is (a), adding it to cluster 1, since −1/12 > −1/3.
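The residue-gain idea can be sketched with hypothetical data. The function below only handles adding a column; real FLOC evaluates all k add/remove actions per row and column, and the example matrices are made up for illustration:

```python
def mean_residue(D):
    """Average absolute residue of a submatrix (lower = more coherent)."""
    nr, nc = len(D), len(D[0])
    row_mean = [sum(r) / nc for r in D]
    col_mean = [sum(r[j] for r in D) / nr for j in range(nc)]
    all_mean = sum(map(sum, D)) / (nr * nc)
    return sum(abs(D[i][j] - row_mean[i] - col_mean[j] + all_mean)
               for i in range(nr) for j in range(nc)) / (nr * nc)

def gain_of_adding_column(D, col):
    """Gain of adding column `col` (one value per row) to cluster D:
    positive when the extended cluster becomes more coherent."""
    extended = [row + [v] for row, v in zip(D, col)]
    return mean_residue(D) - mean_residue(extended)

# Hypothetical cluster: perfectly coherent (residue 0)
cluster = [[1, 5], [11, 15], [21, 25]]
print(gain_of_adding_column(cluster, [9, 19, 29]))      # 0.0: coherent column
print(gain_of_adding_column(cluster, [9, 40, 0]) < 0)   # True: incoherent column
```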
Empirical Comparison

- Bottom-up approach: MAFIA
- Top-down approach: FINDIT

Assumptions:
- The bottom-up approach is expected to perform well, since it searches lower-dimensional subspaces.
- The top-down sampling scheme should scale well to large datasets.

MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets (Review)

- MAFIA is an extension of CLIQUE that uses an adaptive grid, based on the distribution of the data, to improve efficiency and cluster quality.
- It introduces parallelism to improve scalability.
- It creates a histogram to determine the minimum number of bins for each dimension, then combines adjacent cells of similar density to form larger cells.
- With this technique each dimension is partitioned according to the data distribution, which gives more accurate cluster boundaries in each dimension.

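The bin-merging idea can be sketched as follows; the similarity criterion (densities within a fixed ratio) and the example counts are illustrative assumptions, not MAFIA's exact rule:

```python
def adaptive_bins(counts, bin_width, ratio=2.0):
    """Merge adjacent equal-width histogram cells of similar density into
    larger cells, in the spirit of MAFIA's adaptive grid (sketch).
    Returns (total width, total count) pairs for the merged cells."""
    merged = [[bin_width, counts[0]]]
    for c in counts[1:]:
        w, n = merged[-1]
        d_prev = n / w         # density of the growing coarse cell
        d_cur = c / bin_width  # density of the next fine cell
        # cells whose densities lie within a factor of `ratio` are merged
        if max(d_cur, d_prev) <= ratio * max(min(d_cur, d_prev), 1e-9):
            merged[-1] = [w + bin_width, n + c]
        else:
            merged.append([bin_width, c])
    return [tuple(m) for m in merged]

# Fine histogram of one dimension with a dense region in the middle
counts = [1, 1, 2, 9, 10, 9, 1, 1]
print(adaptive_bins(counts, bin_width=1.0))
# [(3.0, 4), (3.0, 28), (2.0, 2)] -> three coarse cells: sparse, dense, sparse
```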
- Figure (a) illustrates the uniform grid used in CLIQUE: the grid is not based on the data distribution, so many more candidate dense units are generated.
- Figure (b) illustrates the adaptive grid used in MAFIA, which generates very few candidate dense units.
Scalability

[Figures: scalability results; number of instances = 100,000]

Subspace Detection & Accuracy

[Figures: subspace detection and accuracy results]
Conclusion

Top-down approach:
- Uses multiple iterations of selection and clustering.
- Uses a sampling technique.
- The uncovered clusters often have hyper-spherical shapes.
- The clusters form non-overlapping partitions of the dataset.
- Many algorithms require input parameters: the number of clusters and the size of the subspaces.

Conclusion (cont.)

Bottom-up approach:
- Selects a subspace, then evaluates the instances in it.
- Adds one dimension at a time (able to find clusters in small subspaces).
- Finds clusters of various shapes and sizes.
- Main input parameter: the density threshold.

Finding meaningful and useful clusters depends on selecting an appropriate technique and properly tuning the algorithm via its input parameters. Domain knowledge of the dataset is a plus for subspace clustering.

References

- K.-G. Woo and J.-H. Lee. FINDIT: A Fast and Intelligent Subspace Clustering Algorithm Using Dimension Voting. PhD thesis, Korea Advanced Institute of Science and Technology, Taejon, Korea, 2002.

- J. Yang, W. Wang, H. Wang, and P. Yu. δ-clusters: Capturing Subspace Correlation in a Large Data Set. In Proceedings of the 18th International Conference on Data Engineering, pages 517–528, 2002.

- L. Parsons, E. Haque, and H. Liu. Subspace Clustering for High Dimensional Data: A Review. ACM SIGKDD Explorations Newsletter, pages 90–105. ACM Press, 2004.

