# GAD: General Activity Detection for Fast Clustering on Large Data


Xin Jin, Sangkyum Kim, Jiawei Han,
Liangliang Cao and Zhijun Yin
SDM’09
Outline
   Introduction
   GAD for very large clustering
   Experimental evaluation
   Conclusion
   My thoughts
Introduction
• The paper focuses on developing fast core clustering algorithms.
• Contribution
   ◦ Exploits activity detection for fast clustering in different scenarios
   ◦ Achieves a much higher speed than k-means
• Notations
   ◦ NC(i, p, j): pattern p’s j-th nearest center at iteration i
   ◦ D-NC(i, p, j): distance from pattern p to its j-th nearest center at iteration i
   ◦ Dist(i, p, Cj): distance between pattern p and center Cj at iteration i
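The notation can be made concrete with a small sketch. It assumes Euclidean distance and drops the iteration index i by taking the centers of the current iteration as an argument; the function names `dist`, `nc` and `d_nc` are illustrative and do not come from the paper.

```python
import math

def dist(p, c):
    """Dist(p, Cj): Euclidean distance between pattern p and center c."""
    return math.dist(p, c)

def nearest_centers(p, centers, m):
    """Indices of p's m nearest centers, closest first."""
    order = sorted(range(len(centers)), key=lambda j: dist(p, centers[j]))
    return order[:m]

def nc(p, centers, j):
    """NC(p, j): index of p's j-th nearest center (j is 1-based)."""
    return nearest_centers(p, centers, j)[j - 1]

def d_nc(p, centers, j):
    """D-NC(p, j): distance from p to its j-th nearest center."""
    return dist(p, centers[nc(p, centers, j)])

# Centers of one iteration (made-up values for illustration).
centers = [(0.0, 0.0), (3.0, 0.0), (0.0, 5.0)]
p = (1.0, 0.0)
print(nc(p, centers, 1), d_nc(p, centers, 2))  # prints: 0 2.0
```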
Definition and Concepts
• S: search methods, A: activity states, m: the number of nearest centers, B: boundary
• Search methods
   ◦ Full search: find a pattern’s m nearest centers among all centers
   ◦ Whole full search: perform a full search for all the patterns
   ◦ Partial search: search among the active centers
   ◦ m-search: search among a pattern’s previous m nearest centers
   ◦ 0-search: a special case of m-search
   ◦ m-boundary
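The search methods differ mainly in which centers they consider as candidates. The sketch below illustrates that idea only: it assumes Euclidean distance, the helper names are hypothetical, and how the paper merges a partial-search result with a pattern's previous nearest centers is not stated in the slides.

```python
import math

def rank(p, centers, candidates, m):
    """Return up to m candidate center indices, closest to p first."""
    return sorted(candidates, key=lambda j: math.dist(p, centers[j]))[:m]

def full_search(p, centers, m):
    """Full search: every center is a candidate."""
    return rank(p, centers, range(len(centers)), m)

def partial_search(p, centers, active, m):
    """Partial search: only the active centers are candidates."""
    return rank(p, centers, active, m)

def m_search(p, centers, previous):
    """m-search: re-rank only the pattern's previous m nearest centers."""
    return rank(p, centers, previous, len(previous))

centers = [(0.0, 0.0), (2.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
p = (1.2, 0.0)
print(full_search(p, centers, 2))             # prints: [1, 0]
print(m_search(p, centers, [0, 1]))           # prints: [1, 0]
print(partial_search(p, centers, [2, 3], 1))  # prints: [2]
```

A whole full search is then just `full_search` applied to every pattern, which costs as much as one ordinary k-means assignment pass.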
• General algorithm
   ◦ Step 1. Initialization
   ◦ Step 2. Decide which search method to use
   ◦ Step 3. Update pattern p’s nearest centers according to step 2
   ◦ Step 4. Get the next pattern
   ◦ Step 5. Assign each pattern to its nearest center
   ◦ Step 6. Go to step 2 until all the centers have converged
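The loop above can be sketched in a heavily simplified form. This is not the paper's algorithm: it assumes m = 1, Euclidean distance, and a single rule for choosing the search method (full search if the pattern's own center moved, otherwise a search over that center plus the centers that moved). It does not use the m-boundary, and the name `activity_kmeans` is hypothetical.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def activity_kmeans(patterns, centers, max_iter=100):
    centers = [tuple(c) for c in centers]
    k = len(centers)
    # Step 1: initialization with a whole full search.
    assign = [min(range(k), key=lambda j: math.dist(p, centers[j]))
              for p in patterns]
    for _ in range(max_iter):
        # Recompute the centers; a center is "active" if it moved.
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(patterns, assign) if a == j]
            new_centers.append(mean(members) if members else centers[j])
        active = {j for j in range(k) if new_centers[j] != centers[j]}
        centers = new_centers
        if not active:  # Step 6: all centers have converged
            break
        for idx, p in enumerate(patterns):  # Step 4: next pattern
            cur = assign[idx]
            # Step 2: search method decision.
            if cur in active:
                candidates = range(k)        # full search
            else:
                candidates = [cur, *active]  # static centers cannot get closer
            # Steps 3 and 5: update and assign to the nearest center.
            assign[idx] = min(candidates,
                              key=lambda j: math.dist(p, centers[j]))
    return centers, assign

patterns = [(0, 0), (0, 1), (10, 10), (10, 11)]
final_centers, labels = activity_kmeans(patterns, [(0, 0), (1, 1)])
print(final_centers, labels)
```

Restricting the candidates is safe in this sketch because the distance from a pattern to a center that did not move is unchanged, so only a moved center can overtake the current one.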
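[Figure: m = 3. Left-to-right order of pattern P and centers C1 to C6 in each panel]
i=1:           P  C1  C2  C3  C4  C5  C6
i=2, case (1): P  C1  C2  C3  C4  C5  C6
i=2, case (2): P  C2  C1  C4  C3  C5  C6
i=2, result 1: P  C1  C2  C3  C6  C5  C4
i=2, result 2: P  C2  C1  C4  C3  C5  C6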
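[Figure: m = 3. Left-to-right order of pattern P and centers C1 to C6 in each panel]
i=1:           P  C1  C2  C3  C4  C5  C6
i=2, case (1): P  C1  C2  C3  C4  C5  C6
i=2, case (2): P  C2  C3  C1  C4  C5  C6
i=2, result 1: P  C1  C2  C3  C4  C5  C6
i=2, result 2: P  C2  C3  C1  C4  C5  C6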
Full Search
[Figure: full search with m = 3. Left-to-right order of pattern P and centers C1 to C6 in each panel]
i=1:         P  C1  C2  C3  C4  C5  C6
i=2:         P  C2  C1  C3  C5  C4  C6
i=2, result: P  C2  C1  C3  C5  C4  C6
• Build two kd-trees
   ◦ Full kd-tree
   ◦ Active kd-tree
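The slides say only that two kd-trees are built. The sketch below assumes that the full tree indexes every center and the active tree indexes only the active centers, so that a partial search can query the smaller tree. The minimal kd-tree and the names `build_kdtree` and `nearest` are illustrative, not the authors' implementation.

```python
import math

def build_kdtree(items, depth=0):
    """Build a kd-tree from (index, point) pairs; returns nested tuples or None."""
    if not items:
        return None
    axis = depth % len(items[0][1])
    items = sorted(items, key=lambda it: it[1][axis])
    mid = len(items) // 2
    return (items[mid], axis,
            build_kdtree(items[:mid], depth + 1),
            build_kdtree(items[mid + 1:], depth + 1))

def nearest(node, p, best=None):
    """Return (distance, index) of the stored point closest to p."""
    if node is None:
        return best
    (idx, point), axis, left, right = node
    d = math.dist(p, point)
    if best is None or d < best[0]:
        best = (d, idx)
    diff = p[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, p, best)
    if abs(diff) < best[0]:  # the far side may still hold a closer point
        best = nearest(far, p, best)
    return best

centers = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
active = {1, 4}  # indices of the centers assumed to be active
full_tree = build_kdtree(list(enumerate(centers)))
active_tree = build_kdtree([(j, centers[j]) for j in sorted(active)])

p = (0.5, 0.5)
print(nearest(full_tree, p)[1])    # prints: 0
print(nearest(active_tree, p)[1])  # prints: 4
```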
Experimental Evaluation
Conclusion
• The paper proposes a General Activity Detection (GAD) framework for fast clustering.
• It is several times faster than k-means, and the best speedup can be as high as 10 times.
My thoughts
• Although this paper provides a new core clustering algorithm, it is unclear whether it can be used on data streams.
• Does it produce the same result for different initializations?