Robust Methods for Locating Multiple Dense Regions in Complex Datasets

Gunjan Gupta

Department of Electrical & Computer Engineering
The University of Texas at Austin

October 30, 2006
Why cluster only a part of the data into dense clusters?

[figure: example 2-d dataset]

Exhaustive clustering (K-Means) result:

[figure: K-Means partition of the entire dataset]
Goal: cluster only a (small) part of the data into multiple dense clusters.

[figure: desired result]
Why cluster only a part of the data into dense clusters?

•   Little or no labeled data is available.

•   Only a part of the data clusters well.

•   Only a fraction of the data is relevant.
Application Scenarios

•   Bioinformatics:
    –   Gene microarray data: 100s of strongly correlated genes out of 1000s.
    –   Phylogenetics data.
•   Document retrieval:
    –   The user is interested in only a few highly relevant documents.
•   Market basket data:
    –   Only some customers have highly correlated behaviors.
•   Feature selection.
Practical Issues

•   How many dense clusters are there, and where are they located?

•   What fraction of the data should be clustered?

•   What notion of density to use?

•   All clusters are not necessarily equally dense.

•   Choice of model and distance measure.
In this thesis we introduce two new approaches for finding k dense clusters:

  • A very general parametric approach:

             Bregman Bubble Clustering

  • A non-parametric approach that significantly extends existing methods:

             Automated Hierarchical Density Shaving
Outline

• Parametric approach
   –   Bregman Bubble Clustering
   –   Bregman Bubble Soft Clustering
   –   Pressurization & Seeding
   –   Results
• Non-parametric approach
   – Automated Hierarchical Density Shaving
   – Results
• Comparison
• Demo
Finding a single dense cluster: One Class-IB [Crammer et al. '04]

•   Perhaps the first parametric density-based approach.
•   Uses the notion of a Bregmanian ball:
    –   Distance measure: Bregman divergence.
    –   Bregmanian ball cost = average Bregman divergence from the center:

            cost(c) = \frac{1}{|X_b|} \sum_{x \in X_b} d_\phi(x, c)

        where X_b is the set of points inside the ball with center c.

•   Pros:
    –   Faster than non-parametric methods.
    –   Generalizes to all Bregman divergences, a large class of measures.
•   Cons:
    –   Can only find one dense region.
Bregman Divergences

•   A large class of divergences:
    –   Applicable to a wide variety of data types.
    –   Includes a number of useful distance measures, e.g. KL-divergence,
        Itakura-Saito distance, squared Euclidean distance, Mahalanobis distance.

•   Exploited in Bregman Clustering [Banerjee et al., 2004]:
    –   Exhaustive partitioning into k segments.
    –   A generalization of K-Means to all Bregman divergences.
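For reference, the standard definition (not shown on the slide): for a strictly convex, differentiable function \phi,

    d_\phi(x, y) = \phi(x) - \phi(y) - \langle x - y, \nabla\phi(y) \rangle

Taking \phi(x) = \|x\|^2 recovers the squared Euclidean distance d_\phi(x, y) = \|x - y\|^2.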
Problem Definition: Bregman Bubble Clustering

Find k clusters containing a total of s points that have the lowest total cost:

    Q\bigl(\{C_j\}_{j=1}^{k}, \{c_j\}_{j=1}^{k}\bigr) = \sum_{j=1}^{k} \sum_{x \in C_j} d_\phi(x, c_j)

where \{C_j\} is the set of k clusters with \bigl|\bigcup_{j=1}^{k} C_j\bigr| = s,
\{c_j\} are the k centroids, and d_\phi is a Bregman divergence.
Bregman Bubbles will also be applicable to two important Bregman projections:

•    1-Cosine for document clustering:
     –   Sq. Euclidean distance between points projected onto a sphere.

•    Pearson distance for biological data:
     –   Sq. Euclidean distance between z-scored points (i.e., between points
         projected onto a sphere after subtracting the mean across dimensions).
     –   Equal to 1 - Pearson correlation.
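Concretely (the normalization constants here are assumptions, not spelled out on the slide): if z(x) denotes x after z-scoring across dimensions and scaling to unit length, then

    d_P(x, y) = \tfrac{1}{2}\,\|z(x) - z(y)\|^2 = 1 - \rho(x, y)

since \|u - v\|^2 = 2 - 2\,u \cdot v for unit vectors, and the Pearson correlation \rho(x, y) is exactly the dot product of the z-scored, unit-length versions of x and y.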
Bregmanian Ball vs. Bregman Bubbles

[figure: Bregmanian ball, k=1 vs. Bregman bubbles, k>1]

•   Can show: for fixed centers, the optimal assignment of the s points forms
    Bregman Bubbles.
Finding k Bregman Bubbles

•   Computing the optimal solution is too slow.

•   A simple iterative relocation algorithm, Bregman Bubble Clustering (BBC),
    is possible:

    –   Guaranteed to converge to a local minimum.
    –   Alternately updates the assigned points and the centers of the k
        bubbles.
    –   The mean is the best center at each step, by the Bregman divergence
        property shown in [Banerjee et al., 2004].

A sketch of one iteration follows.
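A minimal sketch of one BBC iteration under these update rules (the squared Euclidean divergence and the function name are assumptions, not from the talk):

    import numpy as np

    def bbc_step(X, centers, s):
        """One BBC iteration: keep the s points closest to any center,
        assign each to its nearest center, and re-estimate every center
        as the mean of its assigned points (the optimal update for any
        Bregman divergence; squared Euclidean used here)."""
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, k) divergences
        nearest = d.argmin(axis=1)                  # closest bubble per point
        cost = d[np.arange(len(X)), nearest]        # divergence to that bubble
        keep = np.argsort(cost)[:s]                 # cluster only the s best points
        for j in range(len(centers)):
            members = keep[nearest[keep] == j]
            if len(members):                        # empty bubbles keep old center
                centers[j] = X[members].mean(axis=0)
        return centers, keep

Iterating bbc_step until the kept set and the assignments stop changing yields the guaranteed local minimum mentioned above.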
Bregman Bubble Clustering demo for k=2, s=10

[figure sequence, five frames: the two bubbles around centroids c1 and c2
relocate, and the average sq. Euclidean distance from the assigned centroid
drops from 1.77 to 0.85, 0.56, 0.45, and finally 0.37]
Outline

• Parametric approach
   –   Bregman Bubble Clustering
   –   Bregman Bubble Soft Clustering
   –   Pressurization & Seeding
   –   Results
• Non-parametric approach
   – Automated Hierarchical Density Shaving
   – Results
• Comparison
• Demo
Bregman Bubble Soft Clustering

•   A probabilistic model: a mixture of k exponential-family distributions
    and one uniform background distribution:

        p(x) = \sum_{j=1}^{k} \alpha_j \, p_\psi(x \mid \theta_j) + \alpha_0 \, p_0(x)

•   An EM-based algorithm alternately updates the mixing weights and the
    distribution parameters.

•   Can be shown to converge to a local minimum.
A 2-d dataset, 5 Gaussians + uniform

Can you spot the 5 Gaussians?

[figure: scatter plot of the dataset]

2-d dataset after Bregman Soft Clustering (scalar variances updated)

[figure: the recovered clusters]
Bregman Bubble Soft Clustering

•   Exploits a bijection between Bregman divergences and exponential-family
    distributions [Banerjee et al., 2004]:

        p_{(\psi,\theta)}(x) = \exp\bigl(-d_\phi(x, \mu)\bigr)\, b_\phi(x)

    where b_\phi is a normalizing function and the family's cumulant \psi is
    the conjugate of the convex function \phi generating the Bregman
    divergence d_\phi.

Examples:

    1. Squared Euclidean distance corresponds to the fixed-variance Gaussian.

    2. KL-divergence corresponds to the multinomial.
Unification

•   Can show that Bregman Bubble Clustering (BBC) is a special case of
    Bregman Bubble Soft Clustering, obtained in the hard-assignment limit.

•   Demonstrates that the BBC algorithm is not ad hoc, but arises from a
    mixture of exponential-family distributions and a uniform distribution.

•   Unifications with previous algorithms:
    –   Bregman Bubble (k=1) → One Class (same cost as OC-IB)
    –   Bregman Bubble (cluster all data) → Bregman Clustering
    –   Soft Bubble (cluster all data) → Bregman Soft Clustering
Outline

• Parametric approach
   –   Bregman Bubble Clustering
   –   Bregman Bubble Soft Clustering
   –   Pressurization & Seeding
   –   Results
• Non-parametric approach
   – Automated Hierarchical Density Shaving
   – Results
• Comparison
• Demo
Bregman Bubble Soft Clustering with Pressurization

•   Problem:
    –    BBC is very sensitive to initialization.
    –    Especially when small clusters are desired: limited mobility
         during the local search.
•   Solution: Pressurization (a sketch of the assumed schedule follows the
    demo).
•   Demo:
Pressurization demo, iterations 1, 2, 3, 9, 10, and 20

[figure sequence: the clustered region starts out covering most of the data
("watch this area") and gradually contracts onto the dense regions as the
iterations proceed]
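The slides do not give Pressurization's schedule explicitly; a plausible sketch of the idea suggested by the demo (the geometric form and the decay parameter gamma are assumptions):

    def pressurization_schedule(n, s, gamma=0.8):
        """Yield the number of points to cluster in each successive BBC
        run, shrinking geometrically from all n points down to the target
        s. Each run is seeded with the previous solution, so far-away
        points still exert influence in the early runs."""
        s_t = n
        while s_t > s:
            yield s_t
            s_t = s + int((s_t - s) * gamma)
        yield s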
Seeding Bregman Bubble Clustering

• Goals:
   – Find k centers to start the BBC local search.
   – Overcome the local-minima problem in BBC.
   – Automatically estimate k.

• Features:
   – Deterministic.
   – Guaranteed cost within a constant factor of optimal for the One Class
     (k=1) case.
Seeding Bregman Bubble Clustering for k=1

•    Input: the n x n distance matrix, and the number of points to cluster, s.
•    Restrict c (the center) to be a data point.
•    Output: the best cluster centroid.
•    Algorithm: sort each row, take the cumulative sum, normalize, pick the
     best (a sketch follows):

     row   closest   2nd closest   3rd closest   ...   sth cl.   ...   nth cl.
     1     0         0.03          0.04          ...   2.69      ...   24.03
     2     0         0.07          0.09          ...   3.03      ...   43.00
     3     0         0.01          0.03          ...   1.04      ...   29.07
     4     0         0.09          0.23          ...   3.53      ...   18.05
     5     0         0.06          0.13          ...   1.07      ...   19.13

The best cluster of size s is the one whose row has the smallest value in the
sth column (row 3 above).
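A minimal sketch of this k=1 seeding step (the function name is hypothetical; D is the n x n distance matrix):

    import numpy as np

    def one_class_seed(D, s):
        """Best k=1 centroid restricted to data points: sort each row,
        average the s smallest entries (cumulative sum, normalized), and
        return the point whose s-neighborhood is tightest."""
        sorted_rows = np.sort(D, axis=1)               # sort each row
        avg_to_s = sorted_rows[:, :s].sum(axis=1) / s  # normalized cumulative sum
        return int(np.argmin(avg_to_s))                # index of the best center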
Seeding Bregman Bubble Clustering for k > 1

•   Goal:
    –   Identify the k dense regions in the data, and the corresponding k
        centroids.
•   Main idea behind the solution:
    –   If we run One Class Bregman Bubble (k=1) n times, starting once from
        each of the n data points as the seed location, the n convergence
        locations would each correspond to one of only k distinct densest
        regions in the data.
    –   These k convergence locations can be thought of as the centers of the
        k densest regions in the data.

•   A faster approximation of the above is possible by:
    –   Restricting the k-centroid search to the n data points.
•   Demo of DGRADE (Density Gradient Enumeration) next.
Density Gradient Enumeration demo

[figure: input 2-d dataset]
Density Gradient Enumeration demo

Step 1: Measure the One Class cost at each point as the average divergence to
its sone closest neighbors (here sone = 5), including the point itself:

[figure: one point labeled with its cost, 0.25]
Density Gradient Enumeration demo

Step 1: Measure the One Class cost at each point (which can also be seen as
inversely proportional to the data density at that point):

[figure: every point labeled with its cost, ranging from 0.09 to 0.7;
color legend: <0.14, 0.15–0.19, 0.20–0.24, 0.25–0.34, >0.34]
Density Gradient Enumeration demo

Step 2: Visit points in order of decreasing density: connect the next densest
point to the densest among its 5 closest neighbors and relabel. Start a new
cluster if the point itself is the densest:

[figure sequence, iterations 1–24: Cluster 1 is started at the densest point
(iteration 1), Cluster 2 at the next unconnected density peak (iteration 2),
and each subsequent point is attached to an existing cluster in
decreasing-density order; after the final iteration, the densest point in each
cluster is returned as a seed]

A code sketch of DGRADE follows.
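A compact sketch of DGRADE as described in Steps 1 and 2 (the squared Euclidean divergence and all names are assumptions; ties in cost are ignored for brevity):

    import numpy as np

    def dgrade(X, s_one):
        """DGRADE seeding: density is the inverse of the average divergence
        to the s_one nearest neighbors (Step 1); points are then visited in
        decreasing density order and attached to the densest of their s_one
        neighbors, with a new cluster started whenever a point is its own
        densest neighbor (Step 2)."""
        D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise divergences
        nbrs = np.argsort(D, axis=1)[:, :s_one]             # s_one nearest, incl. self
        cost = np.take_along_axis(D, nbrs, axis=1).mean(1)  # One Class cost per point
        labels = -np.ones(len(X), dtype=int)
        seeds = []
        for i in np.argsort(cost):              # decreasing density order
            cand = nbrs[i]
            best = cand[np.argmin(cost[cand])]  # densest of the neighborhood
            if best == i:                       # local density peak: new cluster
                labels[i] = len(seeds)
                seeds.append(i)                 # densest point of the cluster = seed
            else:
                labels[i] = labels[best]        # best was already visited
        return labels, seeds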
Density Gradient Enumeration: 2-d Gaussian data

[figure: DGRADE result on the 2-d Gaussian dataset]
Results

Tested on many datasets, using three different distance measures:

[table: datasets and distance measures]
Results on Lee (gene expression, 1-Pearson)

[figure: Overlap Lift comparison]

Results on 40-d Synthetic (Sq. Euclidean)

[figure: Adjusted Rand Index comparison]

Results on 20-Newsgroup (1-Cosine)

[figure: Adjusted Rand Index comparison]

Seeded BBC, Gasch Array (1-Pearson)

[figure: Adjusted Rand Index, Pressurization only vs. Pressurization + Seeding]
Outline

• Parametric approach
   – Bregman Bubble Clustering
   – Bregman Bubble Soft Clustering
   – Seeding
• Non-parametric approach
   – Automated Hierarchical Density Shaving
• Comparison
• Demo
Density Based Clustering Algorithms

•   HMA (Wishart, 1968)

•   DBSCAN (Ester et al., 1996)

•   Density Shaving (DS) and the Auto-HDS framework.

•   Can show a strong connection between the above algorithms.

•   All three use a uniform density kernel (density ∝ no. of points within
    some radius):

[figure: uniform density kernel of a given radius]
Density Shaving (DS)

Two inputs:
1. f_shave: fraction of the data to “shave” (0.38)
2. nε: min. no. of neighbors (3)

Uses a trick to automatically compute the correct ball radius from f_shave
and nε (a reconstruction follows).

[figure: dense points retained after shaving]
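The trick is not spelled out on the slide; a plausible reconstruction from the shaving-fraction semantics (the exact form and the names are assumptions):

    import numpy as np

    def ds_radius(D, f_shave, n_eps):
        """Assumed radius computation for DS: take each point's distance to
        its n_eps-th nearest neighbor, then choose the radius so that exactly
        ceil(n * (1 - f_shave)) points have at least n_eps neighbors within
        it, i.e. survive the shaving as 'dense' points."""
        n = len(D)
        knn_dist = np.sort(D, axis=1)[:, n_eps]      # n_eps-th NN distance (col 0 = self)
        n_dense = int(np.ceil(n * (1.0 - f_shave)))  # points that remain after shaving
        return np.sort(knn_dist)[n_dense - 1]        # smallest radius keeping n_dense points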
Density Shaving (DS)

Performs a graph traversal on the dense points to identify clusters.

[figure: dense points linked into clusters]

The shaved points become “don’t care” points.

[figure: “don’t care” points in the background]
Properties of DS

•    Increasing nε has a smoothing effect on the clustering.

[figure: nε = 5 vs. nε = 50; x = dense points, · = “don’t care” points]
Properties of DS

•   For a fixed nε, successive runs of DS with increasing shaving fraction
    (f_shave) result in a hierarchy of clusters.

[figure: 15%, 38%, 62%, and 85% shaving with nε = 25 on a 2-D Gaussian
example: 1298 pts, 5 Gaussians + uniform background]

•   As the shaving increases, clusters can:
     – split
     – vanish
•   Points in separate clusters never merge into one.
Hierarchical Density Shaving (HDS)

•   Uses geometric/exponential shaving to create the hierarchy from DS:
     – Starting from all the data, a fixed fraction r_shave of the remaining
       data is shaved at each iteration.
•   Clusters that lose points without splitting keep the same id. Example:

[figure: 38% shaving vs. 55% shaving; clusters A and B shrink but keep their
labels]
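Equivalently (implied by the geometric shaving, though not written on the slide), the number of unshaved points after iteration $i$ is

    n_i = \lceil n (1 - r_{\mathrm{shave}})^i \rceil

so the shaving levels thin out exponentially rather than linearly.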
An important trick: Dictionary Row Sort on the HDS Label Matrix

[figure: label matrix before and after sorting]
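A small illustration of what the sort does (the matrix semantics here are an assumption based on the visualization slide that follows): each row of L holds one point's cluster id at each shaving level, and a left-to-right lexicographic row sort makes the members of each cluster lineage contiguous, which is what makes the matrix plottable.

    import numpy as np

    L = np.array([[1, 1, 0],    # rows: points; columns: shaving levels
                  [2, 0, 0],    # 0 = "don't care" at that level
                  [1, 1, 1],
                  [2, 2, 0]])
    order = np.lexsort(L.T[::-1])   # dictionary order, first column primary
    sorted_L = L[order]             # rows of the same lineage now adjacent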
Visualization using the sorted Label Matrix

[figure: sorted label matrix; x-axis: shaving levels A (38%), B (62%),
C (85%); y-axis: sorted rows' index]

• The sorted matrix is plotted.
• Each cluster is plotted in a unique color.
• “Don’t care” points are plotted in the background color.
• Shows the compact, 8-node hierarchy.
Cluster Stability

[figure: spatially relevant projection of the hierarchy over 22 shaving
iterations; shaving iteration ≈ level]

Stability = difference between the first and last level at which a cluster
exists, i.e. ≈ the number of shaving iterations that a cluster survives.
Cluster Selection

• Can show: relative stability is independent of the shaving rate r_shave.
• All clusters can be ranked by stability, even parents and children.
• One way of selecting clusters (a sketch follows):
   – Pick the highest-stability clusters first.
   – Eliminate the parents and children of picked clusters.
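A greedy sketch of this selection rule (the names and the hierarchy encoding are assumptions):

    def select_clusters(stability, parent):
        """Pick clusters in decreasing stability order; once a cluster is
        picked, all of its ancestors and descendants become ineligible."""
        children = {}
        for c, p in parent.items():             # invert the parent map
            children.setdefault(p, []).append(c)
        chosen, dead = [], set()
        for c in sorted(stability, key=stability.get, reverse=True):
            if c in dead:
                continue
            chosen.append(c)
            p = parent.get(c)                   # eliminate ancestors
            while p is not None:
                dead.add(p)
                p = parent.get(p)
            stack = list(children.get(c, []))   # eliminate descendants
            while stack:
                d = stack.pop()
                dead.add(d)
                stack.extend(children.get(d, []))
        return chosen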
HDS + Visualization + Model selection = Auto-HDS

• Auto-HDS:
   – Finds all “modes”/clusters.
   – Finds clusters of varying density simultaneously.

[figure: Auto-HDS (546 pts) vs. DS (546 pts, f_shave = 0.58)]
Results: Gasch Dataset

[figure: clustering of the Gasch dataset]
Results: Gasch Dataset

[figure: clusters labeled by stress condition: H2O2, Menadione, Diauxic
Shift, Heat Shock (three clusters), Nitrogen Depletion, Stationary Phase,
Sorbitol osmotic shock, YPD, and the unstressed reference pool]

One of the Heat Shock clusters, zoomed in:
    "heat shock 17 to 37, 20 minutes"
    "heat shock 21 to 37, 20 minutes"
    "heat shock 25 to 37, 20 minutes"
    "heat shock 29 to 37, 20 minutes"
    "heat shock 33 to 37, 20 minutes"
    DBY7286 37degree heat - 20 min
    DBYyap1- 37degree heat - 20 min (redo)
    DBY7286 + 0.3 mM H2O2 (20 min)
    DBYyap1- + 0.3 mM H2O2 (20 min)
Results: Lee Dataset

[figure: clustering of the Lee dataset]
Outline

• Parametric approach
   – Bregman Bubble Clustering
   – Bregman Bubble Soft Clustering
   – Seeding
• Non-parametric approach
   – Automated Hierarchical Density Shaving
• Comparison
• Demo
BBC vs. Auto-HDS: A qualitative comparison

Features                     | BBC                          | Auto-HDS
Approach                     | Parametric                   | Non-parametric
Scalability                  | Very good                    | Good
Generalization               | All Bregman divergences      | Sq. Euclidean, Pearson, Cosine
Application to very high-d   | Good                         | Very good
data                         |                              |
Special features             | Extendable to co-clustering  | Compact hierarchy,
                             | and online clustering        | visualization, interactive
                             | settings; applicable to a    | clustering, spatially relevant
                             | wider variety of domains     | 2-D projection, more robust
                             |                              | automation
BBC vs. Auto-HDS: Sim-2

[figure: BBC result vs. Auto-HDS result on Sim-2]
BBC vs. Auto-HDS: Adjusted Rand Index

Dataset    | Gasch      | Sim-2
Auto-HDS   | ARI = 0.35 | ARI = 0.70
BBC        | ARI = 0.30 | ARI = 0.74
BBC vs. Auto-HDS: Lee

[figure: comparison on the Lee dataset]
Outline

• Parametric approach
   – Bregman Bubble Clustering
   – Bregman Bubble Soft Clustering
   – Seeding
• Non-parametric approach
   – Automated Hierarchical Density Shaving
• Comparison
• Demo
Gene DIVER

• Gene Density Interactive Visual ExplorER:
   – A scalable implementation of Auto-HDS that streams data from disk
     instead of holding it in main memory.
   – Special features for browsing clusters.
   – Special features for biological data mining.

• Available for download at:
  http://www.ideal.ece.utexas.edu/~gunjan/genediver

Let's see the Gene DIVER demo now…
Main Contributions

•   Simultaneously finding dense clusters and pruning the rest is useful in
    many domains.
•   The parametric method BBC generalizes density-based clustering to a large
    class of problems:
    –   Very scalable to large, high-d data.
    –   Robust with Pressurization and Seeding.

•   Auto-HDS improves upon non-parametric density-based clustering in many
    ways:
    –   Well-suited for very high-d datasets.
    –   A powerful visualization.
    –   Interactive clustering, compact hierarchy.

•   Gene DIVER: a powerful tool for the data mining community, and especially
    for Bioinformatics practitioners.
Future Work

•   BBC
    –   Bregman Bubble Coclustering.
    –   Online Bregman Bubble for capturing localized concept drifts.

•   Auto-HDS
    –   Variable resolution ISOMAP.
    –   Deterministic coclustering.
    –   Extensions to Gene DIVER.
Relevant Papers

•   G. Gupta and J. Ghosh, Bregman Bubble Clustering: A Robust, Scalable
    Framework for Locating Multiple, Dense Regions in Data, ICDM 2006,
    12 pages.
•   G. Gupta, A. Liu and J. Ghosh, Hierarchical Density Shaving: A Clustering
    and Visualization Framework for Large Biological Datasets, ICDM-DMB 2006,
    5 pages.
•   G. Gupta, A. Liu and J. Ghosh, Clustering and Visualization of
    High-Dimensional Biological Datasets Using a Fast HMA Approximation,
    ANNIE 2006, 6 pages.
•   G. Gupta and J. Ghosh, Robust One-Class Clustering Using Hybrid Global
    and Local Search, ICML 2005, pp. 273–280.
•   G. Gupta and J. Ghosh, Bregman Bubble Clustering: A Robust Framework for
    Mining Dense Clusters, under review, JMLR.
•   G. Gupta, A. Liu and J. Ghosh, Automated Hierarchical Density Shaving:
    A Robust, Automated Clustering and Visualization Framework for Large
    Biological Datasets, under review, IEEE Trans. Comp. Bio. Bioinform.
?

Backup Slides from Here




Other papers

•   G. Gupta and J. Ghosh, Detecting Seasonal Trends and Cluster Motion
    Visualization for Very High-Dimensional Transactional Data, SDM 2001.
•   G. Gupta and J. Ghosh, Value Balanced Agglomerative Connectivity
    Clustering, Proc. SPIE Conf. on Data Mining and Knowledge Discovery,
    SPIE 2001.
•   G. Gupta, A. Strehl and J. Ghosh, Distance Based Clustering of
    Association Rules, ANNIE 1999.
Properties of Auto-HDS

• Fast: O(n · nε · log n) using a heap-based implementation.
   – Gene DIVER: a memory-efficient heap-based implementation.
   – Extremely compact hierarchy of clusters.

• Visualization
   – Creates a spatially relevant 2-D projection of points and clusters.
   – Spatially relevant 2-D projection of the compact hierarchy.

• Model selection
   – Can define a notion of stability (analogous to cluster “height”).
   – Based on stability, can select the most stable clusters automatically.
Finding relevant subsets: related work

•   Density-based clustering, e.g. DBSCAN [Ester et al.]:
    –   Pros: good for low-d spatial data.
    –   Cons: not suitable for high-d or non-metric scenarios.

•   Gene Shaving [Hastie et al. 2000]:
    –   Pros: well-suited for gene-expression datasets.
    –   Cons: finds clusters greedily; slow; implicit Sq. Euclidean
        assumptions.

•   PLAID [Lazzeroni et al. 2002]:
    –   Pros: clusters rows and columns simultaneously; good for high-d.
    –   Cons: greedy extraction of clusters as “plaids”; assumes additive
        layers, which does not hold for many datasets.
DGRADE: Selecting the sone parameter

•   sone acts as a smoothing parameter for DGRADE.
•   As sone increases, k declines.

•   Three scenarios for determining sone (a sketch of the second follows):
    –   If k is given, find the smallest sone that results in k clusters.
    –   If not, find the k that occurs over the longest interval of sone
        values (maximum stability).
    –   Or, find the largest k that occurs at least a certain number of times
        (minimum stability).
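A sketch of the maximum-stability rule (the interface is an assumption: dgrade(X, s_one) returns (labels, seeds) as in the earlier sketch):

    from itertools import groupby

    def pick_s_one(X, s_values, dgrade):
        """Run DGRADE over a range of s_one values and return the k that
        stays constant over the longest contiguous run, together with the
        smallest s_one in that run."""
        ks = [len(dgrade(X, s)[1]) for s in s_values]  # k = number of seeds
        best_k, best_len, best_start, i = None, 0, 0, 0
        for k, grp in groupby(ks):
            run = len(list(grp))
            if run > best_len:
                best_k, best_len, best_start = k, run, i
            i += run
        return best_k, s_values[best_start]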
DGRADE: Selecting the sone parameter

•   Example on 2-d data:

[figure: with k = 5 given as input, sone was found to be 57; fully automatic
selection yields k = 4 and sone = 62]
Seeded BBC on 40-d Synthetic (Sq. Euclidean)

[figure: Adjusted Rand Index, using Pressurization only vs.
Pressurization + Seeding]

				