CLUSTERING

   EE 7000-1
Class Presentation

 Clustering basics and types
 K-means, a type of unsupervised clustering
 Supervised clustering types
      Vector quantization
      Fuzzy identification
      Artificial neural networks
      Fuzzy-neuro systems
              What is clustering?
 A technique that helps to extract more out of data

 Clustering involves grouping data points together
  according to some measure of similarity

 Clustering of data is a method by which large sets
  of data are grouped into clusters of smaller sets of
  similar data
               The Uses of Clustering

 Engineering sciences such as pattern recognition and artificial
  intelligence have long used the concepts of cluster analysis.
 In the life sciences (biology, botany, zoology, entomology, cytology,
  microbiology), the objects of analysis are life forms such as plants,
  animals, and insects. The clustering analysis may range from developing
  complete taxonomies to classifying species into subspecies, which can
  in turn be subdivided further.
 Clustering analysis is also widely used in the information, policy, and
  decision sciences. Its applications include analyzing votes on political
  issues and surveys of markets, products, sales programs, and R & D.
         A Clustering Example

[Figure: four clusters of customer records, grouped by attributes such as
 income (high, medium, low), number of children (e.g. 2), and car type
 (sedan, truck).]
                     Clustering in FDI (fault detection and identification)?

 Basically used to cluster (and thereby identify) data as faulty or not
 Also to distinguish different fault conditions
 Data from the system → processing (creating residues,
  Fourier transform….) → clustering algorithm to identify the
  different conditions of the data
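A minimal Python sketch of this pipeline (the signals, model output, and condition centers below are all assumed for illustration, not taken from a real system): residues are formed by subtracting the model's prediction from the measurement, and each residue is then assigned to the nearest assumed condition center.

```python
# Illustrative FDI sketch: residues near zero suggest normal operation,
# large residues suggest a fault; clustering separates the two conditions.
measured  = [1.0, 1.1, 0.9, 5.2, 5.0, 1.0]   # assumed sensor readings
predicted = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]   # assumed model output (no fault)

residues = [m - p for m, p in zip(measured, predicted)]

# One-dimensional "clustering": assign each residue to the nearer of two
# assumed condition centers (0.0 = normal, 4.0 = faulty).
centers = {"normal": 0.0, "faulty": 4.0}
labels = [min(centers, key=lambda c: abs(r - centers[c])) for r in residues]
```

In practice the raw residues would first be transformed (e.g. by a Fourier transform, as the slide notes) before a multi-dimensional clustering algorithm is applied.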
            Properties of clustering

 Hierarchical: multiple steps, fusing data until the desired
  number of clusters is reached
 Flat: all clusters have the same status; none is nested in another
 Non-hierarchical or iterative: assume a number of clusters, then
  assign instances to them
 Hard: each instance belongs to exactly one cluster
 Soft: each instance is assigned a probability of belonging to every cluster
 Disjunctive: instances can be part of more than one cluster
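The hard/soft distinction can be made concrete with a small Python sketch (the two centers and the inverse-distance membership scheme are assumed purely for illustration):

```python
import math

centers = [(0.0, 0.0), (4.0, 0.0)]   # two assumed cluster centers
point = (1.0, 0.0)

# Hard assignment: the instance belongs only to the single nearest cluster.
d = [math.dist(point, c) for c in centers]
hard = d.index(min(d))               # index of the winning cluster

# Soft assignment: a membership probability for every cluster, here taken
# inversely proportional to distance and normalized to sum to 1
# (assumes the point does not sit exactly on a center).
w = [1.0 / x for x in d]
soft = [x / sum(w) for x in w]
```

With these numbers the hard assignment is cluster 0, while the soft assignment gives memberships 0.75 and 0.25, so the instance belongs to every cluster with some probability.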
                           Properties of Clustering

[Figure: four example groupings of the instances a–k:
 (a) hard, non-hierarchical
 (b) non-hierarchical, disjunctive
 (c) soft, non-hierarchical, disjunctive, with membership probabilities:

          Cluster 1   Cluster 2   Cluster 3
      a      0.4         0.1         0.5
      b      0.1         0.8         0.1
      c      0.3         0.3         0.4

 (d) hierarchical, hard, non-disjunctive (tree over g, a, c, i, e, d, k,
     b, j, f, h)]
                     Types of Clustering

 Supervised clustering: the task is to learn to assign instances to
  pre-defined classes (classification).
  Example: cluster the data, given the classes blue, red, and yellow

 Unsupervised clustering: the task is to learn a classification
  from the data itself; it discovers the natural grouping.
  Example: cluster the data, given the number of clusters = 3
                K-means algorithm
             ( a type of unsupervised clustering )

 Specify k, the number of clusters
 Choose k points randomly as cluster centers
 Assign each instance to its closest cluster center using
  Euclidean distance
 Calculate the mean of each cluster and use it as the
  new cluster center
 Reassign all instances to the closest cluster center
 Iterate until the cluster centers no longer change
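The steps above can be sketched in plain Python (a minimal illustrative implementation; the function name and sample points are assumptions, not from the slides):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch for lists of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # choose k points randomly
    for _ in range(iters):
        # assign each instance to its closest center (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # recompute each center as the mean of its cluster
        new_centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:             # stop when centers stop moving
            break
        centers = new_centers
    return centers, clusters

# Two well-separated groups; k-means should recover them.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
```

Note that the result depends on the random initial centers; with poorly chosen starting points the algorithm can converge to a worse local optimum, which is why practical implementations restart from several random initializations.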
1. Select the k cluster centers randomly.

2. Classify the entire training set: for each pattern X_i in the
   training set, find the nearest cluster center C and classify X_i
   as a member of C.

3. For each cluster, recompute its center as the mean of the cluster:

       M_k = (1 / N_k) * sum_{j=1}^{N_k} X_jk

   where M_k is the new mean, N_k is the number of training patterns
   in cluster k, and X_jk is the j-th pattern belonging to cluster k.

4. Loop over steps 2 and 3 until the change in cluster means is less
   than the amount specified by the user.

5. Store the k cluster centers.
Initial K cluster centers, calculation of centers in first iteration
Changed cluster centers after first iteration
Change in clusters during second iteration
Final positions of cluster centers
Supervised Clustering
                    Vector Quantization

 Originated from Shannon’s coding theory
 Instead of continuous levels, quantize the codes
 The quantized levels are called codewords; the collection of them is
  the codebook
 For transmission of codes, approximate each code by its nearest
  codeword (Euclidean distance)
 Divide the space containing the codewords by the perpendicular
  bisectors of the lines joining pairs of codewords
 The neighboring region of a codeword is called its Voronoi region
 Basically a mapping of k-dimensional vectors in the vector space R^k
  into a finite set of vectors
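Nearest-codeword encoding can be sketched as follows (the 2-D codebook is an assumed toy example): every vector is mapped to the codeword whose Voronoi region contains it.

```python
import math

# Assumed toy codebook of 2-D codewords.
codebook = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)]

def quantize(vector, codebook):
    """Map a vector to its nearest codeword (Euclidean distance).
    All vectors mapped to a given codeword form its Voronoi region."""
    return min(codebook, key=lambda cw: math.dist(vector, cw))

encoded = quantize((0.9, 3.2), codebook)   # lies in the Voronoi region of (0.0, 4.0)
```

For transmission, only the index of the chosen codeword needs to be sent, which is what makes quantization a compression technique.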
Voronoi region formation illustration