					Clustering
Grouping or partitioning a collection of objects
(patterns) into a small number of clusters based on
similarity.

Objects are represented as points in a high-dimensional
space.


Objects within a cluster are more similar to each other
than they are to objects belonging to different clusters.
[Figure: two scatter plots of points (marked x) in which nearby points form natural groups, illustrating clusters.]
Clustering is useful in many decision-making and
machine-learning situations, including data mining,
document retrieval, and pattern recognition.

Example: Documents may be thought of as points in a high-
dimensional space, where each dimension corresponds to
one possible word. The coordinate of a document along a
dimension is the number of times that word occurs in the
document. (Consider what other representations could be used.)

Clusters of documents in this space often correspond to
groups of documents on the same topic.
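
As a rough illustration of this word-count representation, here is a
minimal Python sketch; the documents and vocabulary below are invented
for the example:

from collections import Counter

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stocks fell as markets closed"]

# One dimension per distinct word across all documents.
vocab = sorted({w for d in docs for w in d.split()})

def to_vector(doc):
    # Coordinate along each dimension = number of times that word occurs.
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

vectors = [to_vector(d) for d in docs]
print(vocab)
print(vectors)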
  Some notation:

• A pattern (or feature vector, observation, point) x is a
  single data item used by the clustering algorithm. It
  typically consists of a vector of d components:
  x = (x1, x2, ... , xd)

• The individual scalar components xi of a pattern x are
  called features or attributes.

• d is the dimensionality of the patterns.
                      Example

State   Average Teacher   Personal Income Automobiles
          Salary ($)       per capita ($)  per capita

AL          32,549              21,442       0.44
AZ          33,350              23,060       0.40
FL          33,889              25,852       0.50
MI          48,238              25,857       0.52
NH          36,029              29,022       0.63
NJ          49,349              33,937       0.52
VT          37,200              24,175       0.49
                         Clustering


      Hierarchical approaches     Partitional approaches


Agglomerative         Divisive            k-means
                       Agglomerative


1- Initialize: Assign each vector to its own cluster.

2- Compute distances between all clusters.

3- Merge the two clusters which are closest to each other.

4- Return to step 2 until there is only one cluster left.
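
A minimal Python sketch of steps 1-4, assuming Euclidean distance
between patterns and single-linkage distance between clusters (the
data points are made up; linkage choices are discussed below):

import math

patterns = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.1, 4.8), (9.0, 1.0)]

def cluster_distance(c1, c2):
    # Single-linkage distance: smallest Euclidean distance over all pattern pairs.
    return min(math.dist(p, q) for p in c1 for q in c2)

# Step 1: assign each vector to its own cluster.
clusters = [[p] for p in patterns]
while len(clusters) > 1:
    # Step 2: compute distances between all cluster pairs.
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    # Step 3: merge the two closest clusters.
    i, j = min(pairs, key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
    merged = clusters[i] + clusters[j]
    print("merging", clusters[i], "and", clusters[j])
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    # Step 4: repeat until only one cluster is left.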
Example
                     Agglomerative

The clustering tree (dendrogram) can be used to interpret the
structure of the data and to determine the number of
clusters.

The dendrogram does not provide a unique clustering. A
partitioning can be achieved by cutting the dendrogram at
certain level(s).
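
The slides use SPSS; as a sketch of the same idea in Python, SciPy's
hierarchical-clustering routines build the dendrogram and cut it at a
chosen height (the data below are made up):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])

Z = linkage(X, method="single")                     # agglomerative clustering tree
labels = fcluster(Z, t=2.0, criterion="distance")   # cut the dendrogram at height 2.0
print(labels)                                       # cluster label for each pattern
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree with matplotlib.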
                      Agglomerative

Ways of merging two clusters under different distance
definitions:

1- Single linkage method: merge the two clusters for which the
minimum of the distances between all pairs of patterns drawn
from the two clusters (one pattern from the first cluster, the
other from the second) is smallest among all cluster pairs.

2- Complete linkage method: merge the two clusters for which the
maximum of the distances between all pairs of patterns drawn
from the two clusters (one pattern from the first cluster, the
other from the second) is smallest among all cluster pairs.
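
A small Python sketch of the two definitions, assuming Euclidean
distance between patterns (the example clusters are made up):

import math

def single_linkage(c1, c2):
    # Minimum over all pattern pairs (one pattern from each cluster).
    return min(math.dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    # Maximum over all pattern pairs (one pattern from each cluster).
    return max(math.dist(p, q) for p in c1 for q in c2)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(a, b))    # 2.0
print(complete_linkage(a, b))  # 5.0

In both methods, the cluster pair chosen for merging is the one whose
linkage distance is smallest.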
Solution with SPSS - Ex. 1
         Solution with SPSS - Ex. 2


[Figure: scatter plot of the SPSS clustering solution for Example 2.]
                 Solution with SPSS - Ex. 3

State   Average Teacher    Personal Income Automobiles
          Salary ($)        per capita ($)  per capita

AL          32,549              21,442        0.44
AZ          33,350              23,060        0.40
FL          33,889              25,852        0.50
MI          48,238              25,857        0.52
NH          36,029              29,022        0.63
NJ          49,349              33,937        0.52
VT          37,200              24,175        0.49
                          k-means

1- Choose the number of clusters k.

2- Initialize the cluster centers (e.g., k randomly chosen
   patterns sufficiently far away from each other).

3- Assign each pattern to the closest cluster center.

4- Update the cluster centers using the current cluster
   memberships.

5- If the algorithm has converged, stop; otherwise go
   to step 3.
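
A minimal NumPy sketch of steps 1-5 (the data set, k, and stopping
rule are chosen just for illustration; empty clusters are not handled):

import numpy as np

rng = np.random.default_rng(0)
# Made-up two-dimensional data drawn around three group centers.
X = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in ((0, 0), (5, 5), (9, 1))])

k = 3                                                     # step 1: choose k
centers = X[rng.choice(len(X), size=k, replace=False)]    # step 2: initialize centers

for _ in range(100):
    # Step 3: assign each pattern to the closest cluster center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: update each center as the mean of its current members.
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Step 5: stop once the centers no longer move; otherwise reassign.
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(centers)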
[Figure: scatter plot of a two-dimensional example data set for k-means.]
                       k-means

The k-means algorithm is popular because it is easy to
implement. A major drawback is that it is sensitive to the
choice of the initial cluster centers.
                             k-means


[Figure: seven patterns A-G clustered with k-means under two different
choices of initial cluster centers.
Left: patterns A, B and C chosen as initial cluster centers.
Right: patterns A, D and F chosen as initial cluster centers.]
                k-means

        Solution with SPSS - Ex. 1


[Figure: scatter plot of the k-means solution for Example 1.]
                k-means

         Solution with SPSS - Ex. 2


[Figure: scatter plot of the k-means solution for Example 2.]
                   Multidimensional Scaling
                           (MDS)

The purpose of MDS is to provide a visual representation of
the pattern of proximities (i.e., similarities or dissimilarities)
among a set of objects.

For example, given a matrix of perceived similarities
between various brands of cars, MDS plots the brands on a
map such that

- those brands that are perceived to be very similar to each
other are placed near each other on the map,

- those brands that are perceived to be very different from
each other are placed far away from each other on the map.
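
As an illustrative sketch in Python (the brand names and the
dissimilarity matrix below are invented), scikit-learn's MDS produces
such a map from a matrix of pairwise dissimilarities:

import numpy as np
from sklearn.manifold import MDS

brands = ["brand A", "brand B", "brand C", "brand D"]
# Symmetric matrix of perceived dissimilarities (0 = identical).
D = np.array([[0.0, 0.2, 0.9, 0.8],
              [0.2, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.3],
              [0.8, 0.9, 0.3, 0.0]])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)   # one (x, y) map position per brand
for name, (x, y) in zip(brands, coords):
    print(name, round(x, 2), round(y, 2))

With this matrix, brands A and B should land near each other on the map
and far from C and D.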
Example
                  Multidimensional Scaling
                          (MDS)

Normally, MDS is used to provide a visual representation of a
complex set of relationships in two or three dimensions,
preserving the relationships (similarities or dissimilarities) as
much as possible.

You can represent n points in an n-1 dimensional space exactly.

For example, two points can be located on a line by preserving
the distance between them perfectly.

However, you cannot configure three points, say 1 unit apart
from each other, in a one-dimensional space.
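
A quick numerical check of this point (the coordinates are just the
standard equilateral-triangle construction):

import math

p1 = (0.0, 0.0)
p2 = (1.0, 0.0)
p3 = (0.5, math.sqrt(3) / 2)   # apex of an equilateral triangle with unit sides

# All three pairwise distances are (up to rounding) 1.0 -- but this needs 2 dimensions.
print(math.dist(p1, p2), math.dist(p1, p3), math.dist(p2, p3))
# On a line, fixing p1 = 0 and p2 = 1 forces p3 to -1 or 2, so one distance becomes 2.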
Multidimensional Scaling
        (MDS)
Solution with SPSS - Ex. 1
                 Multidimensional Scaling
                         (MDS)
                 Solution with SPSS - Ex. 2

State   Average Teacher     Personal Income Automobiles
          Salary ($)         per capita ($)  per capita

AL          32,549               21,442        0.44
AZ          33,350               23,060        0.40
FL          33,889               25,852        0.50
MI          48,238               25,857        0.52
NH          36,029               29,022        0.63
NJ          49,349               33,937        0.52
VT          37,200               24,175        0.49

				