Clustering


Petter Mostad
    Clustering vs. class prediction
   Class prediction:
       A learning set of objects with known classes
       Goal: put new objects into existing classes
       Also called: Supervised learning, or classification
   Clustering:
       No learning set, no given classes
       Goal: discover the ”best” classes or groupings
       Also called: Unsupervised learning, or class discovery
                     Overview

   General clustering theory
       Steps, methods, algorithms, issues...
   Clustering microarray data
       Recommendations for this kind of data
   Programs for clustering
   Some other visualization techniques
           Issues in clustering
   Used to explore and visualize data, with
    few preconceptions
   Many subjective choices must be made, so
    a clustering output tends to be subjective
   It is difficult to get truly statistically
    ”significant” conclusions
   Algorithms will always produce clusters,
    whether any exist in the data or not
           Steps in clustering

1.   Feature selection and extraction
2.   Defining and computing similarities
3.   Clustering or grouping objects
4.   Assessing, presenting, and using the
     result
1. Feature selection and extraction

   Deciding which measurements matter for
    similarity
   Data reduction
   Filtering away objects
   Normalization of measurements
              The data matrix
   Every row contains the measurements for
    one object (rows are objects, columns are
    measurements).
   Similarities are
    computed between all
    pairs of rows
   If measurements are
    of same type, one can
    instead cluster the measurements (columns)!
         2. Defining and computing
                 similarities
   Similarity measures for continuous data
    vectors: $x = (x_1, \ldots, x_n)$, $y = (y_1, \ldots, y_n)$
       Euclidean distance: $\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
       Minkowski distance (including Manhattan
        metric): $\left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
       Mahalanobis distance: $\sqrt{(x - y)^T S^{-1} (x - y)}$, where S is a
        covariance matrix
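   Centered and non-centered (absolute)
    Pearson correlation
        centered: $\frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
        non-centered: $\frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2}}$
     where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$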
   Spearman rank correlation
      Compute the ranking of the numbers in each
       vector
      Find correlation between ranking numbers

   ....
    Geometrical view of clustering
   If measurements are
    coordinates, objects
    become points in some
    space
   If the similarity measure is
    Euclidean distance, the
    goal is to group nearby
    points
   Note: When we have only
    2 or 3 measurements per
    object, we can do better
    than most algorithms
    using visual inspection
     Similarity measures for discrete
                  data
   Comparing two binary vectors, count the
    numbers a,b,c,d of 1-1’s, 1-0’s, 0-1’s, and 0-0’s,
    respectively
   Construct different similarity measurements
    based on these numbers:
       ad          a           a         2(a  b)        ...
     abcd     abcd      abc    2(a  b)  c  d

   Similarity of, for example, trees or other objects
    can also be defined in reasonable ways
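As a small sketch of the counting step, the code below computes a, b, c, d for two binary vectors and two of the resulting measures. The function names (`simple_matching`, `jaccard`) are the usual textbook labels for these ratios, not names used in the slides.

```python
def binary_counts(x, y):
    # a: 1-1 matches, b: 1-0, c: 0-1, d: 0-0 matches
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d

def simple_matching(x, y):
    # (a + d) / (a + b + c + d): 1-1 and 0-0 matches count equally
    a, b, c, d = binary_counts(x, y)
    return (a + d) / (a + b + c + d)

def jaccard(x, y):
    # a / (a + b + c): 0-0 matches are ignored (undefined if both vectors are all zeros)
    a, b, c, d = binary_counts(x, y)
    return a / (a + b + c)
```

Which measure is appropriate depends on whether a shared 0 (shared absence) should count as evidence of similarity.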
      Similarities using contexts

   Mutual Neighbour Distance:
    $MND(x, y) = NN(x, y) + NN(y, x)$
    where NN ( x, y) is the neighbour number of x
     with respect to y
   This is not a metric, but similarities do not
    need to be based on metrics.
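A minimal sketch of the Mutual Neighbour Distance, assuming that NN(x, y) means the rank of x among the neighbours of y (1 for the nearest) and that neighbours are ranked by Euclidean distance. Both the helper names and the index-based interface are hypothetical.

```python
import math

def neighbour_number(points, i, j):
    # Rank of point i among the neighbours of point j (1 = nearest neighbour)
    by_distance = sorted(
        (math.dist(points[k], points[j]), k) for k in range(len(points)) if k != j
    )
    return [k for _, k in by_distance].index(i) + 1

def mnd(points, i, j):
    # Sum of the two neighbour numbers; depends on all other points (the context)
    return neighbour_number(points, i, j) + neighbour_number(points, j, i)
```

For the points 0, 1, 3, 7 on a line, the pair (3, 7) gets MND 4 even though 3 is the nearest neighbour of 7, because 7 is only the third nearest neighbour of 3.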
         3. Clustering or grouping

   Hierarchical clusterings
       Divisive: Starts with one big cluster and
        subdivides one cluster in each step
       Agglomerative: Starts with each object in
        separate cluster. In each step, joins the two
        closest clusters
   Partitional clusterings
   Probabilistic or fuzzy clusterings
            Hierarchical clustering
   Agglomerative clustering depends on type of linkage,
    i.e., how to compute the distance between merged
    cluster (UV) and old cluster (W):
       d(UV, W) = min(d(U, W), d(V,W)) (single linkage)
       d(UV, W) = max(d(U,W), d(V,W)) (complete linkage)
       d(UV, W) = average over all distances between objects in (UV)
        and objects in W (average linkage, or UPGMA: Unweighted Pair
        Group Method with Arithmetic mean)
   The output is a dendrogram
   A simplification of average linkage is often implemented
    (“average group linkage”): It may lead to inverted
    dendrograms!
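The three linkage rules can be sketched as below. This is an illustrative, deliberately slow version: it recomputes cluster distances from the object-level definition at every step instead of updating them, it assumes Euclidean distance between objects, and the function name `agglomerate` is hypothetical.

```python
import math
from itertools import combinations

def agglomerate(points, linkage="average"):
    """Return the merge history as (cluster_u, cluster_v, distance) tuples.

    Clusters are tuples of object indices.
    """
    clusters = [(i,) for i in range(len(points))]

    def d(c1, c2):
        pairwise = [math.dist(points[i], points[j]) for i in c1 for j in c2]
        if linkage == "single":
            return min(pairwise)
        if linkage == "complete":
            return max(pairwise)
        return sum(pairwise) / len(pairwise)  # average linkage (UPGMA)

    merges = []
    while len(clusters) > 1:
        # Join the two closest clusters
        u, v = min(combinations(clusters, 2), key=lambda uv: d(*uv))
        merges.append((u, v, d(u, v)))
        clusters = [c for c in clusters if c not in (u, v)] + [u + v]
    return merges
```

The merge distances in the returned history are the heights at which the corresponding branches would join in a dendrogram.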
     Dendrograms, visualizations
   The data matrix is often visualized using three
    colors, representing positive, negative, and zero
    values.
   Hierarchical clustering results are often represented
    with a dendrogram. The similarity at which
    clusters merge should correspond to height of
    corresponding horizontal line in dendrogram!
   To display the dendrogram, the objects (rows or
    columns) need to be sorted. Each time two
    clusters are merged, this can be done in two ways.
    Ward’s hierarchical clustering
   Agglomerative.
   Goal: minimize ”Error Sum of Squares” (ESS) at
    every step.
       ESS = The sum over all clusters, of the sum of the
        squares of the distances from the objects to the
        cluster centroid.
   When joining two clusters, find the pair that
    results in the smallest increase in ESS.
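One step of this procedure might look as follows. It is a brute-force sketch that recomputes ESS for every candidate merge; the names `ess` and `ward_step` are hypothetical, and clusters are assumed to be lists of coordinate tuples.

```python
def centroid(cluster):
    n = len(cluster)
    return [sum(p[k] for p in cluster) / n for k in range(len(cluster[0]))]

def ess(clusters):
    # Sum over clusters of squared distances from each object to its centroid
    total = 0.0
    for c in clusters:
        m = centroid(c)
        total += sum(sum((p[k] - m[k]) ** 2 for k in range(len(m))) for p in c)
    return total

def ward_step(clusters):
    """Merge the pair of clusters giving the smallest increase in ESS.

    Returns (new_clusters, increase).
    """
    base = ess(clusters)
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            merged = [c for k, c in enumerate(clusters) if k not in (i, j)]
            merged.append(clusters[i] + clusters[j])
            increase = ess(merged) - base
            if best is None or increase < best[1]:
                best = (merged, increase)
    return best
```

Calling `ward_step` repeatedly until one cluster remains yields the full hierarchy.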
            Partitional clusterings
   The number of desired clusters is fixed at the
    start
   K-means clustering:
       Partition into k initial clusters
       Iteratively, reassign points to groups with the closest
        centroid. Recompute centroids.
       Repeat until stability
       The result may depend on initial clusters
       May include a procedure joining or splitting clusters
        according to size
   The choice of number of clusters may not be
    obvious
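A minimal k-means sketch, with one deviation from the description above that should be noted: it starts from k initial centroids supplied by the caller rather than from an initial partition. Euclidean distance and the function name `kmeans` are assumptions.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Return (labels, centroids) after iterating until assignments are stable."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Reassign every point to the group with the closest centroid
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # stability reached
            break
        labels = new_labels
        # Recompute centroids; a cluster that lost all points keeps its old centroid
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Running it from several different starting centroids is a simple way to see how much the result depends on the initial choice.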
    Probabilistic or fuzzy clustering
   The output is, for each object and each cluster,
    a probability or weight that the object belongs to
    the cluster
   Example: The observations are modelled as
    produced by drawing from a number of
    probability densities (often multivariate normal).
    Parameters are then estimated with Maximum
    Likelihood (for example using EM algorithm).
   Example: A ”fuzzy” version of k-means, where
    weights for objects are changed iteratively
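The "fuzzy" k-means idea can be sketched with the standard fuzzy c-means updates. The slides do not say which variant is meant, so the update rules, the fuzziness exponent `m`, and all function names here are assumptions.

```python
import math

def fuzzy_memberships(points, centroids, m=2.0):
    """Weights u[i][j] of point i in cluster j; each row sums to 1."""
    u = []
    for p in points:
        d = [math.dist(p, c) for c in centroids]
        if 0.0 in d:
            # Point sits exactly on a centroid: give it all the weight there
            zeros = d.count(0.0)
            u.append([1.0 / zeros if dj == 0.0 else 0.0 for dj in d])
        else:
            inv = [dj ** (-2.0 / (m - 1.0)) for dj in d]
            u.append([w / sum(inv) for w in inv])
    return u

def fuzzy_centroids(points, u, m=2.0):
    # Each centroid is a mean of all points, weighted by membership ** m
    cents = []
    for j in range(len(u[0])):
        w = [row[j] ** m for row in u]
        cents.append(
            [sum(wi * p[t] for wi, p in zip(w, points)) / sum(w)
             for t in range(len(points[0]))]
        )
    return cents

def fuzzy_cmeans(points, centroids, m=2.0, n_iter=50):
    for _ in range(n_iter):
        u = fuzzy_memberships(points, centroids, m)
        centroids = fuzzy_centroids(points, u, m)
    return fuzzy_memberships(points, centroids, m), centroids
```

Unlike ordinary k-means, every object influences every centroid, only with different weights.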
    Neural networks for clustering
   Neural networks are mathematical models
    made to be similar to actual neural
    networks
   They consist of layers of nodes that send
    out ”signals” based probabilistically on
    input signals
   Most known uses are classifications, i.e.,
    with learning sets
Self-Organising Maps (SOM)
       Clustering as optimization
   Given similarity definition and definition of what
    is an ”optimal” clustering, it can often be a huge
    algorithmic challenge to find the optimum.
   Example: Subdivide many thousand objects into
    50 clusters, minimizing e.g. the sum of the
    squared distances to centroids.
   Then, algorithms for optimization are central.
            Genetic algorithms
   Tries to use ”evolution” to obtain good solutions
    to a problem
   A number of solutions are kept at every step:
    They may then mate or mutate, to produce new
    solutions. The ”fittest” solutions are kept.
   Can be seen as an optimization algorithm
   A great challenge to design ways of mating and
    mutating that produce an efficient algorithm
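A toy genetic algorithm for clustering, as a sketch only. The encoding (one cluster label per object), one-point crossover as "mating", random label changes as "mutation", and the sum of squared distances to centroids as the fitness are all assumptions made for illustration.

```python
import math
import random

def cost(points, labels, k):
    # Sum of squared distances from each object to its cluster centroid
    total = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            centre = [sum(col) / len(members) for col in zip(*members)]
            total += sum(math.dist(p, centre) ** 2 for p in members)
    return total

def evolve(points, k, pop_size=20, generations=50, mutation_rate=0.1, seed=0):
    """Return (best_labels, history of the best cost in each generation)."""
    rng = random.Random(seed)
    n = len(points)
    pop = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda s: cost(points, s, k))
        history.append(cost(points, pop[0], k))
        survivors = pop[: pop_size // 2]  # the "fittest" solutions are kept
        children = []
        while len(survivors) + len(children) < pop_size:
            mum, dad = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)  # mating: one-point crossover
            child = mum[:cut] + dad[cut:]
            # mutation: each label is replaced by a random one with small probability
            child = [rng.randrange(k) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda s: cost(points, s, k)), history
```

This crossover is naive: two parents can describe the same grouping with swapped labels, so their child may be worse than both, which illustrates why designing good mating operators is hard.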
           Simulated annealing
   A general optimization technique
   Iterative: At every step, nearby solutions are
    chosen with probabilities depending on their
    optimality (so even less optimal solutions may
    be chosen)
   As the algorithm proceeds, and the
    ”temperature” sinks, the probability of choosing
    less optimal solutions also sinks.
   Is a good general way to avoid local optima.
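A sketch of simulated annealing applied to clustering. Several details are assumptions not given in the slides: a "nearby" solution changes the label of one object, worse solutions are accepted with probability exp(-increase / temperature), the temperature falls geometrically, and the cost is the sum of squared distances to centroids.

```python
import math
import random

def within_ss(points, labels, k):
    # Sum of squared distances from each object to its cluster centroid
    total = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            centre = [sum(col) / len(members) for col in zip(*members)]
            total += sum(math.dist(p, centre) ** 2 for p in members)
    return total

def anneal(points, k, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Return (best_labels, best_cost) found during the run."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]
    cost = within_ss(points, labels, k)
    best, best_cost = labels[:], cost
    t = t0
    for _ in range(steps):
        cand = labels[:]
        cand[rng.randrange(len(points))] = rng.randrange(k)  # a nearby solution
        cand_cost = within_ss(points, cand, k)
        # Improvements are always accepted; worse moves only with a
        # probability that shrinks as the temperature sinks
        if cand_cost <= cost or rng.random() < math.exp((cost - cand_cost) / t):
            labels, cost = cand, cand_cost
            if cost < best_cost:
                best, best_cost = labels[:], cost
        t *= cooling
    return best, best_cost
```

The starting temperature and cooling rate are tuning choices that need to be matched to the scale of the cost function.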
    4. Assessing and using the result

   Visualization and summarization of the
    clusters
   Note: You should always investigate the
    dependence of your results on the choices
    you have made for the clustering!
       Examples of applications of
              clustering
   Image analysis
   Speech recognition
   Data mining
   ....
       Clustering microarray data
   Samples are columns,
    genes are rows, in data
    matrix
   What values to cluster?
   What is a biologically
    relevant measure of
    similarity?
   One can cluster genes
    and/or samples
       Clustering microarray data
   Use logged data, usually
   Data should be on the same scale (this is usually the case if the data
    is already normalized)
   You may have to filter away genes that show too little variation over
    samples.
   Use an appropriate distance measure for the question you want to
    focus on (Pearson correlation often works OK).
   Use appropriate clustering algorithm (Hierarchical average linkage
    usually works OK).
   If you draw some conclusion from the clustering results, try to vary
    your clustering choices to see how stable these results are.
   Clustering works best as a tool to generate hypotheses and ideas,
    which may then be tested in other ways.
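The preprocessing recommendations above could be sketched like this. It assumes strictly positive expression values (so the logarithm is defined), a base-2 logarithm, sample variance as the measure of variation, and 1 minus the centered Pearson correlation as the distance. The function names and the variance threshold are hypothetical.

```python
import math

def log2_transform(matrix):
    # Rows are genes, columns are samples; values must be positive
    return [[math.log2(v) for v in row] for row in matrix]

def variance(row):
    mean = sum(row) / len(row)
    return sum((v - mean) ** 2 for v in row) / (len(row) - 1)

def filter_genes(matrix, min_variance):
    # Drop genes that show too little variation over the samples
    return [row for row in matrix if variance(row) >= min_variance]

def correlation_distance(x, y):
    # 1 - Pearson correlation; undefined for a row with zero variance
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return 1.0 - num / den
```

Filtering before computing correlations also avoids the division by zero that a completely flat gene would cause.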
Clustering tumor samples
        Clustering to confirm or reject
                 hypotheses?
   A clustering may appear to validate, or be validated by,
    a grouping derived by using other data
   Caution: The many different ways to do a clustering may
    make it possible to tweak it to produce the clusters you
    want
   There is a huge and complex multiple testing problem
   Note that small changes in data can change result
    dramatically
   If you insist on trying to get ”significance”:
       Using permutations of data
       Using resampling of data (bootstrapping)
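Both ideas can be sketched with a few helpers. Everything here is an assumption made for illustration: permuting each measurement independently as the way to destroy structure, a user-supplied `statistic` function for which smaller values mean tighter clusters (for example a within-cluster sum of squares), and the function names themselves.

```python
import random

def permute_columns(data, rng):
    # Shuffle each measurement (column) independently across the objects
    cols = [list(col) for col in zip(*data)]
    for col in cols:
        rng.shuffle(col)
    return [list(row) for row in zip(*cols)]

def bootstrap_rows(data, rng):
    # Resample the objects (rows) with replacement
    return [list(rng.choice(data)) for _ in data]

def permutation_p_value(data, statistic, n_perm=200, seed=0):
    """Share of permuted data sets with a statistic at least as small as observed."""
    rng = random.Random(seed)
    observed = statistic(data)
    hits = sum(
        statistic(permute_columns(data, rng)) <= observed for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```

Re-clustering many `bootstrap_rows` samples and checking how often the same groups reappear gives a rough picture of how stable a result is.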
How to do clustering: Programs
   A good program for clustering and visualization: HCE
       Great visualization options
       Adapted to microarray data
       http://www.cs.umd.edu/hcil/hce/
       Can import similarity matrices
   Classic for microarray data: Cluster & TreeView (Eisen)
   R/BioConductor: package cluster, hclust function,
    heatmap function, ...
   Many other programs/packages
     Other visualization techniques:
         Principal Components
   The principal components can be viewed as the axes of
    a “better” coordinate system for the data.
   “Better” in the sense that the data is maximally spread
    out along the first principal components.
   The principal components correspond to eigenvectors of
    the covariance matrix of the data.
   The eigenvalues represent the part of the total variance
    explained by each of the principal components.
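As an illustration of the eigenvector view, the sketch below finds the first principal component by power iteration on the covariance matrix. Power iteration is a swapped-in technique chosen to keep the example dependency-free; it assumes the starting vector is not orthogonal to the leading eigenvector, and the function names are hypothetical.

```python
import math

def covariance_matrix(data):
    n, dim = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in data]
    return [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]

def first_principal_component(data, n_iter=200):
    """Return (direction, eigenvalue, share of total variance explained)."""
    cov = covariance_matrix(data)
    dim = len(cov)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(n_iter):
        # Multiply by the covariance matrix and renormalize
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)
    )
    total_variance = sum(cov[a][a] for a in range(dim))
    return v, eigenvalue, eigenvalue / total_variance
```

For points lying exactly on a line, the first component points along that line and explains all of the variance.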
Principal component analysis of
        expression data
     Other visualization techniques:
        Multidimensional scaling
   Start with some points in a very high dimension.
   Goal: Display these points in a lower dimension,
    so that distances between them are similar to
    distances in original dimension.
   May also try to preserve only the ranking of the
    pairwise distances.
   Makes it possible to use powerful visual
    inspection, in 2 or 3 dimensions.
   Can sometimes give very convincing pictures
    separating samples in a predicted way.
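One way to make the goal concrete is to define a "stress" that measures how badly the displayed distances match the original ones, and then reduce it by gradient descent. This is a sketch of metric scaling only (not the rank-preserving variant); the raw squared-error stress, the step size, the random start, and the function names are all assumptions.

```python
import math
import random

def stress(dist, coords):
    # Sum of squared differences between original and displayed distances
    n = len(coords)
    return sum(
        (dist[i][j] - math.dist(coords[i], coords[j])) ** 2
        for i in range(n) for j in range(i + 1, n)
    )

def mds(dist, dim=2, steps=500, rate=0.05, seed=0):
    """Place n points in `dim` dimensions, given an n x n distance matrix."""
    rng = random.Random(seed)
    n = len(dist)
    coords = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            grad = [0.0] * dim
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(coords[i], coords[j]) or 1e-12  # avoid dividing by 0
                g = 2.0 * (d - dist[i][j]) / d
                for t in range(dim):
                    grad[t] += g * (coords[i][t] - coords[j][t])
            # Move point i downhill along the gradient of the stress
            coords[i] = [c - rate * gt for c, gt in zip(coords[i], grad)]
    return coords
```

The resulting picture is only determined up to rotation, reflection, and translation, and different random starts may end in different local optima.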

						