Course 341: Introduction to Bioinformatics
2004/2005, 2005/2006, 2006/2007



Answers to Microarray Bioinformatics Tutorial 2
(Review questions on clustering)

1. Describe what is meant by data clustering and how it can be used for the analysis
   of gene expression matrices.
   Lecture 16 slide # 3-5

Data clustering is a method by which a large set of data is grouped into clusters (groups) of
smaller sets of similar data. In the context of gene expression matrices, where rows represent
genes and columns represent measurements of gene expression values for samples under
different conditions, clustering algorithms can be applied to find groups of similar genes,
groups of similar samples, or both:

        – e.g. groups of genes with "similar expression profiles" (co-expressed genes),
        i.e. similar rows in the gene expression matrix

        – or groups of samples (disease cell lines/tissues/toxicants) with "similar effects" on
        gene expression, i.e. similar columns in the gene expression matrix


2. Describe what is meant by a cluster centroid and what is meant by similarity
   metrics.
   Lecture 16 slide # 6-14, 26

The centroid is taken to be a "virtual" representative object for a cluster. Mathematically, it
can be calculated as the point in M-dimensional space whose parameter values are the
means of the parameter values of all the points in the cluster (where M is the number of
features, parameters or dimensions used to describe each object).

                         C(S) = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad X_1, \ldots, X_n \in S


It is a virtual object, since there does not need to be a real object in the cluster with the
calculated values.
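As a concrete illustration, here is a minimal sketch (numpy is an assumed tool choice, not
part of the original answer) computing a centroid as the per-dimension mean of a cluster's
points:

```python
# Minimal sketch: the centroid of a cluster is the mean of its points
# along each of the M dimensions.
import numpy as np

cluster = np.array([[1.0, 1.0],
                    [1.0, 2.0],
                    [2.0, 3.0]])   # three points, M = 2 dimensions

centroid = cluster.mean(axis=0)    # C(S) = (1/n) * sum of the X_i
print(centroid)                    # [1.333... 2.0] -- not itself a member
```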

A similarity metric is a method used for quantifying the similarity between two objects. We
typically represent objects as points in an M-dimensional space. Generally, the distance
between two points is taken as a common metric to assess the similarity between them. The
most commonly used distance metric is the Euclidean metric, which defines the distance
between two points p = (p1, p2, ..., pM) and q = (q1, q2, ..., qM) as:

                                   d(p, q) = \sqrt{\sum_{i=1}^{M} (p_i - q_i)^2}

Other metrics include the Manhattan distance, which is calculated as follows:

                                   d(x, y) = \sum_{i=1}^{M} |x_i - y_i|





        • Make sure to describe the properties of a good similarity metric.

             1. The distance between two profiles must be greater than or equal to zero;
                distances cannot be negative (non-negativity).
             2. The distance between a profile and itself must be zero.
             3. Conversely, if the distance between two profiles is zero, then the profiles
                must be identical.
             4. The distance between profile A and profile B must be the same as the
                distance between profile B and profile A (symmetry).
             5. The distance between profile A and profile C must be less than or equal to
                the sum of the distance between profiles A and B and the distance between
                profiles B and C (the triangle inequality).

        • Make sure to provide formulae for two different similarity metrics that can be
          used in data clustering.

Provided above: the Euclidean and Manhattan metrics.
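As a sketch, the two metrics can be implemented directly from the formulae above (plain
Python; the function names are hypothetical):

```python
# Sketch of the two distance metrics for M-dimensional profiles given as
# equal-length sequences of numbers.
from math import sqrt

def euclidean(p, q):
    # d(p, q) = sqrt( sum_i (p_i - q_i)^2 )
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    # d(p, q) = sum_i |p_i - q_i|
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(euclidean((1, 1), (8, 6)))  # ~8.6, the A-B entry used in question 12
print(manhattan((1, 1), (8, 6)))  # 12
```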


3.   Describe the operation of the hierarchical clustering algorithms
     Lecture 16 slide # 16-24

Hierarchical clustering is a method that successively links objects with similar profiles to form
a tree structure. The standard hierarchical clustering algorithm works as follows:

         Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the
         basic process of hierarchical clustering is this:
             1. Start by assigning each item to its own cluster, so that if you have N items,
                 you now have N clusters, each containing just one item.
             2. Find the closest (most similar) pair of clusters and merge them into a single
                 cluster, so that now you have one less cluster.
             3. Compute distances (similarities) between the new cluster and each of the old
                 clusters.
             4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size
                 N.
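As a sketch, this loop can be written directly in Python (the function name is hypothetical; it
uses single linkage and, for clarity, rescans all cross-cluster distances each round rather than
maintaining an explicit shrinking matrix):

```python
# Naive single-linkage agglomeration following the four steps above.
from math import dist  # Euclidean distance between two points (Python 3.8+)

def single_linkage(points):
    clusters = [[p] for p in points]   # step 1: one cluster per item
    merges = []
    while len(clusters) > 1:
        # step 2: find the closest pair of clusters (the minimum entry of
        # the distance matrix; single linkage = min over member pairs)
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i)]
        i, j = min(pairs, key=lambda ij: min(
            dist(p, q) for p in clusters[ij[0]] for q in clusters[ij[1]]))
        d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
        merges.append((clusters[j], clusters[i], d))
        # step 3: merge, shrinking the matrix from k x k to (k-1) x (k-1)
        clusters[j] = clusters[j] + clusters[i]   # j < i, so indices stay valid
        del clusters[i]
    return merges   # step 4: ends with everything in one cluster

print(single_linkage([(1, 1), (1, 2), (8, 6), (7, 7)]))
```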

        • Make sure you explain what is meant by a similarity matrix.

At each step of the algorithm, we need to compute a similarity matrix (or alternatively a
distance matrix), which represents the similarity (alternatively the distance) between the N
objects being clustered. At each step you use the matrix to find the two elements with
maximum similarity (alternatively minimum distance). The two elements are merged into one
element and the matrix is recalculated. The matrix is thus updated during the operation of
the algorithm by reducing it to a smaller matrix at each step: you start with an NxN matrix,
then an (N-1)x(N-1) matrix, and so on.


                                  0                         
                                  d(2,1)      0             
                                                            
                                  d(3,1) d ( 3,2) 0         
                                                            
                                  :           :     :       
                                 d ( n,1) d ( n,2) ... ... 0
                                                            
        • Make sure you explain what is meant by single linkage, average linkage and
          complete linkage.

Linkage methods refer to how the distance between clusters (groups of objects) is
calculated. Whereas it is straightforward to calculate the distance between two objects, we
have various options when calculating the distance between clusters. These include the
single linkage, average linkage and complete linkage methods.

In Single Linkage we consider the distance between one cluster and another cluster to be
equal to the shortest distance from any member of one cluster to any member of the other
cluster.

In Complete Linkage we consider the distance between one cluster and another cluster to be
equal to the longest distance from any member of one cluster to any member of the other
cluster.

In Average Linkage we consider the distance between one cluster and another cluster to be
equal to the average of the distances between all pairs of members, one taken from each
cluster.
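A short sketch of the three rules (hypothetical helper names; a Euclidean function is inlined
to keep the example self-contained):

```python
# Distance between two clusters under the three linkage rules described
# above; c1 and c2 are lists of points.
from math import sqrt

def euclidean(p, q):
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def cluster_distance(c1, c2, linkage="single", dist=euclidean):
    pairwise = [dist(p, q) for p in c1 for q in c2]
    if linkage == "single":      # shortest member-to-member distance
        return min(pairwise)
    if linkage == "complete":    # longest member-to-member distance
        return max(pairwise)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")

# The AG-to-B update worked in question 12: single keeps 8.1, complete 8.6.
print(round(cluster_distance([(1, 1), (1, 2)], [(8, 6)], "single"), 1))    # 8.1
print(round(cluster_distance([(1, 1), (1, 2)], [(8, 6)], "complete"), 1))  # 8.6
```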

        • Make sure you explain what is meant by a dendrogram.




Dendrograms are used to represent the outputs of hierarchical clustering algorithms. A
dendrogram is a binary tree structure whose leaf elements represent the data elements,
which are joined up the tree based on their similarity. Internal nodes represent sub-clusters of
elements. The root of the tree represents the cluster containing the whole data collection.

The length of each tree branch represents the distance between the clusters it joins.

4. Describe the operation of the k-means clustering algorithm using pseudocode.
   Lecture 16 slide # 26

    Given a set of N items to be grouped into k clusters:
        1. Select an initial partition of k clusters.
        2. Assign each object to the cluster with the closest centroid.
        3. Compute the new centroids of the clusters.
        4. Repeat steps 2 and 3 until no object changes cluster.
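A minimal numpy sketch of this pseudocode (an assumed library choice; it presumes no
cluster ever empties out):

```python
# k-means (Lloyd's algorithm) following the four steps above.
import numpy as np

def kmeans(points, seeds):
    points = np.asarray(points, dtype=float)      # shape (N, M)
    centroids = np.asarray(seeds, dtype=float)    # step 1: initial partition
    steps = 0
    while True:
        steps += 1
        # step 2: assign each object to the cluster with the closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its assigned objects
        new = np.array([points[labels == k].mean(axis=0)
                        for k in range(len(centroids))])
        # step 4: repeat until nothing changes
        if np.allclose(new, centroids):
            return labels, centroids, steps
        centroids = new
```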

5. Compare and contrast the advantages of hierarchical clustering and k-means
   clustering.
   Lecture 16 slide # 34

The table in the slides provides the required comparison from a computational perspective. In
general, hierarchical clustering is more informative, since it provides a more detailed output
showing the similarity between individual items in the data set. However, its space and time
complexity are higher than those of k-means clustering, since you need to start with an NxN
distance matrix; in k-means you don't. Also, the output of k-means may change depending on
the seed clusters, so it can generate different results each time you execute it.

6. Explain briefly the operation of the SOM algorithm and how it relates to k-means
   algorithm.
   Lecture 16 slide # 35




7. Explain briefly what is meant by dimensionality reduction and why it may be
   important in data analysis.
   Lecture 16 slide # 36

8. Explain briefly how both MDS and PCA work and compare them.
   Lecture 16 slide # 37-30

9. What is the main difference between clustering and classification?

   In classification you already know the groups that the data is divided into; this is
   provided by a label (e.g. diseased vs. healthy), and you are trying to find a model in
   terms of the dimensions (d1…dm) that can predict the class. This type of analysis is
   useful for predictive modelling.

   In clustering you are trying to divide the data into groups based on the values of their
   dimensions. You choose these groups so as to maximise the similarity within the
   groups and maximise the distance between them. This type of analysis is useful for
   exploratory analysis.

(Problems)

10. If you use k-means clustering on the data in the table below to group the following
    people by age into 3 groups, how many steps would it take the algorithm to
    converge if you start with centroids defined by Andy, Burt and Claire? How many
    steps would be needed if you start with Andy, Ed and Harry?

                            ID                 Age
                            Andy               1
                            Burt               2
                            Claire             3
                            Dave               11
                            Ed                 12
                            Fred               13
                            George             21
                            Harry              22
                            Ian                23

   a)

   Cluster 1: Initial centroid 1. Assigned items (Andy). New centroid 1.
   Cluster 2: Initial centroid 2. Assigned items (Burt). New centroid 2.
   Cluster 3: Initial centroid 3. Assigned items (Claire, Dave, Ed, Fred, George, Harry, Ian).
   New centroid: 15

   Cluster 1: Initial centroid 1. Assigned items (Andy). New centroid 1.
   Cluster 2: Initial centroid 2. Assigned items (Burt, Claire). New centroid 2.5
   Cluster 3: Initial centroid 15. Assigned items (Dave, Ed, Fred, George, Harry, Ian). New
   centroid: 17

   Cluster 1: Initial centroid 1. Assigned items (Andy). New centroid 1.
   Cluster 2: Initial centroid 2.5. Assigned items (Burt, Claire). New centroid 2.5
   Cluster 3: Initial centroid 17. Assigned items (Dave, Ed, Fred, George, Harry, Ian). New
   centroid: 17

   STOP.
   3 steps

   b)
   Cluster 1: Initial centroid 1, items (Andy, Burt, Claire), final centroid 2.
   Cluster 2: Initial centroid 12, items (Dave, Ed, Fred), final centroid 12.
   Cluster 3: Initial centroid 22, items (George, Harry, Ian), final centroid 22.




   Cluster 1: Initial centroid 2, items (Andy, Burt, Claire), final centroid 2.
   Cluster 2: Initial centroid 12, items (Dave, Ed, Fred), final centroid 12.
   Cluster 3: Initial centroid 22, items (George, Harry, Ian), final centroid 22.

   STOP.
   2 steps, and clearly better results.
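   These two traces can be checked mechanically; the sketch below uses scikit-learn (an
   assumed tool choice; KMeans.n_iter_ reports the number of update rounds, which may
   differ by one from the hand count depending on how convergence is detected):

```python
# Re-run question 10 with both seedings and report iteration counts.
import numpy as np
from sklearn.cluster import KMeans

ages = np.array([1, 2, 3, 11, 12, 13, 21, 22, 23], dtype=float).reshape(-1, 1)

for names, seeds in [("Andy, Burt, Claire", [1, 2, 3]),
                     ("Andy, Ed, Harry", [1, 12, 22])]:
    init = np.array(seeds, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=3, init=init, n_init=1).fit(ages)
    print(names, "->", km.n_iter_, "iterations,",
          "centroids:", km.cluster_centers_.ravel())
```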


11. Use the k-means clustering algorithm to find 3 clusters in the following data set.
    Assume initial cluster centroids defined by A, B and C. Provide a graphical
    representation of the clusters.

                              ID                            Dimension 1        Dimension 2
                              A                             1                  1
                              B                             8                  6
                              C                             20                 3
                              D                             21                 2
                              E                             11                 7
                              F                             7                  7
                              G                             1                  2
                              H                             6                  8



   Cluster 1: Initial centroid (1,1), items (A, G), final centroid (1,1.5).
   Cluster 2: Initial centroid (8,6), items (B, E, F, H), final centroid (8,7).
   Cluster 3: Initial centroid (20,3), items (C, D), final centroid (20.5,2.5).


   Cluster 1: Initial centroid (1,1.5), items (A, G), final centroid (1,1.5).
   Cluster 2: Initial centroid (8,7), items (B, E, F, H), final centroid (8,7).
   Cluster 3: Initial centroid (20.5,2.5), items (C, D), final centroid (20.5,2.5).

   STOP

   [Figure: scatter plot of the eight points, Dimension 1 on the x-axis (0-25) and
   Dimension 2 on the y-axis (0-9). A and G sit at the bottom left, B, E, F and H in the
   upper middle, and C and D at the lower right.]
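   The plot can be reproduced with a few lines of matplotlib (an assumed tool choice),
   colouring each cluster found above separately:

```python
# Scatter plot of the eight points, one colour per k-means cluster.
import matplotlib.pyplot as plt

points = {"A": (1, 1), "B": (8, 6), "C": (20, 3), "D": (21, 2),
          "E": (11, 7), "F": (7, 7), "G": (1, 2), "H": (6, 8)}
clusters = {"cluster 1": ["A", "G"],
            "cluster 2": ["B", "E", "F", "H"],
            "cluster 3": ["C", "D"]}

for label, members in clusters.items():
    xs = [points[m][0] for m in members]
    ys = [points[m][1] for m in members]
    plt.scatter(xs, ys, label=label)
for name, (x, y) in points.items():
    plt.annotate(name, (x, y))
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.legend()
plt.show()
```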




12. Use hierarchical clustering on the data of question 11 using a Euclidean metric
    in the following cases:
        a. Using single Linkage
        b. Using Complete Linkage

           • Make sure to show the values of your distance matrix at each step.

I build a matrix based on distance (not similarity), so at each step I scan for the minimum
value. If I used a similarity matrix, I would have to choose the maximum value instead.

a. Using single Linkage

Note that I only have to calculate the distances once; I will operate only on this matrix from now on.
              A           B           C           D           E           F          G         H
A             X           8.6         19.1        20          11.7        8.5        1         8.6
B             X           X           12.3        13.6        3.2         1.4        8.1       2.8
C             X           X           X           1.4         9.8         13.6       19        14.9
D             X           X           X           X           11.2        14.9       20        16.2
E             X           X           X           X           X           4          11.2      5.1
F             X           X           X           X           X           X          7.8       1.4
G             X           X           X           X           X           X          X         7.8
H             X           X           X           X           X           X          X         X


A and G are the most similar items, so I merge them to get the first link between two
elements. I draw the connection and label the length on the scale bar.
   [Dendrogram so far: A and G joined at distance 1.]
I need to update the matrix: I delete the rows and columns for A and G and insert a new row
and column called AG. The entries for AG need to be calculated. Since I use single linkage, I
keep the minimum value for (AG, B), i.e. min(dist(A,B), dist(G,B)) = min(8.6, 8.1) = 8.1, the
distance from G to B. All other entries that do not involve AG remain the same; only the AG
row holds updated values.

              A-G         B           C           D           E           F          H
A-G           X           8.1         19          20          11.2        7.8        7.8
B             X           X           12.3        13.6        3.2         1.4        2.8
C             X           X           X           1.4         9.8         13.6       14.9
D             X           X           X           X           11.2        14.9       16.2
E             X           X           X           X           X           4          5.1
F             X           X           X           X           X           X          1.4
H             X           X           X           X           X           X          X


I repeat the process. This time I have a choice, since the distance between F and B is 1.4, the
distance between F and H is also 1.4, and so is the distance between C and D. I arbitrarily
choose to link F and B together.
   [Dendrogram so far: A and G joined at 1; F and B joined at 1.4.]

Note: this is a rather large problem to solve by hand, but it is given to show how you can do it.


                       AG      B-F      C       D       E        H
            AG         X       7.8      19      20      11.2     7.8
            B-F        X       X        12.3    13.6    3.2      1.4
            C          X       X        X       1.4     9.8      14.9
            D          X       X        X       X       11.2     16.2
            E          X       X        X       X       X        5.1
            H          X       X        X       X       X        X


I repeat the process and now link BF and H (again an arbitrary choice, since the C-D distance is also 1.4).
   [Dendrogram so far: A-G at 1; F-B at 1.4; H joined to B-F at 1.4.]

                                   AG    BF-H    C       D        E
                       AG          X     7.8     19      20       11.2
                       BF-H        X     X       12.3    13.6     3.2
                       C           X     X       X       1.4      9.8
                       D           X     X       X       X        11.2
                       E           X     X       X       X        X

Now I link C and D

   [Dendrogram so far: A-G at 1; F-B at 1.4; H at 1.4; C and D joined at 1.4.]


                                   AG    BFH     C-D     E
                       AG          X     7.8     19      11.2
                       BFH         X     X       12.3    3.2
                       C-D         X     X       X       9.8
                       E           X     X       X       X



I now link BFH and E





   [Dendrogram so far: A-G at 1; F-B at 1.4; H at 1.4; C-D at 1.4; E joined to B-F-H at 3.2.]

                AG       BFH-E       CD
     AG         X        7.8         19
     BFH-E      X        X           9.8
     CD         X        X           X

                AG-BFHE   CD
     AG-BFHE    X         9.8
     CD         X         X

I now link AG and BFHE (at 7.8), and then the final cluster AGBFHE and CD (at 9.8), giving
the final dendrogram shown below. Compare this to the scatter plot shown in the previous
problem and see if it makes sense.

   [Final single-linkage dendrogram: A-G joined at 1; F-B at 1.4, then H at 1.4, then E at
   3.2; A-G joined to B-F-H-E at 7.8; C-D (joined at 1.4) joined to the rest at 9.8.]


[Figure: the dendrogram generated by the KDE data mining tools.]




b)
For complete linkage, we do the same thing, but when updating the matrix we choose the
maximum distance between clusters rather than the minimum distance.


I still start by choosing A and G, working from the same initial matrix, since they still have
the minimum distance.
           A          B          C           D          E          F           G            H
A          X          8.6        19.1        20         11.7       8.5         1            8.6
B          X          X          12.3        13.6       3.2        1.4         8.1          2.8
C          X          X          X           1.4        9.8        13.6        19           14.9
D          X          X          X           X          11.2       14.9        20           16.2
E          X          X          X           X          X          4           11.2         5.1
F          X          X          X           X          X          X           7.8          1.4
G          X          X          X           X          X          X           X            7.8
H          X          X          X           X          X          X           X            X



Now when updating the matrix, I set the distance between AG and B to be the maximum of
dist(A,B) and dist(G,B), i.e. 8.6 rather than 8.1 as in the previous case.

           A-G        B          C           D          E          F           H
A-G        X          8.6        19.1        20         11.7       8.5         8.6
B          X          X          12.3        13.6       3.2        1.4         2.8
C          X          X          X           1.4        9.8        13.6        14.9
D          X          X          X           X          11.2       14.9        16.2
E          X          X          X           X          X          4           5.1
F          X          X          X           X          X          X           1.4
H          X          X          X           X          X          X           X

I choose to merge B and F since they have the minimum distance.

           AG         B-F        C           D          E          H
AG         X          8.6        19.1        20         11.7       8.6
B-F        X          X          13.6        14.9       4          2.8
C          X          X          X           1.4        9.8        14.9
D          X          X          X           X          11.2       16.2
E          X          X          X           X          X          5.1
H          X          X          X           X          X          X


Next I merge C and D, since they now have the minimum distance (1.4), and so on.

Here is the dendrogram generated by the KDE data mining tool. First compare it to the
single-linkage one above; then generate your own dendrogram and compare it to this one.
[Figure: complete-linkage dendrogram generated by the KDE tool.]
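Both hand-built trees can also be checked against scipy (an assumed tool choice): `linkage`
performs the agglomeration worked through above and `dendrogram` draws the result:

```python
# Single- and complete-linkage dendrograms for the question 11 data.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

labels = ["A", "B", "C", "D", "E", "F", "G", "H"]
points = [(1, 1), (8, 6), (20, 3), (21, 2), (11, 7), (7, 7), (1, 2), (6, 8)]

for method in ("single", "complete"):
    Z = linkage(pdist(points), method=method)  # condensed Euclidean distances
    dendrogram(Z, labels=labels)
    plt.title(f"{method} linkage")
    plt.show()
```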




13. The following table shows the gene expression values for 8 genes under five types
    of cancer. You are interested in discovering the similarity relationship between the
    eight genes.

                             ID    C1   C2     C3    C4     C5
                             A     1    1      1     1      2
                             B     1    2      1     1      1
                             C     14   15     15    15     15
                             D     15   15     15    15     15
                             E     16   16     16    16     16
                             F     6    6      5     6      6
                             G     4    4      4     4      4
                             H     5    5      5     5      5

        a. Using Manhattan distance and a single linkage show the resulting
           dendrogram.

Work out the calculation yourself by hand. When you do it, you will end up with a dendrogram
like the one below.

[Figure: resulting single-linkage dendrogram for the eight genes.]
Note that even though there are more dimensions than in the previous problem (five features
as opposed to only two), you will be dealing with the same size of distance matrix (8x8),
since this is defined by the number of elements being clustered. In general it will be as
tedious to solve as the previous one, but work through it by hand to get a feel for the pattern.
Clearly, as the computation progresses, the matrix gets smaller.
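To check your hand calculation, the same clustering can be run with scipy (an assumed tool
choice); Manhattan distance is the "cityblock" metric in `pdist`:

```python
# Single-linkage clustering of the eight genes, Manhattan distance.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

genes = ["A", "B", "C", "D", "E", "F", "G", "H"]
X = [[1, 1, 1, 1, 2],      # expression of each gene across C1..C5
     [1, 2, 1, 1, 1],
     [14, 15, 15, 15, 15],
     [15, 15, 15, 15, 15],
     [16, 16, 16, 16, 16],
     [6, 6, 5, 6, 6],
     [4, 4, 4, 4, 4],
     [5, 5, 5, 5, 5]]

Z = linkage(pdist(X, metric="cityblock"), method="single")
dendrogram(Z, labels=genes)
plt.show()
```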

        b. How would memory storage requirements change if you use complete
           linkage? If you use average linkage?

In complete linkage the requirements are the same: you just pick values from the initial
distance matrix, but update them differently (keeping the maximum rather than the minimum).

In average linkage you need to calculate the distance between every pair of elements in the
two clusters, so you would need to keep the initial distance matrix, to look up this
information, in addition to the matrix you are updating.

14. Based on the table in question 13, use hierarchical clustering (Manhattan distance
    and single linkage) to study the similarity between the five cancer types (C1..C5).
    How can this form of analysis be useful?

This analysis is useful when you want to study the similarity between diseases (see question
4 in tutorial 1). Here is the distance matrix for this problem; it is easier to calculate because
of the Manhattan distance.

                              C1   C2    C3     C4     C5
                        C1    X    X     X      X      X
                        C2    2    X     X      X      X
                        C3    2    2     X      X      X
                        C4    1    1     1      X      X
                        C5    2    2     2      1      X





There are many different ways to proceed, since there are many 1s: the dendrogram can take
several shapes depending on which diseases you link up, because the distance that
separates them is always 1. Merging C4 with C1 first, for example, gives the following
sequence of updated matrices:



                                         C1-4    C2            C3        C5
                             C1-4        X       X             X         X
                             C2          1       X             X         X
                             C3          1       2             X         X
                             C5          1       2             2         X




                                             C14-2        C3        C5
                                  C14-2      X            X         X
                                  C3         1            X         X
                                  C5         1            2         X


                                                C142-3          C5
                                    C142-3      X               X
                                    C5          1               X

[Figure: final dendrogram; all five cancer types join at distance 1.]
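In the same scipy sketch as before (an assumed tool choice), clustering the cancer types
instead of the genes just means transposing the expression matrix:

```python
# Cluster the columns (cancer types C1..C5) of the question 13 matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.array([[1, 1, 1, 1, 2],
              [1, 2, 1, 1, 1],
              [14, 15, 15, 15, 15],
              [15, 15, 15, 15, 15],
              [16, 16, 16, 16, 16],
              [6, 6, 5, 6, 6],
              [4, 4, 4, 4, 4],
              [5, 5, 5, 5, 5]])

# Rows of X.T are the five conditions; Manhattan metric, single linkage.
Z = linkage(pdist(X.T, metric="cityblock"), method="single")
print(Z)  # every merge happens at distance 1, matching the ties noted above
```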




Moustafa Ghanem                                                               Imperial College London