
Course 341: Introduction to Bioinformatics 2004/2005, 2005/2006, 2006/2007

Course 341: Introduction to Bioinformatics
Answers to Microarray Bioinformatics Tutorial 2 (Review questions on clustering)

1. Describe what is meant by data clustering and how it can be used for the analysis of gene expression matrices.

Lecture 16 slide # 3-5

Clustering is a method by which a large set of data is grouped into clusters (groups) of smaller sets of similar data.

In the context of gene expression matrices, where rows represent genes and columns represent measurements of gene expression values for samples under different conditions, clustering algorithms can be applied to find groups of similar genes, groups of similar samples, or both:
– e.g. groups of genes with “similar expression profiles” (co-expressed genes): similar rows in the gene expression matrix
– or groups of samples (disease cell lines/tissues/toxicants) with “similar effects” on gene expression: similar columns in the gene expression matrix

2. Describe what is meant by a cluster centroid and what is meant by similarity metrics.

Lecture 16 slide # 6-14, 26

The centroid is taken to be a “virtual” representative object for a cluster. Mathematically, it can be calculated as a point in an M-dimensional space whose parameter values are the means of the parameter values of all the points in the cluster (where M is the number of features, parameters or dimensions used for describing each object).

$$C(S) = \sum_{i=1}^{n} X_i / n, \quad X_1, \ldots, X_n \in S$$

It is a virtual object, since there does not need to be a real object in the cluster with the calculated values.

A similarity metric is a method used for quantifying the similarity between two objects. We typically represent objects as points in an M-dimensional space. Generally, the distance between two points is taken as a common metric to assess the similarity between them.
The most commonly used distance metric is the Euclidean metric, which defines the distance between two points p = (p1, p2, ....) and q = (q1, q2, ....) as:

$$d(p, q) = \sqrt{\sum_{i} (p_i - q_i)^2}$$

Other metrics include the Manhattan distance, which is calculated as follows (here p is the number of dimensions):

$$d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$$

Moustafa Ghanem, Imperial College London

Make sure to describe the properties of a good similarity metric.
1. The distance between two profiles must be greater than or equal to zero; distances cannot be negative.
2. The distance between a profile and itself must be zero.
3. Conversely, if the distance between two profiles is zero, then the profiles must be identical.
4. The distance between profile A and profile B must be the same as the distance between profile B and profile A.
5. The distance between profile A and profile C must be less than or equal to the sum of the distance between profiles A and B and the distance between profiles B and C.

Make sure to provide formulae for two different similarity metrics that can be used in data clustering.

Provided above: Euclidean and Manhattan.

3. Describe the operation of the hierarchical clustering algorithms

Lecture 16 slide # 16-24

Hierarchical clustering is a method that successively links objects with similar profiles to form a tree structure. Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the standard hierarchical clustering algorithm works as follows:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
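The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the lecture's implementation: the function name `agglomerate` and the parameter `cluster_distance` are made up for this example, and the sketch recomputes cluster-to-cluster distances from the original item distances instead of maintaining a shrinking matrix. Step 3 leaves open how the distance to a merged cluster is defined; the sketch assumes the minimum over all member pairs unless another rule such as `max` is passed in.

```python
from itertools import combinations

def agglomerate(dist, cluster_distance=min):
    """Merge clusters pairwise until one remains; return the merge history.

    dist: symmetric N x N list of lists of item-to-item distances.
    cluster_distance: rule that turns the item-to-item distances between two
        clusters into one cluster-to-cluster distance (assumption: min).
    """
    # Step 1: every item starts in its own cluster (a tuple of item indices).
    clusters = [(i,) for i in range(len(dist))]
    merges = []

    def between(a, b):
        # Distance between clusters a and b under the chosen rule.
        return cluster_distance(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merges.append((a, b, between(a, b)))
        # Step 3 is implicit: distances to the merged cluster are
        # recomputed by between() on the next pass (step 4).
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Three items on a line at positions 0, 1 and 5.
example = [[0, 1, 5],
           [1, 0, 4],
           [5, 4, 0]]
for left, right, height in agglomerate(example):
    print(left, right, height)
```

Each entry of the returned history records which two clusters were joined and at what distance, which is the information needed to draw the tree.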
Make sure you explain what is meant by a similarity matrix

At each step of the algorithm, we need to compute a similarity matrix (or alternatively a distance matrix), which represents the similarity (alternatively distance) between the N objects being clustered. At each step you use the matrix to find the two elements with maximum similarity (alternatively minimum distance). The two elements are merged into one element and the matrix is recalculated. The matrix is thus updated during the operation of the algorithm by reducing it to a smaller matrix at each step. You start with an NxN matrix, then an (N-1)x(N-1) matrix, and so on.

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \ldots & \ldots & 0 \end{bmatrix}$$

Make sure you explain what is meant by single linkage, average linkage and complete linkage

Linkage methods refer to how the distance between clusters (groups of objects) is calculated. Whereas it is straightforward to calculate the distance between two objects, we have various options when calculating the distance between clusters. These include single linkage, average linkage and complete linkage methods.

In Single Linkage we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster.

In Complete Linkage we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.

In Average Linkage we consider the distance between one cluster and another cluster to be equal to the average of the distances from each member of one cluster to each member of the other cluster.

Make sure you explain what is meant by a dendrogram.

Dendrograms are used to represent the outputs of hierarchical clustering algorithms.
A dendrogram is a binary tree structure whose leaf elements represent the data elements, which are joined up the tree based on their similarity. Internal nodes represent sub-clusters of elements. The root of the tree represents the cluster containing the whole data collection. The length of each tree branch represents the distance between the clusters it joins.

4. Describe the operation of the k-means clustering algorithm using pseudocode.

Lecture 16 slide # 26

Given a set of N items to be grouped into k clusters:
1. Select an initial partition of k clusters.
2. Assign each object to the cluster with the closest centroid.
3. Compute the new centroid of each cluster.
4. Repeat steps 2 and 3 until no object changes cluster.

5. Compare and contrast the advantages of hierarchical clustering and k-means clustering.

Lecture 16 slide # 34

The table in the slides provides the required comparison from a computational perspective. In general, hierarchical clustering is more informative since it provides a more detailed output showing the similarity between individual items in the data set. However, its space and time complexity are higher than those of k-means clustering, since you need to start with an NxN matrix; in k-means you don’t. Also, the output of k-means may change based on the seed clusters, so it can generate different results each time you execute it.

6. Explain briefly the operation of the SOM algorithm and how it relates to the k-means algorithm.

Lecture 16 slide # 35

7. Explain briefly what is meant by dimensionality reduction and why it may be important in data analysis.

Lecture 16 slide # 36

8. Explain briefly how both MDS and PCA work and compare them.

Lecture 16 slide # 37-30

9. What is the main difference between clustering and classification?

In classification you already know the groups that the data is divided into; this is provided by a label (e.g. diseased vs.
healthy), and you are trying to find a model in terms of the dimensions (d1…dm) that can predict the class. This type of analysis is useful for predictive modelling.

In clustering you are trying to divide the data into groups based on the values of their dimensions. You choose these groups so as to maximise the similarity inside the groups and maximise the distance between them. This type of analysis is useful for exploratory analysis.

(Problems)

10. Suppose you use k-means clustering on the data in the table below to group the following people by age into 3 groups. How many steps would it take the algorithm to converge if you start with centroids defined by Andy, Burt and Claire? How many steps would be needed if you start with Andy, Ed and Harry?

ID      Age
Andy    1
Burt    2
Claire  3
Dave    11
Ed      12
Fred    13
George  21
Harry   22
Ian     23

a)
Step 1:
Cluster 1: Initial centroid 1. Assigned items (Andy). New centroid 1.
Cluster 2: Initial centroid 2. Assigned items (Burt). New centroid 2.
Cluster 3: Initial centroid 3. Assigned items (Claire, Dave, Ed, Fred, George, Harry, Ian). New centroid 15.

Step 2:
Cluster 1: Initial centroid 1. Assigned items (Andy). New centroid 1.
Cluster 2: Initial centroid 2. Assigned items (Burt, Claire). New centroid 2.5.
Cluster 3: Initial centroid 15. Assigned items (Dave, Ed, Fred, George, Harry, Ian). New centroid 17.

Step 3:
Cluster 1: Initial centroid 1. Assigned items (Andy). New centroid 1.
Cluster 2: Initial centroid 2.5. Assigned items (Burt, Claire). New centroid 2.5.
Cluster 3: Initial centroid 17. Assigned items (Dave, Ed, Fred, George, Harry, Ian). New centroid 17.

STOP, 3 steps

b)
Step 1:
Cluster 1: Initial centroid 1, items (Andy, Burt, Claire), final centroid 2.
Cluster 2: Initial centroid 12, items (Dave, Ed, Fred), final centroid 12.
Cluster 3: Initial centroid 22, items (George, Harry, Ian), final centroid 22.
Step 2:
Cluster 1: Initial centroid 2, items (Andy, Burt, Claire), final centroid 2.
Cluster 2: Initial centroid 12, items (Dave, Ed, Fred), final centroid 12.
Cluster 3: Initial centroid 22, items (George, Harry, Ian), final centroid 22.

STOP, 2 steps, and clearly better results.

11. Use the k-means clustering algorithm to find 3 clusters on the following data set. Assume the initial cluster centroids are defined by A, B and C. Provide a graphical representation of the clusters.

ID    Dimension 1   Dimension 2
A     1             1
B     8             6
C     20            3
D     21            2
E     11            7
F     7             7
G     1             2
H     6             8

Step 1:
Cluster 1: Initial centroid (1,1), items (A, G), final centroid (1,1.5).
Cluster 2: Initial centroid (8,6), items (B, E, F, H), final centroid (8,7).
Cluster 3: Initial centroid (20,3), items (C, D), final centroid (20.5,2.5).

Step 2:
Cluster 1: Initial centroid (1,1.5), items (A, G), final centroid (1,1.5).
Cluster 2: Initial centroid (8,7), items (B, E, F, H), final centroid (8,7).
Cluster 3: Initial centroid (20.5,2.5), items (C, D), final centroid (20.5,2.5).

STOP

[Figure: scatter plot of points A to H, Dimension 1 on the horizontal axis and Dimension 2 on the vertical axis]

12. Use hierarchical clustering on the data of question 11 using a Euclidean metric in the following cases: [1]
a. Using single linkage
b. Using complete linkage
Make sure to show the values of your distance matrix at each step.

I build a matrix based on distance (not similarity), so at each step I scan for the minimum value. If I used a similarity matrix, I would have to choose the maximum value.

a. Using single linkage

Note that I only have to calculate the point-to-point distances once; I will operate only on this matrix from now on.
      A     B     C     D     E     F     G     H
A     X     8.6   19.1  20    11.7  8.5   1     8.6
B     X     X     12.3  13.6  3.2   1.4   8.1   2.8
C     X     X     X     1.4   9.8   13.6  19    14.9
D     X     X     X     X     11.2  14.9  20    16.2
E     X     X     X     X     X     4     11.2  5.1
F     X     X     X     X     X     X     7.8   1.4
G     X     X     X     X     X     X     X     7.8
H     X     X     X     X     X     X     X     X

A and G are the most similar items, so I merge them to get the first link between two elements. I draw the connection and label the length on the scale bar.

[Figure: partial dendrogram joining A and G at distance 1]

I need to update the matrix. I delete the row and column for A and the row and column for G, and insert a new row and column called A-G. The entries for A-G need to be calculated. Since I use single linkage, I keep the minimum value: for (A-G, B) this is min(dist(A,B), dist(G,B)) = min(8.6, 8.1) = 8.1, the distance from G to B. All other entries that do not involve A-G remain the same. The updated values are those in the A-G row.

      A-G   B     C     D     E     F     H
A-G   X     8.1   19    20    11.2  7.8   7.8
B     X     X     12.3  13.6  3.2   1.4   2.8
C     X     X     X     1.4   9.8   13.6  14.9
D     X     X     X     X     11.2  14.9  16.2
E     X     X     X     X     X     4     5.1
F     X     X     X     X     X     X     1.4
H     X     X     X     X     X     X     X

I repeat the process. This time I have a choice, since the distance between F and B is 1.4, the distance between F and H is also 1.4, and so is the distance between C and D. I arbitrarily choose to link F and B together.

[Figure: partial dendrogram joining A and G, and F and B]

[1] This is a rather large problem to solve by hand, but it is given to show how you can do it.
      AG    B-F   C     D     E     H
AG    X     7.8   19    20    11.2  7.8
B-F   X     X     12.3  13.6  3.2   1.4
C     X     X     X     1.4   9.8   14.9
D     X     X     X     X     11.2  16.2
E     X     X     X     X     X     5.1
H     X     X     X     X     X     X

I repeat, and I now link BF and H.

[Figure: partial dendrogram joining A and G, and F, B and H]

      AG    BF-H  C     D     E
AG    X     7.8   19    20    11.2
BF-H  X     X     12.3  13.6  3.2
C     X     X     X     1.4   9.8
D     X     X     X     X     11.2
E     X     X     X     X     X

Now I link C and D.

[Figure: partial dendrogram joining A and G; F, B and H; and C and D]

      AG    BFH   C-D   E
AG    X     7.8   19    11.2
BFH   X     X     12.3  3.2
C-D   X     X     X     9.8
E     X     X     X     X

I now link BFH and E.

[Figure: partial dendrogram with leaf order A G F B H E C D; scale marks at 1 and 3]

         AG       BFH-E    CD
AG       X        7.8      19
BFH-E    X        X        9.8
CD       X        X        X

         AG-BFHE  CD
AG-BFHE  X        9.8
CD       X        X

I now link AG and BFHE, and then the final clusters AGBFHE and CD, giving me the final dendrogram shown below. Compare this to the scatter plot shown in the previous problem and see if it makes sense.

[Figure: final single-linkage dendrogram with leaf order A G F B H E C D; scale marks at 1, 3, 8 and 10]

Here is the dendrogram generated by the KDE data mining tools.

[Figure: dendrogram generated by the KDE tool, not reproduced]

b) For complete linkage, we do the same thing, but when updating the matrix we choose the maximum distance between clusters rather than the minimum distance. I start with the same matrix and again begin by choosing A and G, since they still have the minimum distance.

      A     B     C     D     E     F     G     H
A     X     8.6   19.1  20    11.7  8.5   1     8.6
B     X     X     12.3  13.6  3.2   1.4   8.1   2.8
C     X     X     X     1.4   9.8   13.6  19    14.9
D     X     X     X     X     11.2  14.9  20    16.2
E     X     X     X     X     X     4     11.2  5.1
F     X     X     X     X     X     X     7.8   1.4
G     X     X     X     X     X     X     X     7.8
H     X     X     X     X     X     X     X     X

Now when updating the matrix, I set the distance between AG and B to be the maximum of dist(A,B) and dist(G,B), i.e.
8.6 rather than 8.1 as in the previous case.

      A-G   B     C     D     E     F     H
A-G   X     8.6   19.1  20    11.7  8.5   8.6
B     X     X     12.3  13.6  3.2   1.4   2.8
C     X     X     X     1.4   9.8   13.6  14.9
D     X     X     X     X     11.2  14.9  16.2
E     X     X     X     X     X     4     5.1
F     X     X     X     X     X     X     1.4
H     X     X     X     X     X     X     X

I choose to merge B and F since they have the minimum distance.

      AG    B-F   C     D     E     H
AG    X     8.6   19.1  20    11.7  8.6
B-F   X     X     13.6  14.9  4     2.8
C     X     X     X     1.4   9.8   14.9
D     X     X     X     X     11.2  16.2
E     X     X     X     X     X     5.1
H     X     X     X     X     X     X

The smallest remaining distance is now 1.4, between C and D, so I merge them next. BF and H follow at 2.8, and so on.

Here is the dendrogram generated by the KDE data mining tool. First compare it to the one above. Then generate your own dendrogram and compare it to the one below.

[Figure: complete-linkage dendrogram generated by the KDE tool, not reproduced]

13. The following table shows the gene expression values for 8 genes under five types of cancer. You are interested in discovering the similarity relationship between the eight genes.

ID    C1    C2    C3    C4    C5
A     1     1     1     1     2
B     1     2     1     1     1
C     14    15    15    15    15
D     15    15    15    15    15
E     16    16    16    16    16
F     6     6     5     6     6
G     4     4     4     4     4
H     5     5     5     5     5

a. Using Manhattan distance and single linkage, show the resulting dendrogram.

Work out the calculation yourself by hand. When you do it, you will end up with a dendrogram looking like the one below.

[Figure: single-linkage dendrogram of the eight genes, not reproduced]

Note that even though there are more dimensions than in the previous problem (five features as opposed to only 2), you will still be dealing with a distance matrix of the same size (8x8), since this is defined by the number of elements being clustered. In general it will be as tedious to solve as the previous one, but work through it by hand to figure out the pattern. Clearly, as the computation progresses the matrix gets smaller.

b. How would memory storage requirements change if you use complete linkage? If you use average linkage?

In complete linkage the requirements are the same: you just pick values from the initial distance matrix, but update them differently.
In average linkage you need to calculate the distance between every pair of elements in both clusters. You would need to keep the initial distance matrix to look up this information, in addition to the one you are updating.

14. Based on the table in question 13, use hierarchical clustering (Manhattan distance and single linkage) to study the similarity between the five cancer types (C1 .. C5). How can this form of analysis be useful?

This analysis is useful when you want to study the similarity between diseases (see question 4 in tutorial 1).

Here is the distance matrix for this problem; it is easier to calculate because of the Manhattan distance.

      C1    C2    C3    C4    C5
C1    X     X     X     X     X
C2    2     X     X     X     X
C3    2     2     X     X     X
C4    1     1     1     X     X
C5    2     2     2     1     X

There are many different ways to proceed since there are lots of 1s. The dendrogram can take many shapes depending on which diseases you link up, since the distance that separates them is always 1. Linking C1 and C4 first, for example, gives the following sequence of matrices:

      C1-4  C2    C3    C5
C1-4  X     X     X     X
C2    1     X     X     X
C3    1     2     X     X
C5    1     2     2     X

       C14-2  C3     C5
C14-2  X      X      X
C3     1      X      X
C5     1      2      X

        C142-3  C5
C142-3  X       X
C5      1       X
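The Manhattan distance matrix between the five cancer types can be checked with a short Python sketch. The variable and function names (`genes`, `profiles`, `manhattan`, `matrix`) are chosen for this illustration only; the data is the gene expression table from question 13, transposed so that each cancer type becomes one profile over the eight genes.

```python
# Expression table from question 13: one row per gene, columns C1..C5.
genes = {
    "A": [1, 1, 1, 1, 2],
    "B": [1, 2, 1, 1, 1],
    "C": [14, 15, 15, 15, 15],
    "D": [15, 15, 15, 15, 15],
    "E": [16, 16, 16, 16, 16],
    "F": [6, 6, 5, 6, 6],
    "G": [4, 4, 4, 4, 4],
    "H": [5, 5, 5, 5, 5],
}

# Transpose: each cancer type becomes a profile of 8 gene values.
profiles = list(zip(*genes.values()))

def manhattan(x, y):
    """Sum of absolute coordinate differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Full symmetric 5 x 5 matrix (the text shows only the lower triangle).
matrix = [[manhattan(p, q) for q in profiles] for p in profiles]

for name, row in zip(["C1", "C2", "C3", "C4", "C5"], matrix):
    print(name, row)
```

The lower triangle of the printed matrix matches the table given for this question: C4 is at distance 1 from every other cancer type, and all remaining pairs are at distance 2.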
