
Comp290-090 Final Project: Mining Patterns in Performance Data

Abstract

In this project, we tried several methods to mine patterns in performance data. The ultimate goal of this work is to use these patterns in automated performance optimization, which could greatly help researchers tune their applications for better performance. However, this work has just started; currently we do not know whether the performance data really contains such patterns, or how we would use them in future work. What we did in this project is to try as many data mining methods as possible on performance data and look for useful patterns. The data we used are I/O traces from NCSA at UIUC. All our experiments are based on these data, so for now we can only mine I/O performance data, but the methods proposed here should not be limited to I/O operations. I performed four experiments in this project: node clustering, system clustering, node clustering similarity, and phase clustering. Each is explained in detail in the following sections.

1. Node Clustering

1.1 Motivation

Nowadays, the scale of high performance computers has grown tremendously: hundreds or even thousands of nodes often work together to execute a large scale application. With this computing power, we can now solve, in reasonable time, huge problems that were not solvable before. However, increasing the computing power alone is not sufficient to obtain high performance; we also need to tune the nodes to work efficiently with each other. Clearly, measuring the performance of each node during the application's execution is the first step of this tuning work; we can then schedule tasks dynamically based on this information. However, monitoring performance introduces overhead into the application's execution, and sometimes this overhead is far from negligible, especially when there are many nodes to monitor.
What we are pursuing is a shorter wall-clock application runtime, so we obviously do not want our efforts toward this goal to ironically degrade overall performance. There are several ways to reduce this overhead. First, we can optimize the monitoring process itself to decrease its runtime. A second, probably more effective, method is not to monitor all the nodes at all: instead, we use statistical sampling to monitor only a subset of the node population. It has been demonstrated that this is indeed an effective way to reduce the overhead. The limitation of this method is that it can only be used on a homogeneous system in which every node does basically the same work. To apply statistical sampling to a heterogeneous system, we propose a method that uses data mining techniques to partition the node population into several groups based on their behaviors; we can then sample within each group. The detailed method is introduced in the following section.

1.2 Data

The data we used come from NCSA (the National Center for Supercomputing Applications) at UIUC. They are I/O traces of many large scale applications on many different kinds of systems. The data are recorded in the SDDF format, which is self-describing, like XML. Since our work is done on these I/O data, we only consider the performance of I/O operations, and the clustering is based on the I/O performance of each node. Our method is not limited to I/O, however: when other kinds of data come into play, such as CPU and memory performance, we can take them into consideration directly with little modification.

1.3 Detailed method of node clustering

1.3.1 Analyzing the data file

The I/O traces record every basic I/O operation on every node. Each entry records the operation name, the timestamp, the duration, the node number, etc.
Among all the I/O operations, we are most interested in these basic ones: "Open", "Flush", "Close", "Read", "Seek", "Write", and "Dump Cost". I wrote a Java program to parse the data file and accumulate per-node information. For each node, I summed the durations of every kind of operation separately. Finally, each node is represented by a vector of length 7 in which each value is the total duration of one basic I/O operation on that node. For example, suppose the Java program analyzed the operations of node 1 and produced the following vector:

(49.0597, 0.1659, 0.8177, 6.5939, 2.5639, 5.1553, 1.2977)

Then we know that the total duration of "Open" on node 1 is 49.0597 and the total duration of "Flush" is 0.1659.

1.3.2 Clustering method: kmeans

In the 7-dimensional space of I/O operations, each node is a point, and points that are near each other correspond to nodes that do relatively the same work. So we can classify the nodes into groups by clustering the points in this 7-dimensional space. The clustering method I used is kmeans, which I ran on the applications that have 128 nodes. One question remains: what should the value of k be, in other words, how many clusters do we want? Since kmeans is an unsupervised method, we have no prior knowledge of how many clusters there are. To address this, I ran kmeans many times with k values from 2 to 30. In each run, I calculated the total deviation of every node from the centroid of its cluster. Obviously, the total deviation should decrease as k increases, and it eventually reaches zero when the number of clusters equals the number of nodes, in which case each cluster contains only one node. But we can look at the trend of the deviation curve and get a general feeling for what a good value of k might be. I tried the following four configurations, all of which have 128 nodes.
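The accumulation step described above can be sketched as follows. This is a Python illustration of the same logic (the original program was written in Java), and the `(node, op_name, duration)` tuple format for parsed trace entries is an assumption made for this sketch:

```python
from collections import defaultdict

# The seven basic operations, in the order used for the per-node vectors.
OPS = ["Open", "Flush", "Close", "Read", "Seek", "Write", "Dump Cost"]

def node_vectors(records):
    """records: iterable of (node, op_name, duration) tuples parsed from the trace.
    Returns {node: 7-element list of total durations, one slot per operation}."""
    index = {op: i for i, op in enumerate(OPS)}
    totals = defaultdict(lambda: [0.0] * len(OPS))
    for node, op, duration in records:
        if op in index:  # ignore operations outside the seven basic ones
            totals[node][index[op]] += duration
    return dict(totals)
```

Each resulting 7-element list is one point in the clustering space described next.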
Application    Machine Type  File System  Number of Nodes  Disk Config  OpSys
CSAR           Origin2000    XFS          128              Raid-1       Irix6.5
VTF            IBM SP2       GPFS         128              Raid         AIX
ENZO           Origin2000    XFS          128              Raid-1       Irix6.5
HARTREE-FOCK   Paragon       PFS          128              -            OSF/1 R1.2

The deviation curves of the above four configurations are shown below.

[Figure: deviation versus k for CSAR128IRIX, VTF128AIX, ENZO128IRIX, and HARTREE-FOCK]

From these figures, we can see that when k is small, the deviation decreases dramatically as k increases, and when k reaches 10 the deviation curves begin to flatten; increasing the number of clusters further gives little additional benefit. So 10 is a good choice for the value of k. Running kmeans again with k equal to 10 gives a group label for each node. A sample output for CSAR128IRIX is shown below.

Rows 1 through 19
 1  8 10  7  8  9  7  6  8  1  8  7  4  8  8  3 10  4  5
Rows 20 through 38
 3  5  4  4  2  2  8  6  5  6  2  8  3  3  5  2  2  6  4
…
Rows 115 through 128
 6  4  2  6  9  3  7  8  2  9  8  8  4  5

1.4 Decreasing the dimensionality by PCA

Each node is represented by a 7-dimensional vector, so the clustering process computes distances in a 7-dimensional space, which is fine in this experiment. But when there are even more dimensions, or thousands or millions of points to cluster, the clustering method may become very slow. Furthermore, it is very hard for human beings to imagine how points are grouped in such a high dimensional space. To address both problems, we can preprocess the data with PCA to reduce the dimensionality. I wrote a Matlab program to calculate the eigenvectors and eigenvalues of a matrix (one can also use the princomp method provided by Matlab). The matrix in this experiment has 128 rows and 7 columns. The result consists of two matrices, the eigenvectors and the eigenvalues.
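The elbow search described above (run kmeans over a range of k, record the total deviation, look for where the curve flattens) can be sketched in Python. This is a minimal Lloyd's-algorithm illustration, not the actual Matlab kmeans used in the experiments:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    """Componentwise mean of a non-empty list of vectors."""
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))

def kmeans(points, k, iters=100, seed=0):
    """Minimal Lloyd's algorithm: returns (centroids, total squared deviation)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from the data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign each point to its nearest centroid
            clusters[min(range(k), key=lambda j: dist2(p, centroids[j]))].append(p)
        centroids = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
    deviation = sum(min(dist2(p, c) for c in centroids) for p in points)
    return centroids, deviation

# Elbow scan as in the text: deviations = {k: kmeans(vectors, k)[1] for k in range(2, 31)},
# then look for the k where the curve flattens (about 10 for these traces).
```

The deviation returned here is the same quantity plotted in the curves above: the sum of squared distances of every node from the centroid of its cluster.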
P =
  -0.0061  -0.0106  -0.0012   0.0289   0.0404  -0.4029   0.9138
  -0.2683  -0.9513  -0.1176  -0.0757  -0.0453   0.0382   0.0083
   0.9544  -0.2499  -0.1317  -0.0160   0.0371   0.0791   0.0370
  -0.0043   0.0408   0.1898  -0.0334  -0.6598   0.6516   0.3182
  -0.1155   0.1687  -0.8917  -0.2328   0.1086   0.2841   0.1278
  -0.0513  -0.0340   0.2965   0.0124   0.7401   0.5610   0.2139
  -0.0323  -0.0358  -0.2230   0.9684  -0.0103   0.0999   0.0130

d = diag(0.0078, 0.0143, 0.0570, 0.0749, 0.1652, 4.1204, 17.4797)

There are only two significant eigenvalues, so we can restrict our attention to the space spanned by their corresponding eigenvectors. This new space has only two dimensions, and we can project all the points into it to see how they are located relative to each other: we compute the dot product of each node's vector with the two eigenvectors, one serving as the x axis and the other as the y axis, and then draw all the nodes in a two-dimensional plane. There are no very obvious clusters in the resulting graph: although the nodes lie along two major lines, it is not clear how to partition them within those lines. So even though we have clustered the nodes into 10 groups, this does not mean there are 10 obvious clusters in the space. This is a property of the data itself; the methods introduced here should still apply in general.

1.5 Summary

Based on the above method, we have classified the 128 nodes into 10 different groups. When it comes to monitoring the whole heterogeneous system, we can use statistical sampling to choose only some representative nodes from each group. This will greatly reduce the overhead of performance monitoring.

2. System Clustering

2.1 Motivation

Currently, many vendors provide the hardware and software components used to build a large scale computer, so people have many choices for the types of components that go into their machines.
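The eigen-decomposition and 2-D projection can also be illustrated in Python. This sketch uses power iteration with deflation to extract the two leading principal components of the covariance matrix (the report used a Matlab program / princomp; this is only an equivalent illustration), then projects each node's vector by dot product, as described above:

```python
def top_components(data, n=2, iters=500):
    """Leading n eigenvectors of the sample covariance matrix,
    found by power iteration with deflation."""
    d = len(data[0])
    means = [sum(row[i] for row in data) / len(data) for i in range(d)]
    centered = [[row[i] - means[i] for i in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (len(data) - 1)
            for j in range(d)] for i in range(d)]
    comps = []
    for _ in range(n):
        v = [1.0] * d
        for _ in range(iters):  # power iteration converges to the dominant eigenvector
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        comps.append(v)
        # deflation: subtract the found component so the next pass finds the runner-up
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return comps

def project2d(row, comps):
    """(x, y) coordinates: dot products with the two leading eigenvectors."""
    return tuple(sum(c[i] * row[i] for i in range(len(row))) for c in comps)
```

With the 128-by-7 node matrix, `project2d` gives the 2-D scatter of nodes discussed above.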
For the machine type alone, there are IA32, IA64, IBM Power 4, Origin 2000, and various other options available. Among all these configurations, people tend to choose only a few on which to run their applications, and they look for performance optimization methods specific to one or a few configurations. So basically, people optimize the performance of an application on one configuration, and when the application is ported to another system, they have to do the same work again. This is clearly a waste of time and energy. Is it possible to analyze the configurations of the systems and classify the different systems into groups based on their performance? If so, there may be common optimization methods suitable for all the systems in the same group, and a performance optimization found on one system may also be effective on similar ones. This would greatly reduce the time and energy needed to tune an application on a new system, and it also makes automated optimization plausible.

2.2 Data

In the data we got from NCSA, one application, Continuum, has been run on many different systems. The following table lists all the systems used in this experiment. All of them have 64 nodes.
All configurations run the application CONTINUUM on an IBM SP 2 with a RAID disk configuration under AIX; they differ in file system, MPI implementation, and options:

Short Name  File System  Parallel Type  Other Options           Disk Config  Op Sys
Ps1p0       PIOFS        IBM MPI        seqread 1, parwrite 0   RAID         AIX
PHs1p0      PIOFS        In House MPI   seqread 1, parwrite 0   RAID         AIX
Gs1p0       GPFS         IBM MPI        seqread 1, parwrite 0   RAID         AIX
GHs1p1      GPFS         In House MPI   seqread 1, parwrite 1   RAID         AIX
GHs1p16     GPFS         In House MPI   seqread 1, parwrite 16  RAID         AIX
GHs1p0      GPFS         In House MPI   seqread 1, parwrite 0   RAID         AIX
Gs0p16      GPFS         IBM MPI        seqread 0, parwrite 16  RAID         AIX
Gs0p4       GPFS         IBM MPI        seqread 0, parwrite 4   RAID         AIX
Gs1p1       GPFS         IBM MPI        seqread 1, parwrite 1   RAID         AIX
Gs1p16      GPFS         IBM MPI        seqread 1, parwrite 16  RAID         AIX
Gs0p1       GPFS         IBM MPI        seqread 0, parwrite 1   RAID         AIX
Gs0p0       GPFS         IBM MPI        seqread 0, parwrite 0   RAID         AIX
GHs0p0      GPFS         In House MPI   seqread 0, parwrite 0   RAID         AIX
GHs0p1      GPFS         In House MPI   seqread 0, parwrite 1   RAID         AIX
GHs0p16     GPFS         In House MPI   seqread 0, parwrite 16  RAID         AIX

Together with the original SDDF data file comes another file containing statistical information about the raw data. The statistics are computed for each basic I/O operation over all the nodes: the total count of the operation, the duration, the percentage of I/O time over the whole execution time, etc. What I used in this experiment is, for each basic I/O operation, the percentage of its I/O time over the whole execution time. We used this field, and only this field, for two reasons. First, it is a better way to evaluate the weight of an operation within the whole application; other fields would give only a biased view. Second, we do not want to represent a node by a very long vector containing all the fields and then cluster on that.
That would increase the complexity of the clustering work, and the long vector would also contain much redundant and biased information, leading to inaccurate results.

2.3 Detailed method of system clustering

The method used here is the same as the node clustering of section 1. Now each system configuration is a vector with the fields (Open, Read, Seek, Write, Flush, Close), where the value of each field is the percentage of the I/O time of that operation over the whole application execution time. There are 15 different system configurations in total, and we used kmeans to cluster them into groups. First of all, we must decide the value of k, that is, how many groups we want. I ran kmeans with k values from 2 to 15 to get the deviation curve, just as in section 1. From that curve, we can see that when k equals 5 the deviation curve becomes flat, so we cluster the 15 configurations into 5 groups. Running kmeans again with k equal to 5 gave the following table.

Sys Config  Group Label    Sys Config  Group Label
Gs0p4       1              Ps1p0       4
Gs1p1       5              PHs1p0      4
Gs1p16      1              Gs1p0       4
Gs0p1       2              GHs1p1      3
Gs0p0       4              GHs1p16     4
GHs0p0      4              GHs1p0      4
GHs0p1      5              Gs0p16      1
GHs0p16     4

According to this analysis, when considering only I/O operations, Gs0p4, Gs1p16, and Gs0p16 are in the same group, which means they have relatively the same I/O behavior. So an I/O performance improvement on one of these systems is possibly also effective on the others.

2.4 Summary

This is only a tentative attempt to cluster system configurations into groups. The ultimate goal of this work is to find generic performance improvement methods that apply within a system group, so that automated performance improvement becomes possible. In this experiment we used I/O traces, so we can only compare the system configurations based on their I/O performance.
And only I/O improvement methods are possibly interchangeable among the systems in the same group. In a real system, many other factors come into play, such as CPU, memory, and network; in future work we should consider them all.

3. Node Clustering Similarity

3.1 Motivation

This work is an extension of the work done in section 1, where we grouped the nodes of one application into clusters; using that grouping, we can apply statistical sampling to monitor a heterogeneous system. In section 2, we also grouped the systems into clusters and argued that system configurations in the same group may have the same performance, although we could not demonstrate this for lack of data. Now we make another bold guess: is it possible that the executions of one application on two different systems of the same group have the same node grouping? In other words, if node 1, node 2, and node 3 are in the same group when an application runs on system A, are these three nodes still in the same group when the application runs on system B, a system in the same group as A? We wish to find such patterns in the node groupings across different systems. If we knew that some nodes are always in the same group, or never in the same group, we could use this information as prior knowledge in semi-supervised clustering.

3.2 Data

I am using the same data set as in section 2, Continuum. This application was executed on 15 different system configurations, all of which have 64 nodes.

3.3 Detailed method of node clustering similarity

First, we use kmeans to cluster the nodes in each of the 15 system configurations. The method for determining the value of k is the same as before. I randomly picked 4 systems and drew their deviation curves below.
[Figure: deviation curves for GHs1p1, Gs0p0, Gs0p1, and PHs1p0]

From these deviation curves, we can see that 10 is a good choice for the value of k, so we cluster the nodes in each system using kmeans with k equal to 10. Two sample outputs are listed below.

Node grouping configuration of GHs0p0
Rows 1 through 19
 3  1  1  1  8  9  2  6  8  2  1  9  8  6  4  6  8  4  6
Rows 20 through 38
 4  8  9  9  6  8  9  1  1  8  2  9  6 10  7  7  7  8  5
Rows 39 through 57
 6  2  8  2  5  5 10  9  2  9  8  2  6  2 10  7  7  7 10
Rows 58 through 64
 7  7  7  8  4  4  4

Node grouping configuration of GHs0p16
Columns 1 through 19
 5  4  4  4  1 10 10  2  1  9  4 10  1 10  8  2  1  8  2
Columns 20 through 38
 8  1 10 10  2  1 10  4  4  1  9 10 10  7  6  6  6  1  3
Columns 39 through 57
 2  8  1  9  3  3  7 10  9 10  1  9  2  8  7  6  6  6  7
Columns 58 through 64
 6  6  6  1  8  2  8

Kmeans groups the nodes into clusters but does not assign canonical group labels. In GHs0p0, node 2, node 3, and node 4 are in the same group, labeled 1; in GHs0p16 these three nodes are still in the same group, but this time labeled 4. This is a problem for calculating similarity, because the group labels are not predetermined. I wrote a Java program to calculate the similarity. The basic idea is to try all possible mappings from the group labels of system 1 to the group labels of system 2; for each mapping, count how many nodes in the two systems end up with the same group label. The maximum possible value is 64, the total number of nodes, and it is easy to show that there are 10! mappings in total. Among all these mappings, we record the one with the maximum count. This node label similarity is also a kind of measure of the similarity of the system configurations: if one application is executed on two systems and the node labels are strongly similar, we can say that these two systems have some intrinsic relationship, because their node groupings are almost the same, so that each node does the same work.
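The label-matching search described above can be sketched as follows. This is a Python illustration of the Java program's brute-force idea; note that for k = 10 it enumerates all 10! = 3,628,800 permutations, so a maximum-weight matching on the k-by-k label co-occurrence matrix (e.g. the Hungarian algorithm) would be the scalable alternative:

```python
from itertools import permutations

def label_similarity(labels_a, labels_b, k):
    """Maximum, over all relabelings of system A's groups, of the number of
    nodes that receive the same group label in both systems."""
    best = 0
    for perm in permutations(range(1, k + 1)):
        mapping = dict(zip(range(1, k + 1), perm))  # label in A -> candidate label in B
        matches = sum(1 for a, b in zip(labels_a, labels_b) if mapping[a] == b)
        best = max(best, matches)
    return best
```

With the two 64-node label sequences above and k = 10, this returns the similarity values tabulated below.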
The following table gives the similarities among all 15 system configurations (columns in the same order as the rows). The maximum value is reached when a configuration is compared with itself. Most of the other values are around 20-30, which means the configurations do not have much similarity. There is one exception: GHs0p16 and GHs0p0 have a similarity of 58, a very high value between two different systems. However, we cannot explain the reason, for lack of detailed information about the application and the systems.

GHs0p0   64
GHs0p1   20 64
GHs0p16  58 21 64
GHs1p0   28 20 29 64
Gs0p0    23 22 23 22 64
Gs0p1    19 17 17 18 19 64
Gs0p4    20 19 22 22 17 19 64
Gs0p16   22 20 21 22 19 22 21 64
Gs1p0    25 21 21 19 23 20 24 19 64
Gs1p1    20 17 19 22 24 19 16 18 19 64
Gs1p16   26 19 26 28 21 19 18 29 18 21 64
PHs1p0   22 19 21 20 22 20 20 19 22 15 17 64
Ps1p0    19 22 19 20 20 20 21 17 26 23 17 20 64
GHs1p1   16 15 16 16 19 19 17 18 17 18 17 17 23 64
GHs1p16  31 18 30 27 18 17 19 22 19 21 37 18 19 15 64

Since the similarity is low most of the time, we can conclude that the application is probably scheduled dynamically on the systems, so there is probably no pattern in the node grouping information.

3.4 Summary

After clustering the nodes into groups, we wished to find patterns in the grouping information. If we could find relationships showing that some particular nodes are always, or never, in the same group, we could use this information for semi-supervised clustering of the nodes in the future, and we could also take it into account when scheduling tasks among the nodes. However, according to this experiment, two different node groupings do not have much relationship, with one exception (GHs0p0 and GHs0p16, for which we do not know the reason).
This is probably because the Continuum application is scheduled dynamically among the nodes. For future work, we would like applications and systems that schedule tasks statically, so that we may be able to find node grouping patterns.

4. Application Phase Clustering

4.1 Motivation

Large scale applications can solve very complex problems, and they usually take a long time to execute. During this long period, the usage of the various system components is not constant. For example, if many threads enter a compute-intensive code section, CPU usage will peak, and if many threads want to output data through I/O, usage of the I/O bus will peak. If the system does not have enough bandwidth, such a peak usually causes a bottleneck, which degrades the overall performance. Obviously this is not what we want to happen. People spend a lot of time and money optimizing performance, not only to increase component usage but also to avoid bottlenecks. So if we can find the phases of an application, we can schedule tasks on the system more cleverly to prevent bottlenecks. Furthermore, if we can predict the next phase, we can also prepare resources ahead of time, which should reduce the transition time between phases.

4.2 Data

The data I used in this experiment come from the application CSAR, which runs on an Origin 2000 cluster with 8 nodes. We cluster the I/O operations into phases on each of the 8 nodes separately.

4.3 Detailed method of phase clustering

Each I/O operation in the data file has a field called "timestamp" which records when the operation was executed, so we can use a density-based method to cluster the operations into phases based on their execution time. First we set a distance threshold, epsilon, and then cluster the operations on each node into phases.
If the time difference between two consecutive operations exceeds this threshold, we consider them to be in different phases; otherwise they are in the same phase. Consider the following figure.

[Figure: a series of operations (black spots) on a time axis, split by a gap larger than the threshold into I/O Phase 1 and I/O Phase 2]

Based on the method described above, this series of operations is partitioned into 2 phases. Setting epsilon to 10000000 (the unit is microseconds), the output of the program is:

Node 0
  Group 0: Size 19522, start at 3093541, end at 3922961
  Group 1: Size 19374, start at 43118208, end at 47818302
Node 1
  Group 0: Size 19522, start at 2324437, end at 7051689
  Group 1: Size 19374, start at 42267194, end at 46831110
Node 2
  Group 0: Size 19522, start at 3010977, end at 6826992
  Group 1: Size 19374, start at 42974715, end at 47527218
Node 3
  Group 0: Size 19522, start at 3012760, end at 7670602
  Group 1: Size 19374, start at 42885584, end at 47458908
Node 4
  Group 0: Size 19522, start at 2322997, end at 8098248
  Group 1: Size 19374, start at 42139258, end at 46675174
Node 5
  Group 0: Size 19522, start at 3015480, end at 7665163
  Group 1: Size 19374, start at 42880382, end at 47445678
Node 6
  Group 0: Size 19522, start at 2306267, end at 8111826
  Group 1: Size 19374, start at 42152356, end at 46688083
Node 7
  Group 0: Size 19522, start at 2289102, end at 8091835
  Group 1: Size 19374, start at 42127913, end at 46737168

The result lists the phases for each node, and for each phase the number of operations, the starting time, and the ending time. Comparing the results across nodes, we can see that each node does approximately the same I/O work, since they have the same phase statistics. Now a question similar to the one we encountered with kmeans arises: how many phases are there? What should the value of the threshold be?
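The gap-based splitting rule above amounts to one-dimensional density clustering and can be sketched in a few lines. This is a hedged Python illustration (the original program was Java), assuming the timestamps for one node arrive in sorted order:

```python
def split_phases(timestamps, epsilon):
    """Split a sorted list of operation timestamps into phases wherever the
    gap between consecutive operations exceeds epsilon."""
    phases = [[timestamps[0]]]
    for t in timestamps[1:]:
        if t - phases[-1][-1] > epsilon:
            phases.append([t])       # gap too large: start a new phase
        else:
            phases[-1].append(t)
    return phases

def summarize(phases):
    """Per-phase (size, start, end), matching the fields in the program's output."""
    return [(len(p), p[0], p[-1]) for p in phases]
```

Running this per node with epsilon = 10000000 microseconds would produce the size/start/end listing shown above.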
Similarly, we can draw a kind of deviation curve and find a plausible threshold by observing the trend of the curve. We first need to define how to calculate the deviation: for each phase, calculate the average timestamp, take the square of each operation's difference from this average, and sum all these squared differences over all phases. We use this sum as the deviation of the phase clustering. If each phase contains only one operation, the deviation is zero, and the larger the phases are, the larger the deviation will be. I set the threshold ranging from 1 to 1E8. The result is shown below.

Threshold   Deviation   Log(threshold)  Log(deviation)
1           0           0               0
10          1.80E+09    1               9.256051
100         1.50E+11    2               11.17662
1000        1.83E+13    3               13.26181
10000       3.07E+14    4               14.48746
100000      5.63E+16    5               16.75043
1000000     2.98E+17    6               17.47479
1.00E+07    2.98E+17    7               17.47482
1.00E+08    1.23E+20    8               20.09044

[Figure: log(deviation) versus log(threshold)]

We can see that the curve has its biggest jump from 7 to 8 (not counting the jump from 0 to 1, since a huge jump at the beginning is expected). This observation matches the result well: when the threshold is 1E7, the phases are separated very well, as shown above.

4.4 Summary

Since we only have I/O performance data, we can only partition the phases into I/O-intensive and non-I/O-intensive. When new factors such as CPU and memory operations come into play, we can use the same density-based method to cluster those operations into phases.

5. Conclusions

People have studied how to improve the performance of large scale applications on high performance computers for decades, and there are now many mature and successful methods, such as performance modeling. Compared with performance study, data mining is something new, but it has already been playing a great role in various research fields.
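The deviation measure defined above (sum of squared differences from each phase's mean timestamp) can be written down directly; this short Python sketch is an illustration, not the report's actual program:

```python
def phase_deviation(phases):
    """Sum over phases of the squared deviation of each timestamp from the
    phase's mean. Zero when every phase holds a single operation; grows with
    phase size and spread, as described in the text."""
    total = 0.0
    for phase in phases:
        m = sum(phase) / len(phase)
        total += sum((t - m) ** 2 for t in phase)
    return total

# Sweep thresholds (e.g. 10**0 .. 10**8), split the timestamps at each threshold,
# and compare log(deviation) against log(threshold) to locate the big jump that
# marks a natural phase separation.
```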
How to apply data mining techniques to assist traditional performance optimization methods is a new and promising area. In this project, I tried several clustering methods from data mining on I/O performance data. Although we could not find many exciting results, which is caused by the limitations of the data, the methods proposed here should work for all kinds of data. For future work, we should try other kinds of data, such as CPU, memory, and network performance; if we want to apply data mining to find patterns of a whole application, we must consider all these factors. We should also note that when we fail to find patterns in performance data, the cause is sometimes not the methods we use but the data itself, which may simply not contain the patterns we are looking for.
