Wang
Comp290-090 Final Project
Mining in Performance Data

    In this project, we tried several methods to mine patterns in performance data. The
ultimate goal of this work is to use these patterns in automated performance optimization,
which can greatly help researchers tune their applications for better performance.
However, this work has just started: currently we do not know whether there really are
patterns in performance data, or how we would use them in future work. What we did in
this project is to try as many data mining methods as possible on performance data and
look for useful patterns.
    The data we used are I/O traces that come from NCSA at UIUC. All our experiments
are based on these data, so currently we can only mine I/O performance data, but the
methods we propose here should not be limited to I/O operations.
    Basically, I did four experiments in this project: node clustering, system clustering,
node clustering similarity, and phase clustering. I will explain each of them in detail in
the following sections.

1. Nodes Clustering
1.1 Motivation
    Nowadays, the scale of high performance computers has grown tremendously;
hundreds or even thousands of nodes often work together to execute a large scale
application. With this great computing power, we are now able to solve, in a reasonable
time, huge problems that were not solvable before. However, increasing the computing
power alone is not sufficient for high performance; we also need to tune the nodes to
work efficiently with each other. Clearly, measuring the performance of each node during
the application's execution is the first step in this tuning work. Then we can schedule
tasks dynamically based on this information.
    However, monitoring performance introduces overhead into the application's
execution. Sometimes the overhead is too big to neglect, especially when there are many
nodes to monitor. What we are pursuing is a shorter wall-clock application runtime, so we
obviously don't want our efforts toward that goal to ironically degrade the overall
performance.
    There are several ways to reduce this overhead. First, we can optimize the monitoring
process itself to decrease its runtime. The second method, probably more efficient, is not
to monitor all the nodes at all; instead, we use statistical sampling to monitor only part of
the node population. It has been demonstrated that this is indeed an effective way to
reduce the overhead. The limitation of this method is that it can only be used on a
homogeneous system, in which every node does basically the same work.
    To apply statistical sampling to a heterogeneous system, we propose a method that
uses data mining techniques to partition the node population into several groups based on
their behaviors. Then we can sample within every group. The detailed method is
introduced in the following section.
1.2 Data
     The data we used come from NCSA (the National Center for Supercomputing
Applications) at UIUC. They are I/O traces of many large scale applications on many
different kinds of systems. The data are recorded in the SDDF format, which is self-
describing like XML.
     Since our work is done on these I/O data, we only consider the performance of I/O
operations and the clustering work is done based on the I/O performance of each node.
However our method is not limited to I/O, when other kinds of data come into play, such
as CPU and memory performance, we can take them into consideration directly, with
little modification.
1.3 Detailed method of node clustering
1.3.1 Analyzing the data file
     The I/O tracing data record all basic I/O operations on every node. Each entry
records the operation name, timestamp, duration, node number, etc. Among all the I/O
operations, we are most interested in these basic ones: "Open", "Flush", "Close", "Read",
"Seek", "Write" and "Dump Cost".
     I wrote a Java program to parse the data file and compute accumulated information
for each node. For each node, I summed the duration of every kind of operation
separately. Finally, each node is represented by a vector of length 7, in which each value
is the total duration of one basic I/O operation on that node.
     For example, suppose the Java program analyzed the operations of node 1 and got
the following vector:
     (49.0597, 0.1659, 0.8177, 6.5939, 2.5639, 5.1553, 1.2977)
     Then we know that the total duration of "Open" on node 1 is 49.0597, and the total
duration of "Flush" is 0.1659.
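     The accumulation step can be sketched as below. This is only an illustration: the real
SDDF parsing is omitted, and we assume the records have already been reduced to
(node, operation, duration) triples, which is a simplification of the actual file layout.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the per-node accumulation step. SDDF parsing is omitted; we
// assume records arrive as (node, operation, duration) triples.
public class NodeVectors {
    static final String[] OPS =
        {"Open", "Flush", "Close", "Read", "Seek", "Write", "Dump Cost"};

    // node id -> accumulated duration per basic I/O operation
    static Map<Integer, double[]> vectors = new HashMap<>();

    static void addRecord(int node, String op, double duration) {
        double[] v = vectors.computeIfAbsent(node, k -> new double[OPS.length]);
        for (int i = 0; i < OPS.length; i++) {
            if (OPS[i].equals(op)) { v[i] += duration; break; }
        }
    }

    public static void main(String[] args) {
        addRecord(1, "Open", 49.0597);
        addRecord(1, "Flush", 0.1659);
        System.out.println(vectors.get(1)[0]); // total "Open" time on node 1
    }
}
```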
1.3.2 Clustering method: kmeans
     In the 7-dimension space of I/O operations, each node is a point. Points that are near
each other in the space correspond to nodes that do relatively the same work, so we can
classify the nodes into groups by clustering the points in this 7-dimension space. The
clustering method I used here is kmeans.
     I ran kmeans on the applications that have 128 nodes.
     There is one remaining question: what is the value of k? In other words, how many
clusters do we want? Since kmeans is an unsupervised method, we have no prior
knowledge of how many clusters are out there. To address this, I ran kmeans many times
with k values from 2 to 30. In each run, I calculated the total deviation of every node
from the centroid of its cluster. Obviously, the total deviation decreases as k increases,
and eventually, when the number of clusters equals the number of nodes so that each
cluster contains only one node, the total deviation reaches zero. But by looking at the
trend of the deviation curve, we can get a general feeling for what the value of k should
be.
     I tried the following 4 configurations. All of them have 128 nodes.

   Application     Machine Type   File System   Number of Nodes   Disk Config   OpSys
   CSAR            Origin2000     XFS           128               Raid-1        Irix6.5
   VTF             IBM SP2        GPFS          128               Raid          AIX
   ENZO            Origin2000     XFS           128               Raid-1        Irix6.5
   HARTREE-FOCK    Paragon        PFS           128               -             OSF/1 R1.2

   The deviation curves of the above four configurations are shown below.

   [Figures: deviation curves for CSAR128IRIX, VTF128AIX, ENZO128IRIX, and HARTREE-FOCK]

    From the above figures, we can see that when k is small, the deviation decreases
dramatically as k increases; when k reaches 10, the curves begin to flatten, and increasing
the number of clusters further gives little additional benefit. So 10 is a good choice for
the value of k.
    Running kmeans again with k equal to 10, we get the group label of each node. A
sample output for CSAR128IRIX is shown below.

   Rows 1 through 19
     1 8 10 7 8 9 7 6 8 1 8 7 4 8                              8     3 10   4   5
   Rows 20 through 38
     3 5 4 4 2 2 8 6 5 6 2 8 3 3                               5   2   2    6   4
   Rows 115 through 128
     6 4 2 6 9 3 7 8 2 9 8 8 4 5
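    The clustering and the deviation-versus-k curve can be sketched as follows. This is a
minimal kmeans for illustration, not the actual implementation used in the experiment,
and the random points here merely stand in for the real 128 per-node I/O vectors.

```java
import java.util.Random;

// Minimal k-means sketch used to reproduce the deviation-vs-k curve.
public class KMeans {
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // Returns the total squared deviation of every point from its centroid.
    static double cluster(double[][] pts, int k, int iters, long seed) {
        int n = pts.length, dim = pts[0].length;
        Random rnd = new Random(seed);
        double[][] cent = new double[k][];
        for (int j = 0; j < k; j++) cent[j] = pts[rnd.nextInt(n)].clone();
        int[] label = new int[n];
        for (int it = 0; it < iters; it++) {
            for (int i = 0; i < n; i++) {            // assignment step
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist2(pts[i], cent[j]) < dist2(pts[i], cent[best])) best = j;
                label[i] = best;
            }
            double[][] sum = new double[k][dim];     // centroid update step
            int[] cnt = new int[k];
            for (int i = 0; i < n; i++) {
                cnt[label[i]]++;
                for (int d = 0; d < dim; d++) sum[label[i]][d] += pts[i][d];
            }
            for (int j = 0; j < k; j++)
                if (cnt[j] > 0)
                    for (int d = 0; d < dim; d++) cent[j][d] = sum[j][d] / cnt[j];
        }
        double dev = 0;
        for (int i = 0; i < n; i++) dev += dist2(pts[i], cent[label[i]]);
        return dev;
    }

    public static void main(String[] args) {
        double[][] pts = new double[128][7];         // stand-in for node vectors
        Random rnd = new Random(0);
        for (double[] p : pts) for (int d = 0; d < 7; d++) p[d] = rnd.nextDouble();
        for (int k = 2; k <= 30; k++)                // the deviation curve
            System.out.println(k + "\t" + cluster(pts, k, 20, 0));
    }
}
```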
1.4 Decreasing the dimensionality by PCA
    Each node is represented by a 7-dimension vector. In the clustering process, we need
to calculate distances in a 7-dimension space, which is fine in this experiment. But when
there are even more dimensions, or thousands or millions of points to cluster, the
clustering may become very slow. Furthermore, it is very hard for human beings to
imagine how the points are grouped in such a high dimension space.
    To solve these two problems, we can use PCA to preprocess the data first and
decrease the dimensionality.
    I wrote a program in Matlab to calculate the eigenvectors and eigenvalues of a
matrix; alternatively, we can use the princomp method provided by Matlab.
    The matrix in this experiment has 128 rows and 7 columns. The eigenvectors and
eigenvalues are returned as two matrices:
    -0.0061 -0.0106 -0.0012 0.0289 0.0404 -0.4029         0.9138
    -0.2683 -0.9513 -0.1176 -0.0757 -0.0453 0.0382         0.0083
     0.9544 -0.2499 -0.1317 -0.0160 0.0371 0.0791         0.0370
    -0.0043 0.0408 0.1898 -0.0334 -0.6598 0.6516          0.3182
    -0.1155 0.1687 -0.8917 -0.2328 0.1086 0.2841          0.1278
    -0.0513 -0.0340 0.2965 0.0124 0.7401 0.5610           0.2139
    -0.0323 -0.0358 -0.2230 0.9684 -0.0103 0.0999         0.0130
     0.0078     0    0     0     0    0     0
        0 0.0143     0     0     0    0     0
        0    0 0.0570      0     0    0     0
        0    0     0 0.0749      0    0     0
        0    0     0     0 0.1652     0     0
        0    0     0     0     0 4.1204     0
        0    0     0     0     0    0 17.4797

    There are only two significant eigenvalues (17.4797 and 4.1204), so we can consider
only the space spanned by their corresponding eigenvectors. This new space has just two
dimensions.
    We can also project all the points into this space and see how they are located relative
to each other: we calculate the dot product of each node's vector with each of the two
eigenvectors, one serving as the x axis and the other as the y axis, and then draw all the
nodes in a two-dimension space.
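    The projection step can be illustrated directly with the two dominant eigenvectors
reported above (the last two columns of the eigenvector matrix, corresponding to
eigenvalues 17.4797 and 4.1204); the node vector reused here is the node 1 example from
section 1.3.1.

```java
// Projection onto the two dominant principal directions: each 7-dimension
// node vector is reduced to an (x, y) point by two dot products.
public class PcaProject {
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    public static void main(String[] args) {
        // eigenvector for eigenvalue 17.4797 (x axis)
        double[] pc1 = {0.9138, 0.0083, 0.0370, 0.3182, 0.1278, 0.2139, 0.0130};
        // eigenvector for eigenvalue 4.1204 (y axis)
        double[] pc2 = {-0.4029, 0.0382, 0.0791, 0.6516, 0.2841, 0.5610, 0.0999};
        double[] node1 = {49.0597, 0.1659, 0.8177, 6.5939, 2.5639, 5.1553, 1.2977};
        System.out.println(dot(node1, pc1) + " " + dot(node1, pc2));
    }
}
```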
    There are no very obvious clusters in the resulting graph. Although the nodes lie
along two major lines, it is not clear how to partition them along those lines. So although
we've clustered the nodes into 10 groups, that doesn't mean there are 10 very obvious
clusters in the space. This problem comes from the data itself, but the methods we
introduced here should still apply.
1.5 Summary
    Based on the above method, we have classified the 128 nodes into 10 different groups.
When monitoring the whole heterogeneous system, we can use statistical sampling to
choose only some representative nodes from each group, which greatly reduces the
overhead of performance monitoring.

2. System Clustering
2.1 Motivation
    Currently, many vendors provide the hardware and software components used to
build a large scale computer, so people have many choices of component types. For the
machine type alone, there are IA32, IA64, IBM Power 4, Origin 2000, and various other
kinds available.
    Among all these different configurations, people tend to choose only a few to run
their applications, and they look for performance optimization methods specific to one or
a few configurations. So basically, people optimize an application's performance on one
configuration, and when the application is ported to another system, they have to do the
same work all over again. This is absolutely a waste of time and energy.
    Is it possible to analyze the configurations of the systems and classify the different
systems into groups based on their performance? If so, there may be common
optimization methods suitable for all the systems in the same group, and a performance
optimization found on one system may also be effective on similar ones. This would
greatly reduce the time and energy needed to tune an application on a new system;
automated optimization also becomes plausible in this sense.
2.2 Data
    In the data we got from NCSA, one application, Continuum, has been run on many
different systems.
    The following table lists all the systems used in this experiment. All of them have 64
nodes.

  Application   Machine Type   File System   Parallel Type   Other Options            Disk Config   Op Sys   Short Name
  CONTINUUM     IBM SP 2       PIOFS         IBM MPI         seqread 1, parwrite 0    RAID          AIX      Ps1p0
  CONTINUUM     IBM SP 2       PIOFS         In House MPI    seqread 1, parwrite 0    RAID          AIX      PHs1p0
  CONTINUUM     IBM SP 2       GPFS          IBM MPI         seqread 1, parwrite 0    RAID          AIX      Gs1p0
  CONTINUUM     IBM SP 2       GPFS          In House MPI    seqread 1, parwrite 1    RAID          AIX      GHs1p1
  CONTINUUM     IBM SP 2       GPFS          In House MPI    seqread 1, parwrite 16   RAID          AIX      GHs1p16
  CONTINUUM     IBM SP 2       GPFS          In House MPI    seqread 1, parwrite 0    RAID          AIX      GHs1p0
  CONTINUUM     IBM SP 2       GPFS          IBM MPI         seqread 0, parwrite 16   RAID          AIX      Gs0p16
  CONTINUUM     IBM SP 2       GPFS          IBM MPI         seqread 0, parwrite 4    RAID          AIX      Gs0p4
  CONTINUUM     IBM SP 2       GPFS          IBM MPI         seqread 1, parwrite 1    RAID          AIX      Gs1p1
  CONTINUUM     IBM SP 2       GPFS          IBM MPI         seqread 1, parwrite 16   RAID          AIX      Gs1p16
  CONTINUUM     IBM SP 2       GPFS          IBM MPI         seqread 0, parwrite 1    RAID          AIX      Gs0p1
  CONTINUUM     IBM SP 2       GPFS          IBM MPI         seqread 0, parwrite 0    RAID          AIX      Gs0p0
  CONTINUUM     IBM SP 2       GPFS          In House MPI    seqread 0, parwrite 0    RAID          AIX      GHs0p0
  CONTINUUM     IBM SP 2       GPFS          In House MPI    seqread 0, parwrite 1    RAID          AIX      GHs0p1
  CONTINUUM     IBM SP 2       GPFS          In House MPI    seqread 0, parwrite 16   RAID          AIX      GHs0p16

     Together with the original SDDF data file, there is another file that contains
statistical information about the raw data. The statistics are computed for each basic I/O
operation over all the nodes: the total count of the operation, the duration, the percentage
of I/O time over the whole execution time, etc.
     What I used in this experiment is the percentage of I/O time over the whole
execution time for each basic I/O operation. We used this field, and only this field, for
two reasons. First, it is a better way to evaluate the weight of an operation within the
whole application; other fields would give only a biased view. Second, we don't want to
represent a node by a very long vector containing all the fields and then cluster on it: that
would increase the complexity of the clustering, and the long vector also contains much
redundant and biased information, which would produce an inaccurate result.
2.3 Detailed method of system clustering
     The method used here is the same as the node clustering in section 1. Now, each
system configuration is a vector with the fields (Open, Read, Seek, Write, Flush, Close);
the value of each field is the percentage of the I/O time of that operation over the whole
application execution time.
     There are 15 different system configurations in total. We used kmeans to cluster
these systems into groups. First of all, we have to decide the value of k: how many
groups do we want?
     I ran kmeans with k values from 2 to 15 to get the deviation curve, just as we did in
section 1.
   From the resulting curve, we can see that when k equals 5 the deviation curve
becomes flat, so we cluster the 15 configurations into 5 clusters.
   Running kmeans again with k equal to 5, we got the following table.

    Sys Config        Group Label        Sys Config        Group Label
      Gs0p4               1                Ps1p0               4
      Gs1p1               5               PHs1p0               4
      Gs1p16              1                Gs1p0               4
      Gs0p1               2               GHs1p1               3
      Gs0p0               4               GHs1p16              4
     GHs0p0               4               GHs1p0               4
     GHs0p1               5                Gs0p16              1
     GHs0p16              4

    According to the analysis above, when considering only I/O operations, Gs0p4,
Gs1p16 and Gs0p16 are in the same group, which means they have relatively the same
I/O behavior. So an I/O performance improvement found on one of these systems is
possibly also effective on the others.
2.4 Summary
    This is only a tentative attempt to cluster system configurations into different groups.
The ultimate goal of this work is to find generic performance improvement methods
within a system group, so that automated performance improvement becomes possible.
    In this experiment, we are using I/O tracing data, so we can only compare system
configurations based on their I/O performance, and only I/O improvement methods are
possibly interchangeable among systems in the same group. However, in a real system,
many other factors come into play, such as CPU, memory, and network. In future work,
we should do further research to consider them all.

3. Node Clustering Similarity
3.1 Motivation
    This work is an extension of the work done in section 1. In section 1, we grouped the
nodes running one application into different clusters; using this grouping information, we
can apply statistical sampling to monitor a heterogeneous system. In section 2, we also
grouped the systems into different clusters, and we said that the system configurations in
the same group may possibly have the same performance, although we couldn't
demonstrate this for lack of data.
    Now we have another bold guess: is it possible that executions of one application on
two different systems of the same group have the same node grouping configuration? In
other words, if node 1, node 2, and node 3 are in the same group when an application
runs on system A, is it possible that these three nodes are still in the same group when
the application runs on system B, which is in the same group as system A?
    So we wish to find patterns in the node grouping configurations across different
systems. If we know that some nodes are always in the same group, or never in the same
group, we can use this information as prior knowledge in semi-supervised clustering.
3.2 Data
    I am using the same data set as in section 2, Continuum. This application was
executed on 15 different system configurations, all of which have 64 nodes.
3.3 Detailed method of node clustering similarity
    First, we use kmeans to cluster the nodes in each of the 15 system configurations.
The method for determining the value of k is the same as before. I randomly picked 4
systems and drew their deviation curves below.

                      [Figures: deviation curves for GHs1p1, Gs0p0, Gs0p1, and PHs1p0]
    From the above deviation curves, we can see that 10 is a good choice for the value of
k. Then we cluster the nodes in each system using kmeans with k equal to 10.
    Two sample outputs of the clustering are listed below.

   Node grouping configuration of GHs0p0
   Rows 1 through 19
     3 1 1 1 8 9 2 6 8 2 1 9 8 6 4 6 8 4 6
   Rows 20 through 38
     4 8 9 9 6 8 9 1 1 8 2 9 6 10 7 7 7 8 5
   Rows 39 through 57
     6 2 8 2 5 5 10 9 2 9 8 2 6 2 10 7 7 7 10
   Rows 58 through 64
     7 7 7 8 4 4 4

   Node grouping configuration of GHs0p16
   Columns 1 through 19
     5 4 4 4 1 10 10 2 1 9 4 10 1 10 8 2 1 8 2
   Columns 20 through 38
     8 1 10 10 2 1 10 4 4 1 9 10 10 7 6 6 6 1 3
   Columns 39 through 57
     2 8 1 9 3 3 7 10 9 10 1 9 2 8 7 6 6 6 7
   Columns 58 through 64
     6 6 6 1 8 2 8

    Kmeans just groups the nodes into clusters; it does not care which group label it uses.
In GHs0p0, node 2, node 3 and node 4 are in the same group, with label 1. In GHs0p16,
these three nodes are still in the same group, but this time their group label is 4. This
causes a problem when calculating similarity, because the group labels are not aligned
between runs.
    I wrote Java code to calculate the similarity. The basic idea is to try all possible
mappings from the group labels in system 1 to the group labels in system 2. For each
mapping, count how many nodes in the two systems end up with the same group label;
the maximum possible is 64, the total number of nodes. It is easy to show that there are
10! mappings in total. Among all these mappings, we record the one with the maximum
agreement and use that count as the similarity.
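    A sketch of this label-matching computation is below. It is a simplified version of the
Java code described above, with 0-based labels and small example arrays rather than the
real 64-node label vectors.

```java
// Similarity of two clusterings whose labels are arbitrary: try every
// mapping from the labels of system 1 onto the labels of system 2 and keep
// the mapping under which the most nodes agree (k! mappings in total).
public class LabelSimilarity {
    static int best;

    // perm[g] = label in system 2 that group g of system 1 is mapped to
    static void search(int[] a, int[] b, int k, int[] perm, boolean[] used, int g) {
        if (g == k) {
            int match = 0;
            for (int i = 0; i < a.length; i++)
                if (perm[a[i]] == b[i]) match++;
            best = Math.max(best, match);
            return;
        }
        for (int l = 0; l < k; l++) {
            if (used[l]) continue;
            used[l] = true; perm[g] = l;
            search(a, b, k, perm, used, g + 1);
            used[l] = false;
        }
    }

    // a and b hold the 0-based group label of each node; k is the group count
    static int similarity(int[] a, int[] b, int k) {
        best = 0;
        search(a, b, k, new int[k], new boolean[k], 0);
        return best;
    }

    public static void main(String[] args) {
        int[] a = {0, 0, 1, 1, 2};
        int[] b = {2, 2, 0, 0, 1};   // same grouping, relabeled
        System.out.println(similarity(a, b, 3)); // all 5 nodes can be matched
    }
}
```

With k = 10 this enumerates 10! (about 3.6 million) mappings, which is still feasible; an
assignment algorithm such as the Hungarian method would find the best mapping much
faster, but brute force suffices at this scale.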
    The node label similarity is also a kind of measurement of the similarity of the system
configurations. If one application is executed on two systems and the node labels have a
strong similarity, we can say that these two systems have some intrinsic relationship,
because their node grouping configurations are almost the same, so that each node does
the same work.
    The following table gives the similarities among all 15 system configurations. The
maximum value is reached when a configuration is compared with itself. Most of the
other values are around 20~30, which means the configurations don't have much
similarity. But there is one exception: GHs0p16 and GHs0p0 have a similarity of 58,
which is a really high value between two different systems. However, we can't explain
the reason, for lack of detailed information about the application and the systems.
          GHs0p0 GHs0p1 GHs0p16 GHs1p0 Gs0p0 Gs0p1 Gs0p4 Gs0p16 Gs1p0 Gs1p1 Gs1p16 PHs1p0 Ps1p0 GHs1p1 GHs1p16
GHs0p0        64
GHs0p1        20      64
GHs0p16       58      21      64
GHs1p0        28      20      29     64
Gs0p0         23      22      23     22    64
Gs0p1         19      17      17     18    19    64
Gs0p4         20      19      22     22    17    19    64
Gs0p16        22      20      21     22    19    22    21     64
Gs1p0         25      21      21     19    23    20    24     19   64
Gs1p1         20      17      19     22    24    19    16     18   19     64
Gs1p16        26      19      26     28    21    19    18     29   18     21    64
PHs1p0        22      19      21     20    22    20    20     19   22     15    17     64
Ps1p0         19      22      19     20    20    20    21     17   26     23    17     20    64
GHs1p1        16      15      16     16    19    19    17     18   17     18    17     17    23     64
GHs1p16       31      18      30     27    18    17    19     22   19     21    37     18    19     15     64

    Since the similarity is low most of the time, we conclude that the application is
probably scheduled dynamically on these systems, so there may be no pattern in the node
grouping information.
3.4 Summary
    After clustering the nodes into different groups, we wish to find patterns in the
grouping information. If we can find relationships showing that some particular nodes
are always in the same group, or never in the same group, we can use this information in
semi-supervised clustering of the nodes in the future, and we can also consider these
relationships when scheduling tasks among the nodes.
    However, according to our experiment, two different node grouping configurations
don't have much relationship, with one exception (GHs0p0 and GHs0p16, for which we
don't know the reason). This is probably because the Continuum application is scheduled
dynamically among the nodes.
    For future work, we would like an application and system that schedule tasks
statically, so that we may be able to find node grouping patterns.

4. Application Phase Clustering
4.1 Motivation
    Large scale applications can solve very complex problems, and they usually take a
long time to execute. During this long period, the usage of the various components of the
system is not constant. For example, if many threads enter a computing-intensive code
section, CPU usage will peak; if many threads output data through I/O, I/O bus usage
will peak.
    If the system doesn't have enough bandwidth, a peak usually causes a bottleneck that
degrades overall performance. Obviously this is not what we want to happen. People
spend a lot of time and money optimizing performance, not only to increase component
usage but also to reduce bottlenecks.
    So if we can find the phases of the application, we can schedule tasks on the system
cleverly to avoid bottlenecks. Furthermore, if we can predict what the next phase is, we
can prepare resources ahead of time, which should reduce the transition time between
phases.
4.2 Data
    The data I used in this experiment come from the application CSAR, which runs on
an Origin 2000 cluster with 8 nodes. We cluster the I/O operations into phases on each of
the 8 nodes separately.
4.3 Detailed method of phase clustering
    Each I/O operation in the data file has a field called "timestamp" that records when
the operation was executed, so we can use a density-based method to cluster the
operations into phases based on their execution time.
    First we set a distance threshold, epsilon. Then we cluster the operations on each
node into phases: if the time difference between two consecutive operations exceeds the
threshold, we consider the two operations to be in different phases; otherwise they are in
the same phase.
    Consider the following figure.

   [Figure: a timeline of I/O operations (black spots), separated by a large gap into I/O Phase 1 and I/O Phase 2]

   Suppose each black spot is an I/O operation. Based on the method described above,
we can partition this series of operations into 2 phases.
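   The gap-based partition can be sketched as below; the timestamps in the example are
illustrative values in microseconds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Density-based phase partition: operations are sorted by timestamp and a
// new phase starts whenever the gap to the previous operation exceeds the
// threshold epsilon (timestamps in microseconds).
public class PhaseCluster {
    // Returns the index at which each phase starts.
    static List<Integer> phaseStarts(long[] ts, long epsilon) {
        long[] t = ts.clone();
        Arrays.sort(t);
        List<Integer> starts = new ArrayList<>();
        starts.add(0);
        for (int i = 1; i < t.length; i++)
            if (t[i] - t[i - 1] > epsilon) starts.add(i);
        return starts;
    }

    public static void main(String[] args) {
        long[] ts = {3093541, 3100000, 3922961, 43118208, 43200000, 47818302};
        System.out.println(phaseStarts(ts, 10_000_000L)); // [0, 3] -> two phases
    }
}
```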
   Setting epsilon to 10000000 (the unit is microseconds), the output of the program is:

   Node 0
       Group 0 : Size 19522,   start at 3093541,    end at 3922961
       Group 1 : Size 19374,   start at 43118208,    end at 47818302
   Node 1
       Group 0 : Size 19522,   start at 2324437,    end at 7051689
       Group 1 : Size 19374,   start at 42267194,    end at 46831110
   Node 2
       Group 0 : Size 19522,   start at 3010977,    end at 6826992
       Group 1 : Size 19374,   start at 42974715,    end at 47527218
   Node 3
       Group 0 : Size 19522,   start at 3012760,    end at 7670602
       Group 1 : Size 19374,   start at 42885584,    end at 47458908
   Node 4
       Group 0 : Size 19522,   start at 2322997,    end at 8098248
       Group 1 : Size 19374,   start at 42139258,    end at 46675174
   Node 5
       Group 0 : Size 19522,   start at 3015480,     end at 7665163
       Group 1 : Size 19374,   start at 42880382,     end at 47445678
   Node 6
       Group 0 : Size 19522,   start at 2306267,     end at 8111826
       Group 1 : Size 19374,   start at 42152356,     end at 46688083
   Node 7
       Group 0 : Size 19522,   start at 2289102,     end at 8091835
       Group 1 : Size 19374,   start at 42127913,     end at 46737168

    In the above result, I list the phases for each node; for each phase, the number of
operations, the start time, and the end time are given. Comparing the results across the
nodes, we can see that each node does approximately the same I/O work, since they have
the same statistics for the I/O operation phases.
    Now a question similar to the one we encountered with kmeans arises: how many
phases are there, and what should the threshold be? Similarly, we can draw a kind of
deviation curve and find a plausible threshold value by observing the trend of the curve.
    We first need to define how to calculate the deviation. For each phase, calculate the
average timestamp, then for every operation take the square of its difference from this
average, and finally sum all these squared differences over all phases. We use this final
sum as the deviation of the phase clustering. If each phase has only one operation, the
deviation is zero; the bigger the phases are, the larger the deviation will be.
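    This deviation measure can be sketched as follows, with each phase represented as an
array of its operations' timestamps (the values below are made up for illustration).

```java
// Deviation of a phase clustering: for each phase, sum the squared
// differences of every timestamp from the phase's mean timestamp, then
// add the sums over all phases.
public class PhaseDeviation {
    static double deviation(double[][] phases) {
        double total = 0;
        for (double[] phase : phases) {
            double mean = 0;
            for (double t : phase) mean += t;
            mean /= phase.length;
            for (double t : phase) total += (t - mean) * (t - mean);
        }
        return total;
    }

    public static void main(String[] args) {
        // one-operation phases contribute zero; bigger phases contribute more
        System.out.println(deviation(new double[][]{{1}, {2}, {3}})); // 0.0
        System.out.println(deviation(new double[][]{{0, 2}, {10}}));  // 2.0
    }
}
```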
    I set the threshold ranging from 1 to 1E8. The result is shown below.

  Threshold        Deviation        Log(threshold)   Log(deviation)
       1               0                  0                0
      10           1.80E+09               1            9.256051
     100           1.50E+11               2            11.17662
    1000           1.83E+13               3            13.26181
    10000          3.07E+14               4            14.48746
   100000          5.63E+16               5            16.75043
  1000000          2.98E+17               6            17.47479
  1.00E+07         2.98E+17               7            17.47482
  1.00E+08         1.23E+20               8            20.09044

   [Figure: log(deviation) versus log(threshold)]

    We can see that the curve has its biggest jump from 7 to 8, not counting the jump
from 0 to 1, since a huge jump at the beginning is expected. This observation matches the
result well: when the threshold is 1E7, the phases are separated cleanly, as shown above.
4.4 Summary
    Since we only have I/O performance data, we can only partition the phases into I/O-
intensive and non-I/O-intensive. When new factors, such as CPU and memory operations,
come into play, we can use the same density-based method to cluster those operations
into phases.

5. Conclusions
    People have studied how to improve the performance of large scale applications on
high performance computers for decades, and there are now many mature and successful
methods, such as performance modeling. Compared with performance study, data mining
is relatively new; however, it already plays a great role in various research fields. How to
apply data mining techniques to assist the traditional performance optimization methods
is a new and promising area.
    In this project, I tried several clustering methods from data mining on I/O
performance data. Although we couldn't find many exciting results, which is due to the
limitations of the data, the methods proposed here should work for all kinds of data. For
future work, we should try other kinds of data, such as CPU, memory, and network
performance; when we want to apply data mining to find patterns across a whole
application, we must consider all these factors. We should also note that sometimes the
reason we can't find patterns in performance data is not the methods we are using but the
data itself: it simply doesn't contain the patterns we are looking for.
