The Search Landscape of Graph Partitioning Problems using Coupling and Cohesion as the Clustering Criteria
        Brian S. Mitchell & Spiros Mancoridis
        {bmitchel,smancori}@mcs.drexel.edu
        http://www.mcs.drexel.edu/~{bmitchel,smancori}
        Department of Computer Science
        Software Engineering Research Group
        http://serg.mcs.drexel.edu
        Drexel University, Philadelphia, PA, USA

10/05/2002
Software Clustering with Bunch
[Figure: the Bunch tool chain. Source code is processed by source code analysis tools (Acacia, Chava) into an MDG file with modules M1 to M8. The Bunch Clustering Tool (Bunch GUI, clustering algorithms, clustering tools, programming API) reads the MDG file and produces a partitioned MDG file, which is passed to a visualization tool.]

  Drexel University Software Engineering Research Group (SERG)
  http://serg.mcs.drexel.edu
Software Clustering as a Search Problem
[Figure: source code is processed by source code analysis tools (Acacia, Chava) into an MDG with modules M1 to M8. The search space is the set of all MDG partitions (Total = 4140 partitions for this 8-module MDG). Software clustering search algorithms explore this space and return a “GOOD” MDG partition.]

Search algorithm shown on the slide:

    bP = null;
    while (searching())
    {
      p = selectNext();
      if (p.isBetter(bP))
        bP = p;
    }
    return bP;
The Search Space is Enormous
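 The number of MDG partitions grows very quickly as the number of
 modules in the system increases. The number of ways to partition
 n modules into exactly k non-empty clusters is:

   $$ S_{n,k} = \begin{cases} 1 & \text{if } k = 1 \lor k = n \\ S_{n-1,k-1} + k\,S_{n-1,k} & \text{otherwise} \end{cases} $$

 Summing S_{n,k} over k = 1..n gives the total number of partitions
 of an n-module system:

    n  partitions |  n  partitions |  n  partitions |  n  partitions
    1  1          |  6  203        | 11  678570     | 16  10480142147
    2  2          |  7  877        | 12  4213597    | 17  82864869804
    3  5          |  8  4140       | 13  27644437   | 18  682076806159
    4  15         |  9  21147      | 14  190899322  | 19  5832742205057
    5  52         | 10  115975     | 15  1382958545 | 20  51724158235372

 A 15-module system is about the limit for performing exhaustive analysis.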
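
The growth in the table can be reproduced directly from the recurrence. The following is a minimal Python sketch; the function names `stirling2` and `count_partitions` are illustrative and do not come from the slides.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Ways to split n modules into exactly k non-empty clusters."""
    if k < 1 or k > n:
        return 0  # guard for invalid input; never reached from count_partitions
    if k == 1 or k == n:
        return 1
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)


def count_partitions(n: int) -> int:
    """Total number of partitions of an n-module MDG (sum over all k)."""
    return sum(stirling2(n, k) for k in range(1, n + 1))


# Print a few of the table rows.
for n in (8, 13, 15, 20):
    print(n, count_partitions(n))
```

Memoizing the recurrence keeps the computation cheap; the cost that makes exhaustive analysis impractical is enumerating and scoring each partition, not counting them.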
Our Assumption…
  “Well designed software systems are
  organized into cohesive clusters that are
  loosely interconnected.”

  We designed a measurement called MQ that
  embodies our assumption
  The MQ measurement balances cohesion and
  coupling
  We apply MQ to partitions of the MDG
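
The slides do not give the formula for MQ. As an illustration only, the sketch below assumes one formulation described in the Bunch literature (often called TurboMQ): MQ is the sum of a per-cluster factor 2μ / (2μ + ε), where μ is the number of edges inside the cluster and ε is the number of edges crossing its boundary. The function name `mq`, the example edges, and the two example partitions are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (source module, destination module)


def mq(edges: List[Edge], cluster_of: Dict[str, int]) -> float:
    """Assumed MQ: sum over clusters of 2*intra / (2*intra + inter)."""
    intra: Dict[int, int] = {}  # edges with both ends inside the cluster
    inter: Dict[int, int] = {}  # edges crossing the cluster boundary
    for src, dst in edges:
        a, b = cluster_of[src], cluster_of[dst]
        if a == b:
            intra[a] = intra.get(a, 0) + 1
        else:
            inter[a] = inter.get(a, 0) + 1
            inter[b] = inter.get(b, 0) + 1
    # Clusters with no internal edges contribute 0 (high coupling, no cohesion).
    return sum(2 * mu / (2 * mu + inter.get(c, 0)) for c, mu in intra.items())


# Hypothetical 6-module MDG: two tightly knit groups joined by one edge.
edges = [("M1", "M2"), ("M2", "M3"), ("M1", "M3"),
         ("M4", "M5"), ("M5", "M6"), ("M4", "M6"),
         ("M3", "M4")]
cohesive = {"M1": 0, "M2": 0, "M3": 0, "M4": 1, "M5": 1, "M6": 1}
scattered = {"M1": 0, "M4": 0, "M2": 1, "M5": 1, "M3": 2, "M6": 2}
print(mq(edges, cohesive), mq(edges, scattered))
```

Under this assumed formula, the cohesive partition scores higher because most edges stay inside a cluster, while the scattered partition has no internal edges at all.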


Not all Partitions of the MDG are
Good Solutions
[Figure: an MDG with six modules, M1 to M6, shown together with two partitions of it, one labeled “Good Partition!” and one labeled “Bad Partition!”.]

        MQ(Good Partition) > MQ(Bad Partition)
The Software Clustering Problem:
Algorithm Objectives
“Find a good partition of the MDG.”
  A partition is the decomposition of a set of
  elements (i.e., all the nodes of the graph)
  into mutually disjoint clusters.
  A good partition is a partition where:
      highly interdependent nodes are grouped in the
       same clusters
      independent nodes are assigned to separate
       clusters
  The better the partition the higher the MQ
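
The definition of a partition above can be checked mechanically. This is a small sketch, assuming clusters are represented as Python sets of module names; `is_partition` is an illustrative name, not part of Bunch.

```python
from typing import Iterable, Set


def is_partition(nodes: Set[str], clusters: Iterable[Set[str]]) -> bool:
    """True if the clusters are non-empty, mutually disjoint, and cover all nodes."""
    seen: Set[str] = set()
    for cluster in clusters:
        if not cluster:       # an empty cluster is not a valid part
            return False
        if seen & cluster:    # a node appears in two clusters
            return False
        seen |= cluster
    return seen == nodes      # every node of the graph must be assigned


nodes = {"M1", "M2", "M3", "M4"}
print(is_partition(nodes, [{"M1", "M2"}, {"M3", "M4"}]))
print(is_partition(nodes, [{"M1", "M2"}, {"M2", "M3", "M4"}]))
```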
Bunch Hill Climbing Clustering Algorithm
[Flowchart]
  1. Generate a random decomposition of the MDG; this is the current partition.
  2. Iteration step: generate the next neighboring partition, measure its MQ, and compare it to the best neighboring partition. If it is better, it becomes the best neighboring partition for the iteration.
  3. The best neighboring partition for the iteration becomes the new current partition (New Best Neighboring Partition).
  4. On convergence, the result is the best neighboring partition.

Neighbor Partition: A neighbor partition is created by altering the current partition slightly.
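
The flow above can be sketched in a few lines of Python. This is an illustrative steepest-ascent variant, not Bunch's implementation. It makes three assumptions: a neighbor is produced by moving one module to another cluster (or to a new singleton cluster), MQ is the sum of 2μ / (2μ + ε) per cluster, and the climb stops when no neighbor improves MQ. All names are hypothetical.

```python
import random
from typing import Dict, Iterator, List, Optional, Tuple

Edge = Tuple[str, str]
Partition = Dict[str, int]  # module name -> cluster id


def mq(edges: List[Edge], part: Partition) -> float:
    """Assumed MQ: sum over clusters of 2*intra / (2*intra + inter)."""
    intra: Dict[int, int] = {}
    inter: Dict[int, int] = {}
    for src, dst in edges:
        a, b = part[src], part[dst]
        if a == b:
            intra[a] = intra.get(a, 0) + 1
        else:
            inter[a] = inter.get(a, 0) + 1
            inter[b] = inter.get(b, 0) + 1
    return sum(2 * mu / (2 * mu + inter.get(c, 0)) for c, mu in intra.items())


def neighbors(part: Partition) -> Iterator[Partition]:
    """Yield every partition obtained by moving exactly one module."""
    cluster_ids = set(part.values())
    fresh = max(cluster_ids) + 1  # id of a brand-new singleton cluster
    for node, current in part.items():
        for target in cluster_ids | {fresh}:
            if target != current:
                moved = dict(part)
                moved[node] = target
                yield moved


def hill_climb(nodes: List[str], edges: List[Edge],
               seed: Optional[int] = None) -> Tuple[Partition, float]:
    """Start from a random partition and climb until no neighbor is better."""
    rng = random.Random(seed)
    current = {n: rng.randrange(len(nodes)) for n in nodes}
    current_mq = mq(edges, current)
    while True:
        best, best_mq = None, current_mq
        for cand in neighbors(current):
            cand_mq = mq(edges, cand)
            if cand_mq > best_mq:          # "Better?" in the flowchart
                best, best_mq = cand, cand_mq
        if best is None:                   # convergence: a local optimum
            return current, current_mq
        current, current_mq = best, best_mq


nodes = ["M1", "M2", "M3", "M4", "M5", "M6"]
edges = [("M1", "M2"), ("M2", "M3"), ("M1", "M3"),
         ("M4", "M5"), ("M5", "M6"), ("M4", "M6"), ("M3", "M4")]
print(hill_climb(nodes, edges, seed=1))
```

Because each step must strictly increase MQ, the loop terminates, but only at a local optimum. Different random starting points can therefore end in different partitions.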
Bunch Hill Climbing Clustering Algorithm
Other Things of Interest
  • We have implemented a family of hill-climbing algorithms
  • We also implemented an exhaustive algorithm and a genetic algorithm
     Hierarchical Clustering (1):
     Nested View
[Figure: four views of the nested clustering hierarchy, numbered 1 to 4; view 2 is marked “Default”.]
     Hierarchical Clustering (2):
     Consolidated View
[Figure: four views of the consolidated clustering hierarchy, numbered 1 to 4; view 2 is marked “Default”.]
Hierarchical Clustering (3):
Tree View




Observations

  • The number of levels in a given system’s clustering hierarchy
    is bounded by O(log_2 N), where N is the number of nodes in the
    MDG, because Bunch places at least 2 nodes in each cluster.
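
The bound follows from a halving argument: if every cluster holds at least 2 nodes, each level has at most half as many clusters as the level below it. The sketch below counts the levels under that assumption. It also assumes the hierarchy stops once a single cluster remains, and `max_hierarchy_levels` is an illustrative name.

```python
def max_hierarchy_levels(n_nodes: int) -> int:
    """Upper bound on hierarchy levels when every cluster has >= 2 nodes."""
    levels = 0
    remaining = n_nodes
    while remaining > 1:
        remaining //= 2  # at most half as many clusters as nodes below
        levels += 1
    return levels


for n in (8, 13, 1000):
    print(n, max_hierarchy_levels(n))
```

For any N >= 1 the result equals floor(log2 N), so the number of levels grows only logarithmically with the size of the system.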


Evaluating The Software
Clustering Results

  Over the past few years we have spent
  a lot of time evaluating Bunch’s
  software clustering results
      Empirically
      Semi-formally
      Measuring Similarity



What We Know
  Given a particular MDG, the results
  produced by Bunch converge to a
  family of related solutions
  The search space is large, and the
  probability of finding a good solution by
  random sampling is infinitesimal



Software Clustering using Graph
Partitioning Techniques
  Running Bunch multiple times produces a
  family of related clustering results
      Bunch starts with a random partition of the MDG,
       and makes random moves to explore the search
       space




How related are these clustering results?




Given that there are 27,644,437 distinct partitions
of this MDG, there is a lot of agreement…
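
One simple way to quantify agreement between two clustering runs is the fraction of module pairs that both runs treat the same way (together in both, or apart in both). This is the standard Rand index. The slides do not say which similarity measure Bunch uses, so this is only an illustrative stand-in, and the function name is hypothetical.

```python
from itertools import combinations
from typing import Dict


def pairwise_agreement(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Rand index: share of module pairs on which two partitions agree."""
    if set(a) != set(b):
        raise ValueError("both partitions must cover the same modules")
    pairs = list(combinations(sorted(a), 2))
    if not pairs:
        return 1.0  # zero or one module: trivially identical
    agree = sum((a[x] == a[y]) == (b[x] == b[y]) for x, y in pairs)
    return agree / len(pairs)


run1 = {"M1": 0, "M2": 0, "M3": 1, "M4": 1}
run2 = {"M1": 5, "M2": 5, "M3": 7, "M4": 7}   # same grouping, different ids
run3 = {"M1": 0, "M2": 1, "M3": 0, "M4": 1}
print(pairwise_agreement(run1, run2), pairwise_agreement(run1, run3))
```

The measure ignores cluster ids, which matters here because independent runs have no reason to number their clusters consistently.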




Why Some Modules Don’t Agree…
[Figure callouts:]
  Library Modules
  Isomorphism
  Omnipresent Module Influences
Special Modules
  Isomorphic – Modules that are connected to multiple clusters with equal strength
  Library – Modules whose edges all fan in
  Driver – Modules whose edges all fan out
  Omnipresent – Modules that are strongly connected to many other modules in the system
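
Three of these categories can be read straight off the MDG edges. This is a minimal sketch with hypothetical names. The `omnipresent_ratio` threshold is an assumption, since the slides do not say how strong a connection must be. Isomorphic modules are left out because detecting them requires a partition as well as the graph.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (source module, destination module)


def classify_modules(edges: List[Edge],
                     omnipresent_ratio: float = 0.75) -> Dict[str, List[str]]:
    """Label library, driver, and omnipresent modules from MDG edges."""
    fan_in: Dict[str, Set[str]] = {}
    fan_out: Dict[str, Set[str]] = {}
    modules: Set[str] = set()
    for src, dst in edges:
        modules.update((src, dst))
        if src != dst:  # ignore self-references
            fan_out.setdefault(src, set()).add(dst)
            fan_in.setdefault(dst, set()).add(src)

    result: Dict[str, List[str]] = {"library": [], "driver": [], "omnipresent": []}
    others = len(modules) - 1
    for m in sorted(modules):
        ins, outs = fan_in.get(m, set()), fan_out.get(m, set())
        if ins and not outs:
            result["library"].append(m)      # all edges fan in
        if outs and not ins:
            result["driver"].append(m)       # all edges fan out
        # Assumed rule: connected to at least this share of the other modules.
        if others > 0 and len(ins | outs) / others >= omnipresent_ratio:
            result["omnipresent"].append(m)
    return result


edges = [("main", "a"), ("main", "b"), ("a", "c"), ("b", "d"),
         ("a", "log"), ("b", "log"), ("c", "log"), ("d", "log"),
         ("main", "log")]
print(classify_modules(edges))
```

A module can fall into more than one category: in the example, `log` is both a library and omnipresent.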

Clustering a System Many Times (1)…
[Figure: plots for two systems, RCS and Dot, comparing random partitions with partitions produced by Bunch (legend: Random, Bunch). Each system has four panels:
  - (Random): MQ Value vs. Number of Clusters
  - (Bunch): MQ Value vs. Number of Clusters
  - Number of Clusters vs. Sample (0 to 1000)
  - MQ vs. Sample (0 to 1000)]

Clustering a System Many Times (2)…
[Figure: plots for two systems, Swing and Bunch, comparing random partitions with partitions produced by Bunch (legend: Random, Bunch). Each system has four panels:
  - (Random): MQ Value vs. Number of Clusters
  - (Bunch): MQ Value vs. Number of Clusters
  - Number of Clusters vs. Sample (0 to 1000)
  - MQ vs. Sample (0 to 1000)]

Clustering a System Many Times (2)…
Observations
  • As the number of clusters increased in the random samples, MQ decreased
  • Bunch converged to a consistent “family” of solutions, no matter where the random starting point was generated
  • Some solutions were multi-modal
  • Random solutions were consistently




                                                                                                                                                                                MQ
                       2                                                        2                                                                                                        2
                                                                                                                                           50
                   1.5                                                      1.5                                                                                                      1.5


                   0.5
                       1
                                        worse than Bunch’s solutions.       0.5
                                                                                1                                                          25
                                                                                                                                                                                     0.5
                                                                                                                                                                                         1


                       0                                                        0                                                           0                                            0
                           0     25   50    75    100      125                      0    25   50    75    100     125                           0   250    500     750   1000                0    250     500    750   1000
                                 Number of Clusters                                     Number of Clusters                                                Sample                                        Sample

                               Drexel University Software Engineering Research Group (SERG)
                               http://serg.mcs.drexel.edu
                                                                                                                                                                                                                 23
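The random-sample observations above can be reproduced in miniature. The sketch below is not Bunch itself: it assumes MQ is the sum of per-cluster factors 2μ / (2μ + ε), where μ counts edges inside a cluster and ε counts edges crossing its boundary, and it uses a hypothetical six-node MDG (two triangles joined by one edge). The names `mq`, `random_partition`, and `sample_mq` are illustrative only.

```python
import random
from collections import defaultdict


def mq(edges, cluster_of):
    """Sum of cluster factors 2*mu / (2*mu + eps) over clusters with mu > 0."""
    mu = defaultdict(int)    # edges with both endpoints inside the cluster
    eps = defaultdict(int)   # edges crossing the cluster boundary, either direction
    for src, dst in edges:
        a, b = cluster_of[src], cluster_of[dst]
        if a == b:
            mu[a] += 1
        else:
            eps[a] += 1
            eps[b] += 1
    return sum(2 * m / (2 * m + eps[c]) for c, m in mu.items())


def random_partition(nodes, k, rng):
    """Random assignment of nodes to exactly k non-empty clusters."""
    order = list(nodes)
    rng.shuffle(order)
    # The first k nodes seed one cluster each; the rest land in any cluster.
    return {n: (i if i < k else rng.randrange(k)) for i, n in enumerate(order)}


# Hypothetical toy MDG (an assumption, not one of the systems in the slides).
EDGES = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
NODES = range(6)


def sample_mq(k, samples, seed=0):
    """MQ of `samples` random partitions that each have exactly k clusters."""
    rng = random.Random(seed)
    return [mq(EDGES, random_partition(NODES, k, rng)) for _ in range(samples)]


if __name__ == "__main__":
    # Mean MQ of random partitions, grouped by number of clusters.
    for k in range(1, 7):
        scores = sample_mq(k, samples=1000)
        print(k, round(sum(scores) / len(scores), 3))
```

On a graph this small the trend is only suggestive; the slides report it for MDGs with hundreds of modules.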
Example - Detailed Results:
Bunch System
    The search space has some inherent structure, as random clusters
    constrained to the area where Bunch converged did not produce better
    MQ values.

    [Figure: "MQ versus Number of Clusters" (0 to 20 clusters), shown
     with a 23% / 77% split]
    [Figure: "MQ For Random Clusters (4-8)" and "MQ For Random Clusters
     (11-16)", each plotting MQ versus Sample (0 to 1000 samples)]
Understanding the Search Space
  There are characteristics of Bunch’s clustering
  algorithms that are interesting:
      It seems unusual that the clustering algorithms
       produce consistent MQ values given the large
       search space
      Other approaches [spectral methods] to solving
       the clustering problem using Bunch’s MQ have not
       produced better clustering results
      The median clustering level is a good tradeoff
       between cluster size and number of clusters
          Harman et al. examined using a target granularity
            [GECCO’02] to bias the desired cluster sizes


Investigating the Search Space
  Examined multiple systems of different
  size:
      15 open source systems developed in C,
       C++, or Java
      13 randomly generated graphs with
       different properties that we wanted to
       investigate
We clustered each MDG 500 times and examined
the clustering data to gain some insight into the
search space.
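The repeated-run experiment can be sketched as follows. This is a stand-in, not Bunch's implementation: `hill_climb` is a plain single-node-move hill climber, MQ is assumed to be the sum of cluster factors 2μ / (2μ + ε), and the six-node MDG is hypothetical. All function names are illustrative.

```python
import random
from collections import defaultdict


def mq(edges, cluster_of):
    """Sum of cluster factors 2*mu / (2*mu + eps) over clusters with mu > 0."""
    mu = defaultdict(int)    # intra-cluster edges
    eps = defaultdict(int)   # edges crossing the cluster boundary
    for src, dst in edges:
        a, b = cluster_of[src], cluster_of[dst]
        if a == b:
            mu[a] += 1
        else:
            eps[a] += 1
            eps[b] += 1
    return sum(2 * m / (2 * m + eps[c]) for c, m in mu.items())


def hill_climb(nodes, edges, rng):
    """Climb from a random partition by single-node moves until none improves MQ."""
    nodes = list(nodes)
    k = rng.randint(1, len(nodes))
    cluster_of = {n: rng.randrange(k) for n in nodes}
    best = mq(edges, cluster_of)
    improved = True
    while improved:
        improved = False
        for n in nodes:
            home = cluster_of[n]
            labels = set(cluster_of.values())
            fresh = max(labels) + 1      # moving to `fresh` makes n a singleton
            for target in labels | {fresh}:
                cluster_of[n] = target
                score = mq(edges, cluster_of)
                if score > best + 1e-12:
                    best, home, improved = score, target, True
            cluster_of[n] = home         # keep the best placement found for n
    return cluster_of, best


def repeat_runs(nodes, edges, runs, seed=0):
    """Cluster the same MDG `runs` times from independent random starts."""
    rng = random.Random(seed)
    return [hill_climb(nodes, edges, rng) for _ in range(runs)]


# Hypothetical toy MDG: two triangles joined by one edge.
EDGES = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]

if __name__ == "__main__":
    results = repeat_runs(range(6), EDGES, runs=500)
    scores = sorted(round(score, 3) for _, score in results)
    print("min", scores[0], "median", scores[len(scores) // 2], "max", scores[-1])
```

Collecting the MQ value and cluster count of every run is what makes the later per-system plots possible.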
Example: Median Clustering
Level
    [Figure: panels "swing" and "Kerberos v.5"; axis label: Cumulative MQ;
     series: L1, L2, L3, L4, L5, L6, L7, Median]
Example: Median Clustering
Level
    [Figure: panel "telnetd" (series L1, L2, L3, Median) and panel "php"
     (series L1, L2, L3, L4, Median); axis label: MQ]
Example: Median Clustering
Level
    [Figure: panels "bash", "mod_ssl", "lynx", "ping_libc", "elm", "mailx";
     X Axis: MQ Value]
Example: Median Clustering
Level – Random Bipartite Graphs
    [Figure: panels "bip-100-1", "bip-100-2", "bip-100-5", "bip-100-25",
     "bip-100-75"; X Axis: MQ Value]
Example: Median Clustering
Level – Random Graphs
    [Figure: panels "rnd-100-1", "rnd-100-2", "rnd-100-5", "rnd-100-25",
     "rnd-100-75"; X Axis: MQ Value]
Example: Median Clustering
Level – Random “Circle” Graphs
    [Figure: panels "circle-50", "circle-100", "circle-150";
     X Axis: MQ Value]
MQ versus #Clusters
    X Axis: #Clusters
    Y Axis: MQ Value
    [Figure: panels "krb5", "swing", "telnetd", "php", "bash", "mod_ssl",
     "ping_libc", "elm", "lynx", "mailx"]
MQ versus #Clusters
    X Axis: #Clusters
    Y Axis: MQ Value
    [Figure: panels "bip-100-1", "bip-100-5", "bip-100-25", "bip-100-75",
     "rnd-100-1", "rnd-100-5", "rnd-100-25", "rnd-100-75", "cir-50",
     "cir-100", "cir-150"]
Internal- versus
External Edges
    X Axis: External Edges
    Y Axis: Internal Edges
    [Figure: panels "krb5", "swing", "telnetd", "php", "bash", "mod_ssl",
     "ping_libc", "elm", "lynx", "mailx"]
Internal- versus
External Edges
    X Axis: External Edges
    Y Axis: Internal Edges
    [Figure: panels "bip-100-1", "bip-100-5", "bip-100-25", "bip-100-75",
     "rnd-100-1", "rnd-100-5", "rnd-100-25", "rnd-100-75", "cir-50",
     "cir-100", "cir-150"]
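For reference, the two quantities on these axes can be counted directly from an MDG and a partition. The sketch assumes an internal edge has both endpoints in the same cluster and an external edge connects two different clusters; `edge_counts` and the six-node MDG are hypothetical, not taken from the slides.

```python
def edge_counts(edges, cluster_of):
    """Return (internal, external) edge counts for one partition of an MDG."""
    internal = sum(1 for src, dst in edges if cluster_of[src] == cluster_of[dst])
    return internal, len(edges) - internal


# Hypothetical toy MDG: two triangles joined by one edge.
EDGES = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]

by_triangle = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
singletons = {n: n for n in range(6)}

if __name__ == "__main__":
    print(edge_counts(EDGES, by_triangle))  # (6, 1)
    print(edge_counts(EDGES, singletons))   # (0, 7)
```

Each clustering run contributes one (external, internal) point to a panel, so a tight cloud of points indicates that repeated runs split the edges in nearly the same proportion.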
Real Systems
    [Figure: "Similarity of Clustering Results"; Y axis: Percentage
     (0 to 100); X axis: System (telnetd, crond, mailx, joe, dhcpd, php,
     elm, inn, bash, bunch, mod_ssl, lynx, swing, ping_libc, krb5);
     series: IntraEdge Agreement, Isomorphic Nodes]
Random Systems
    [Figure: "Similarity of Clustering Results"; Y axis: Percentage
     (0 to 100); X axis: System (bip-100-1, bip-100-2, bip-100-5,
     bip-100-25, bip-100-75, rnd-100-1, rnd-100-2, rnd-100-5, rnd-100-25,
     rnd-100-75, circle-50, circle-100, circle-150);
     series: IntraEdge Agreement, Isomorphic Nodes]
Real Systems
                                      Similarity of Clustering Results
              100
              90
              80
              70
 Percentage




              60
              50
              40
              30
              20                                                                 IntraEdge Agreement
              10
               0
                    telnetd

                              crond

                                       mailx

                                               joe

                                                     dhcpd

                                                             php

                                                                   elm

                                                                           inn

                                                                                 bash

                                                                                        bunch

                                                                                                mod_ssl

                                                                                                          lynx

                                                                                                                 swing

                                                                                                                         ping_libc

                                                                                                                                     krb5
                                                                         System

  Random Systems
    [Figure: Similarity of Clustering Results. Y-axis: Percentage (0 to 100); X-axis: System; legend entry: IntraEdge Agreement.
     Systems shown: bip-100-1, bip-100-2, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-2, rnd-100-5, rnd-100-25, rnd-100-75, circle-50, circle-100, circle-150]


What we Learned From Studying
the Search Landscape
    Not all modules are “equal”. Some modules:
        Are connected to many other modules
        Are connected to few other modules
        Have a large fan-in
        Have a large fan-out
        Are uniformly connected to other system components
        Are not uniformly connected to other system components
    Some modules have a more “natural” home than others with
    respect to their assigned cluster
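
The connectivity differences listed on this slide can be measured directly from an MDG. The following is a minimal sketch, assuming the MDG is available as a list of directed (source, target) edges; the edge list and the helper name fan_in_out are hypothetical and only illustrate the idea, they are not part of Bunch.

```python
from collections import Counter

# Hypothetical MDG: each pair (src, dst) means module src depends on module dst.
edges = [
    ("M1", "M2"), ("M1", "M3"), ("M2", "M3"),
    ("M4", "M3"), ("M4", "M5"), ("M5", "M3"), ("M3", "M6"),
]

def fan_in_out(edges):
    """Return {module: (fan_in, fan_out)} for every module in the edge list."""
    fan_in = Counter(dst for _, dst in edges)    # incoming dependencies
    fan_out = Counter(src for src, _ in edges)   # outgoing dependencies
    modules = sorted({m for edge in edges for m in edge})
    return {m: (fan_in[m], fan_out[m]) for m in modules}

degrees = fan_in_out(edges)
for module, (fin, fout) in degrees.items():
    print(f"{module}: fan-in={fin} fan-out={fout}")
```

In this made-up graph M3 has a large fan-in (4) while M1 and M4 have only outgoing edges, which is the kind of asymmetry the slide refers to.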
What we Learned From Studying
the Search Landscape
  Bunch tends to converge to a consistent
  solution with respect to MQ
      There is a very low probability of finding one of
       these partitions by random selection
      The partitions found by Bunch are a very small
       subset of the overall search landscape
  The degree of isomorphism in the clustering
  results was larger than expected
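
To give a feel for why random selection is so unlikely to hit one of these partitions, the sketch below counts the partitions of a set of n modules. It assumes the search space is the set of all partitions of the module set, whose size is the Bell number B(n); the function name bell is our own and the computation uses the standard Bell triangle.

```python
def bell(n):
    """Number of ways to partition a set of n elements (Bell number B(n))."""
    row = [1]                      # first row of the Bell triangle
    for _ in range(n):
        new = [row[-1]]            # each row starts with the last entry of the previous row
        for x in row:
            new.append(new[-1] + x)
        row = new
    return row[0]

for n in (5, 10, 20):
    print(n, bell(n))
```

Even a 10-module system has 115,975 possible partitions, so the small set of partitions that Bunch converges to is a tiny fraction of the landscape.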


What we Learned From Studying
the Search Landscape
  When examining the median level of the clustering
  hierarchy, we observed that all systems tend to
  converge to at most 2 levels
      The systems that we studied range from under 100 modules
       to several thousand modules
      The number of levels in the clustering hierarchy is bounded
       by O(log2 N), where N is the number of modules
      We expect that studying systems with several hundred
       thousand modules would produce results where the median
       level converges to more than 2 levels
          We observed this behavior in very sparse graphs
            (e.g., rnd-100-1 and bip-100-1)
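
For scale, the logarithmic bound can be evaluated for a few system sizes. This is only an illustration: it assumes a base-2 logarithm with a hidden constant of 1, and the sizes 5,000 and 300,000 are hypothetical stand-ins for "several thousand" and "several hundred thousand" modules.

```python
def max_levels(n_modules):
    """ceil(log2(n_modules)), computed exactly with integer arithmetic."""
    return (n_modules - 1).bit_length()

for n in (100, 5_000, 300_000):
    print(f"N={n}: at most about {max_levels(n)} levels")
```

Under these assumptions the bound is 7, 13 and 19 levels respectively, so an observed median of 2 levels sits well below it.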



Conclusions (1)
  Understanding the search landscape is
  important
      A single run of Bunch is helpful, but it does
       not highlight modules/classes that tend to
       drift between clusters
      Analysis of many Bunch runs helps build a
       mental model of the search landscape
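
One way to act on this point is to compare where each module lands across many runs. The sketch below is an assumption-laden illustration, not a Bunch feature: each run is a hypothetical dict mapping module name to cluster id, and a module's stability is the mean Jaccard similarity of its sets of cluster-mates over all pairs of runs. Low scores suggest a module that drifts.

```python
from itertools import combinations

def co_members(partition, module):
    """Modules that share a cluster with `module` in one run (excluding itself)."""
    cluster = partition[module]
    return {m for m, c in partition.items() if c == cluster and m != module}

def stability(runs, module):
    """Mean Jaccard similarity of the module's cluster-mates over all run pairs."""
    mates = [co_members(p, module) for p in runs]
    scores = []
    for a, b in combinations(mates, 2):
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

# Three hypothetical clustering runs; cluster ids need not match across runs.
runs = [
    {"M1": 0, "M2": 0, "M3": 0, "M4": 1, "M5": 1},
    {"M1": 0, "M2": 0, "M3": 1, "M4": 1, "M5": 1},
    {"M1": 1, "M2": 1, "M3": 1, "M4": 0, "M5": 0},
]

scores = {m: stability(runs, m) for m in runs[0]}
drifter = min(scores, key=scores.get)
print(drifter, round(scores[drifter], 3))
```

Because the comparison uses cluster-mates rather than cluster ids, relabeling clusters between runs (as in the third run) does not affect the score. In this toy data M3 moves between the two groups and gets the lowest score.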


Conclusions (2)
  A best practice for program understanding
      Cluster a system many times in order to
       understand the search landscape
      Identify and separate omnipresent, library and
       supplier modules
      Identify modules that tend to drift between many
       subsystems
          Assign to other clusters manually, or influence the
           clustering algorithm by adjusting the edge weights
          Bunch supports manual and semi-automatic clustering
           features to help with this type of analysis


Questions
  Special Thanks To:
      AT&T Research
      Sun Microsystems
      DARPA
      NSF
      US Army

      SEMINAL Group


