The Search Landscape of
Graph Partitioning Problems
using Coupling and Cohesion as
the Clustering Criteria
Brian S. Mitchell & Spiros Mancoridis
{bmitchel,smancori}@mcs.drexel.edu
http://www.mcs.drexel.edu/~{bmitchel,smancori}
Department of Computer Science
Software Engineering Research Group
http://serg.mcs.drexel.edu
Drexel University, Philadelphia, PA, USA
10/05/2002
Software Clustering with Bunch
[Figure: The Bunch workflow. Source code analysis tools (Acacia, Chava) turn source code into an MDG file; Bunch's clustering algorithms, reached through the Bunch GUI or the programming API, turn the MDG file into a partitioned MDG file over the example modules M1 to M8; a visualization tool displays the clustering.]
Drexel University Software Engineering Research Group (SERG)
http://serg.mcs.drexel.edu
Software Clustering as a Search Problem
[Figure: Source code analysis tools (Acacia, Chava) turn source code into an MDG over modules M1 to M8. The search space is the set of all MDG partitions (total = 4140 partitions for this MDG). A software clustering search algorithm explores this space and returns a "good" MDG partition.]
Software clustering search algorithm (pseudocode):
    bP = null;
    while (searching())
    {
        p = selectNext();
        if (p.isBetter(bP))
            bP = p;
    }
    return bP;
The Search Space is Enormous
The number of MDG partitions grows very quickly as the number of modules in the system increases…
$$S_{n,k} = \begin{cases} 1 & \text{if } k = 1 \text{ or } k = n \\ S_{n-1,k-1} + k\,S_{n-1,k} & \text{otherwise} \end{cases}$$
Here $S_{n,k}$ counts the partitions of n modules into exactly k clusters; the totals below sum $S_{n,k}$ over k = 1, …, n.
Modules   Partitions
1         1
2         2
3         5
4         15
5         52
6         203
7         877
8         4140
9         21147
10        115975
11        678570
12        4213597
13        27644437
14        190899322
15        1382958545
16        10480142147
17        82864869804
18        682076806159
19        5832742205057
20        51724158235372
A 15-module system is about the limit for performing exhaustive analysis
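The table can be reproduced directly from the recurrence. The following is a minimal sketch; the function names `stirling2` and `partition_count` are chosen here for illustration and do not come from Bunch.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Number of ways to split n modules into exactly k non-empty clusters."""
    if k < 1 or k > n:
        return 0
    if k == 1 or k == n:
        return 1
    # Module n either forms a new cluster on its own (first term)
    # or joins one of the k existing clusters (second term).
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)


def partition_count(n: int) -> int:
    """Total number of partitions of an n-module MDG, summed over all k."""
    return sum(stirling2(n, k) for k in range(1, n + 1))


for n in (8, 13, 15, 20):
    print(n, partition_count(n))
```

Memoizing the recurrence keeps the computation cheap even for n = 20; it is enumerating and scoring the partitions, not counting them, that becomes infeasible.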
Our Assumption…
“Well designed software systems are
organized into cohesive clusters that are
loosely interconnected.”
We designed a measurement called MQ that
embodies our assumption
The MQ measurement balances cohesion and
coupling
We apply MQ to partitions of the MDG
Not all Partitions of the MDG are
Good Solutions
[Figure: An MDG with six modules (M1 to M6), shown next to a good partition and a bad partition of the same graph.]
MQ(Good Partition) > MQ(Bad Partition)
The Software Clustering Problem:
Algorithm Objectives
“Find a good partition of the MDG.”
A partition is the decomposition of a set of
elements (i.e., all the nodes of the graph)
into mutually disjoint clusters.
A good partition is a partition where:
highly interdependent nodes are grouped in the
same clusters
independent nodes are assigned to separate
clusters
The better the partition the higher the MQ
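The slides do not give the formula for MQ, only that it rewards cohesion and penalizes coupling. The sketch below therefore rests on an assumption: it uses a "cluster factor" formulation in which each cluster contributes 2μ / (2μ + ε), where μ is the weight of edges inside the cluster and ε is the weight of edges crossing its boundary, and MQ is the sum over clusters. The function name `mq`, the edge dictionary, and the six-module example graph are all hypothetical.

```python
from collections import defaultdict


def mq(edges, cluster_of):
    """Assumed MQ: sum over clusters of 2*mu / (2*mu + eps).

    edges maps (source, target) module pairs to an edge weight;
    cluster_of maps each module to its cluster label.
    """
    internal = defaultdict(float)  # mu: edge weight inside each cluster
    crossing = defaultdict(float)  # eps: edge weight leaving or entering it
    for (src, dst), weight in edges.items():
        a, b = cluster_of[src], cluster_of[dst]
        if a == b:
            internal[a] += weight
        else:
            crossing[a] += weight
            crossing[b] += weight
    total = 0.0
    for cluster in set(cluster_of.values()):
        mu, eps = internal[cluster], crossing[cluster]
        if mu > 0:  # a cluster with no internal edges contributes nothing
            total += 2 * mu / (2 * mu + eps)
    return total


# Hypothetical MDG: two triangles joined by a single edge.
edges = {
    ("M1", "M2"): 1, ("M2", "M3"): 1, ("M1", "M3"): 1,
    ("M4", "M5"): 1, ("M5", "M6"): 1, ("M4", "M6"): 1,
    ("M3", "M4"): 1,
}
good = {"M1": 0, "M2": 0, "M3": 0, "M4": 1, "M5": 1, "M6": 1}
bad = {"M1": 0, "M4": 0, "M2": 1, "M5": 1, "M3": 2, "M6": 2}
print(mq(edges, good), mq(edges, bad))
```

Under this assumed formula the good partition scores 12/7 (about 1.71), because each triangle has three internal edges and one crossing edge, while the bad partition scores 0 because every edge crosses a cluster boundary.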
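Bunch Hill Climbing Clustering Algorithm
[Figure: Flowchart of the hill-climbing algorithm, summarized in the steps below.]
Generate a random decomposition of the MDG
Iteration step: generate the next neighboring partition of the current partition, measure its MQ, and compare it to the best neighboring partition found so far; if it is better, it becomes the new best neighbor
The best neighboring partition for the iteration, if better than the current partition, becomes the current partition
Repeat until convergence and return the best neighboring partition
Neighbor partition: a neighbor partition is created by altering the current partition slightly.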
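The flow above can be sketched as a steepest-ascent hill climber. This is an illustration under stated assumptions, not Bunch's implementation: a neighbor is taken to be the current partition with one module moved to another cluster (or to a new singleton cluster), the quality function is passed in as a parameter, and `assumed_mq` is a compact stand-in objective (a cluster-factor sum) that the slides do not define. All names are hypothetical.

```python
import random


def assumed_mq(edges, partition):
    """Stand-in objective: sum over clusters of 2*mu / (2*mu + eps)."""
    mu, eps = {}, {}
    for src, dst in edges:
        a, b = partition[src], partition[dst]
        if a == b:
            mu[a] = mu.get(a, 0) + 1
        else:
            eps[a] = eps.get(a, 0) + 1
            eps[b] = eps.get(b, 0) + 1
    return sum(2 * m / (2 * m + eps.get(c, 0)) for c, m in mu.items())


def neighbors(partition):
    """Yield partitions that differ from `partition` by moving one module."""
    labels = set(partition.values())
    fresh = max(labels) + 1  # label for a brand-new singleton cluster
    for module, current in partition.items():
        for target in labels | {fresh}:
            if target != current:
                moved = dict(partition)
                moved[module] = target
                yield moved


def hill_climb(modules, quality, rng):
    """Climb from a random decomposition until no neighbor is better."""
    current = {m: rng.randrange(len(modules)) for m in modules}
    current_score = quality(current)
    while True:
        best, best_score = None, current_score
        for candidate in neighbors(current):
            score = quality(candidate)
            if score > best_score:
                best, best_score = candidate, score
        if best is None:  # convergence: no neighbor improves the score
            return current, current_score
        current, current_score = best, best_score


# Hypothetical MDG: two triangles joined by a single edge.
demo_edges = [("M1", "M2"), ("M2", "M3"), ("M1", "M3"),
              ("M4", "M5"), ("M5", "M6"), ("M4", "M6"), ("M3", "M4")]
demo_modules = ["M1", "M2", "M3", "M4", "M5", "M6"]
result, result_score = hill_climb(
    demo_modules, lambda p: assumed_mq(demo_edges, p), random.Random(0))
print(result, result_score)
```

Because the score must strictly increase at every step and the number of partitions is finite, the loop always terminates, but only at a local optimum; different random seeds may end at different partitions.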
Bunch Hill Climbing Clustering Algorithm:
Other Things of Interest
We have implemented a family of hill-climbing algorithms
We also implemented an exhaustive algorithm and a genetic algorithm
Hierarchical Clustering (1):
Nested View
[Figure: Nested view of the clustering hierarchy in four numbered panels (1 to 4); panel 2 is marked as the default.]
Hierarchical Clustering (2):
Consolidated View
[Figure: Consolidated view of the clustering hierarchy in four numbered panels (1 to 4); panel 2 is marked as the default.]
Hierarchical Clustering (3):
Tree View
Observations
• The number of levels in a given
system’s clustering hierarchy is
bounded by $O(\log_2 N)$, where N is
the number of nodes in the MDG,
because Bunch places at least 2
nodes in each cluster.
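The bound follows from a halving argument. As a sketch, assume that each level clusters the clusters of the level below it, and let $n_i$ (a symbol introduced here for illustration) be the number of nodes at level $i$, with the modules themselves at level 0:

```latex
n_0 = N, \qquad n_{i+1} \le \frac{n_i}{2}
\;\Longrightarrow\; n_L \le \frac{N}{2^{L}}, \qquad
n_L \ge 1 \;\Longrightarrow\; L \le \log_2 N .
```

Since every cluster holds at least 2 nodes, each level has at most half as many nodes as the one below, so the hierarchy runs out of nodes to merge after at most $\log_2 N$ levels.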
Evaluating The Software
Clustering Results
Over the past few years we have spent
a lot of time evaluating Bunch’s
software clustering results
Empirically
Semi-formally
Measuring Similarity
What We Know
Given a particular MDG, the results
produced by Bunch converge to a
family of related solutions
The search space is large, and the
probability of finding a good solution by
random sampling is infinitesimal
Software Clustering using Graph
Partitioning Techniques
Running Bunch multiple times produces a
family of related clustering results
Bunch starts with a random partition of the MDG,
and makes random moves to explore the search
space
Software Clustering using Graph
Partitioning Techniques
How related are these clustering results?
Software Clustering using Graph
Partitioning Techniques
Given that there are 27,644,437 distinct partitions
of this MDG, there is a lot of agreement…
Software Clustering using Graph
Partitioning Techniques
Why Some Modules Don’t Agree…
Library Modules
Isomorphism
Omnipresent
Module Influences
Special Modules
Isomorphic – Modules that are
connected to multiple clusters with
equal strength
Library – All edges fan-in
Driver – All edges fan-out
Omnipresent – Modules that are
strongly connected to many other
modules in the system
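Three of these categories depend only on the edges of the MDG, so they can be detected before clustering. The sketch below is illustrative: the function name and the `omnipresent_threshold` parameter are assumptions, since the slides do not say how many connections count as "many". Isomorphic modules are left out because identifying them requires a partition.

```python
def classify_modules(edges, omnipresent_threshold=3):
    """Tag each module as 'library', 'driver', and/or 'omnipresent'.

    edges is an iterable of (source, target) dependency pairs.
    omnipresent_threshold is a hypothetical cutoff on the number of
    distinct modules a module is connected to.
    """
    fan_in, fan_out, partners = {}, {}, {}
    for src, dst in edges:
        fan_out[src] = fan_out.get(src, 0) + 1
        fan_in[dst] = fan_in.get(dst, 0) + 1
        partners.setdefault(src, set()).add(dst)
        partners.setdefault(dst, set()).add(src)
    labels = {}
    for module, connected in partners.items():
        tags = set()
        if fan_in.get(module, 0) > 0 and fan_out.get(module, 0) == 0:
            tags.add("library")      # all edges fan in
        if fan_out.get(module, 0) > 0 and fan_in.get(module, 0) == 0:
            tags.add("driver")       # all edges fan out
        if len(connected) >= omnipresent_threshold:
            tags.add("omnipresent")  # connected to many other modules
        labels[module] = tags
    return labels


# Hypothetical dependency edges.
sample_edges = [("main", "parser"), ("main", "util"), ("main", "eval"),
                ("parser", "util"), ("eval", "util")]
print(classify_modules(sample_edges))
```

In practice the threshold would be set relative to the size of the system, for example as a fraction of the total number of modules.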
Clustering a System Many Times (1)…
[Figure: Results for RCS and Dot, four panels per system, with random partitions and Bunch partitions distinguished in the legend. Panels: MQ value versus number of clusters (random); MQ value versus number of clusters (Bunch); number of clusters per sample (samples 0 to 1000); MQ value per sample (samples 0 to 1000).]
Clustering a System Many Times (2)…
[Figure: Results for Swing and Bunch, four panels per system, with random partitions and Bunch partitions distinguished in the legend. Panels: MQ value versus number of clusters (random); MQ value versus number of clusters (Bunch); number of clusters per sample (samples 0 to 1000); MQ value per sample (samples 0 to 1000).]
Observations
• As the number of clusters increased in the random samples, MQ decreased
• Bunch converged to a consistent “family” of solutions, no matter where the random starting point was generated
• Some solutions were multi-modal
• Random solutions were consistently worse than Bunch’s solutions.
Example - Detailed Results: Bunch System
The search space has some inherent structure, as random clusters constrained to the area where Bunch converged did not produce better MQ values.
[Figure: MQ versus number of clusters for the Bunch system, with two regions annotated 23% and 77%. Two further panels show MQ per sample (samples 0 to 1000) for random clusters (4-8) and random clusters (11-16).]
Understanding the Search Space
There are characteristics of Bunch’s clustering
algorithms that are interesting:
It seems unusual that the clustering algorithms
produce consistent MQ values given the large
search space
Other approaches [spectral methods] to solving
the clustering problem using Bunch’s MQ have not
produced better clustering results
The median clustering level is a good tradeoff
between cluster size and number of clusters
Harman et al. examined using a target granularity
[GECCO’02] to bias the desired cluster sizes
Investigating the Search Space
Examined multiple systems of different
size:
15 open source systems developed in C,
C++, or Java
13 randomly generated graphs with
different properties that we wanted to
investigate
We clustered each MDG 500 times and examined
the clustering data to gain some insight into the
search space.
Example: Median Clustering Level
[Figure: Cumulative MQ by clustering level for swing and Kerberos v.5; each panel plots levels L1 to L7 and the median level.]
Example: Median Clustering Level
[Figure: MQ by clustering level for telnetd (levels L1 to L3 and the median level) and php (levels L1 to L4 and the median level).]
Example: Median Clustering Level
[Figure: Six panels: bash, mod_ssl, lynx, ping_libc, elm, mailx. X axis: MQ value.]
Example: Median Clustering Level – Random Bipartite Graphs
[Figure: Five panels: bip-100-1, bip-100-2, bip-100-5, bip-100-25, bip-100-75. X axis: MQ value.]
Example: Median Clustering Level – Random Graphs
[Figure: Five panels: rnd-100-1, rnd-100-2, rnd-100-5, rnd-100-25, rnd-100-75. X axis: MQ value.]
Example: Median Clustering Level – Random “Circle” Graphs
[Figure: Three panels: circle-50, circle-100, circle-150. X axis: MQ value.]
MQ versus #Clusters
[Figure: Ten panels: krb5, swing, telnetd, php, bash, mod_ssl, ping_libc, elm, lynx, mailx. X axis: #Clusters. Y axis: MQ value.]
MQ versus #Clusters
[Figure: Eleven panels: bip-100-1, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-5, rnd-100-25, rnd-100-75, cir-50, cir-100, cir-150. X axis: #Clusters. Y axis: MQ value.]
Internal- versus External Edges
[Figure: Ten panels: krb5, swing, telnetd, php, bash, mod_ssl, ping_libc, elm, lynx, mailx. X axis: external edges. Y axis: internal edges.]
Internal- versus External Edges
[Figure: Eleven panels: bip-100-1, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-5, rnd-100-25, rnd-100-75, cir-50, cir-100, cir-150. X axis: external edges. Y axis: internal edges.]
Real Systems
Similarity of Clustering Results
[Figure: Bar chart of IntraEdge Agreement and Isomorphic Nodes, as a percentage (0 to 100), per system: telnetd, crond, mailx, joe, dhcpd, php, elm, inn, bash, bunch, mod_ssl, lynx, swing, ping_libc, krb5.]
Random Systems
Similarity of Clustering Results
[Figure: Bar chart of IntraEdge Agreement and Isomorphic Nodes, as a percentage (0 to 100), per system: bip-100-1, bip-100-2, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-2, rnd-100-5, rnd-100-25, rnd-100-75, circle-50, circle-100, circle-150.]
Real Systems
Similarity of Clustering Results
[Figure: Bar chart of IntraEdge Agreement only, as a percentage (0 to 100), per system: telnetd, crond, mailx, joe, dhcpd, php, elm, inn, bash, bunch, mod_ssl, lynx, swing, ping_libc, krb5.]
Random Systems
Similarity of Clustering Results
[Figure: Bar chart of IntraEdge Agreement only, as a percentage (0 to 100), per system: bip-100-1, bip-100-2, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-2, rnd-100-5, rnd-100-25, rnd-100-75, circle-50, circle-100, circle-150.]
What we Learned From Studying
the Search Landscape
Not all modules are “equal” - Some modules:
Are connected to many other modules
Are connected to few other modules
Have a large fan-in
Have a large fan-out
Are uniformly connected to other system
components
Are not uniformly connected to other system
components
Some modules may have a more “natural” home than
others with respect to their assigned cluster
What we Learned From Studying
the Search Landscape
Bunch tends to converge to a consistent
solution with respect to MQ
There is a very low probability of finding one of
these partitions by random selection
The partitions found by Bunch are a very small
subset of the overall search landscape
The degree of isomorphism in the clustering
results was larger than expected
What we Learned From Studying
the Search Landscape
When examining the median level of the clustering
hierarchy we observed that all systems tend to
converge to at most 2 levels
The systems that we studied range from under 100 modules
to several thousand modules
The number of levels in the clustering hierarchy is bounded
by $O(\log_2 N)$
We expect that studying systems with several hundred
thousand modules would produce results where the median
level converges to more than 2 levels.
We observed this in very sparse graphs (e.g., rnd-100-1 and
bip-100-1)
Conclusions (1)
Understanding the search landscape is
important
A single run of Bunch is helpful, but it does
not highlight modules/classes that tend to
drift between clusters
Analysis of many Bunch runs helps build a
mental model of the search landscape
Conclusions (2)
A best practice for program understanding
Cluster a system many times in order to
understand the search landscape
Identify and separate omnipresent, library and
supplier modules
Identify modules that tend to drift between many
subsystems
Assign them to clusters manually, or influence the
clustering algorithm by adjusting the edge weights
Bunch supports manual and semi-automatic clustering
features to help with this type of analysis
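One way to spot drifting modules across many runs is to count how often each pair of modules lands in the same cluster. The sketch below is a simple heuristic offered as an assumption, not a Bunch feature: a module is flagged when it joins some partner only part of the time and has no partner it stays with consistently. The function names and the `low` and `high` cutoffs are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_clustering_frequency(runs):
    """Fraction of runs in which each pair of modules shares a cluster.

    runs is a list of partitions, each mapping module -> cluster label.
    """
    modules = sorted(runs[0])
    together = Counter()
    for partition in runs:
        for a, b in combinations(modules, 2):
            if partition[a] == partition[b]:
                together[(a, b)] += 1
    return {pair: together[pair] / len(runs)
            for pair in combinations(modules, 2)}


def drifting_modules(runs, low=0.2, high=0.8):
    """Modules that join partners only sometimes and have no steady partner."""
    freq = co_clustering_frequency(runs)
    drifting = set()
    for module in sorted(runs[0]):
        rates = [f for pair, f in freq.items() if module in pair]
        has_steady_partner = any(f >= high for f in rates)
        sometimes_joined = any(low < f < high for f in rates)
        if sometimes_joined and not has_steady_partner:
            drifting.add(module)
    return drifting


# Hypothetical results of four clustering runs over five modules.
sample_runs = [
    {"A": 0, "B": 0, "C": 1, "D": 1, "X": 0},
    {"A": 0, "B": 0, "C": 1, "D": 1, "X": 1},
    {"A": 0, "B": 0, "C": 1, "D": 1, "X": 0},
    {"A": 0, "B": 0, "C": 1, "D": 1, "X": 1},
]
print(drifting_modules(sample_runs))
```

The heuristic misses groups of modules that drift together, since each then has a steady partner; flagged modules are candidates for manual assignment or edge-weight adjustment rather than a definitive list.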
Questions
Special Thanks To:
AT&T Research
Sun Microsystems
DARPA
NSF
US Army
SEMINAL Group