The Search Landscape of Graph Partitioning Problems using Coupling and Cohesion as the Clustering Criteria

Brian S. Mitchell & Spiros Mancoridis

{bmitchel,smancori}@mcs.drexel.edu

http://www.mcs.drexel.edu/~{bmitchel,smancori}

Department of Computer Science

Software Engineering Research Group

http://serg.mcs.drexel.edu

Drexel University, Philadelphia, PA, USA




10/05/2002

Software Clustering with Bunch

[Figure: Bunch architecture diagram. Labels: Source Code; Source Code Analysis Tools (Acacia, Chava); MDG (Module Dependency Graph) File; Bunch Clustering Tool (Clustering Algorithms, Bunch GUI, Programming API); Partitioned MDG File; Visualization Tool. The example MDG contains modules M1 to M8.]

Drexel University Software Engineering Research Group (SERG)

http://serg.mcs.drexel.edu


Software Clustering as a Search Problem

[Figure: source code is converted by source code analysis tools (Acacia, Chava) into an MDG with modules M1 to M8. The search space is the set of all MDG partitions (Total = 4140 partitions for this MDG). A software clustering search algorithm explores this space and returns a “GOOD” MDG partition.]

Software clustering search algorithm (pseudocode shown on the slide):

    bP = null;
    while (searching())
    {
        p = selectNext();
        if (p.isBetter(bP))
            bP = p;
    }
    return bP;


The Search Space is Enormous

The number of MDG partitions grows very quickly as the number of modules in the system increases…

$$S_{n,k} = \begin{cases} 1 & \text{if } k = 1 \lor k = n \\ S_{n-1,k-1} + k\,S_{n-1,k} & \text{otherwise} \end{cases}$$

Number of modules = number of partitions:

 1 = 1         6 = 203        11 = 678570        16 = 10480142147
 2 = 2         7 = 877        12 = 4213597       17 = 82864869804
 3 = 5         8 = 4140       13 = 27644437      18 = 682076806159
 4 = 15        9 = 21147      14 = 190899322     19 = 5832742205057
 5 = 52       10 = 115975     15 = 1382958545    20 = 51724158235372

A 15 Module System is about the limit for performing Exhaustive Analysis
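The totals in the table can be reproduced from the recurrence. The sketch below is a minimal Python illustration, assuming that the total for n modules is the sum of S(n, k) over k = 1..n (the function names stirling2 and count_partitions are chosen here for illustration and do not come from the slides).

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Ways to split n modules into exactly k non-empty clusters (1 <= k <= n)."""
    if k == 1 or k == n:
        return 1
    # Either module n forms its own cluster, or it joins one of the k clusters.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)


def count_partitions(n: int) -> int:
    """Total number of partitions of an n-module MDG."""
    return sum(stirling2(n, k) for k in range(1, n + 1))


for n in (8, 13, 15, 20):
    print(n, count_partitions(n))
```

Memoizing the recurrence keeps the computation to O(n^2) table entries, so counting the partitions is cheap even though enumerating them is not.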


Our Assumption…

“Well designed software systems are organized into cohesive clusters that are loosely interconnected.”

We designed a measurement called MQ (Modularization Quality) that embodies our assumption

The MQ measurement balances cohesion and coupling

We apply MQ to partitions of the MDG






Not all Partitions of the MDG are Good Solutions

[Figure: an MDG with modules M1 to M6, shown next to a “Good Partition!” and a “Bad Partition!” of the same graph.]

MQ(Good Partition) > MQ(Bad Partition)
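The slides do not state the MQ formula. As a rough illustration of how such a measurement can balance cohesion and coupling, the sketch below assumes the cluster-factor formulation described in the authors' other Bunch publications: each cluster i with mu_i internal edges and eps_i edges crossing its boundary contributes 2*mu_i / (2*mu_i + eps_i), and MQ is the sum over clusters. The six-module graph and both partitions are hypothetical examples, not the ones drawn on the slide.

```python
from collections import defaultdict


def mq(edges, cluster_of):
    """Assumed MQ: sum over clusters of 2*mu / (2*mu + eps)."""
    intra = defaultdict(int)  # mu_i: edges with both ends inside cluster i
    inter = defaultdict(int)  # eps_i: edges with exactly one end in cluster i
    for src, dst in edges:
        a, b = cluster_of[src], cluster_of[dst]
        if a == b:
            intra[a] += 1
        else:
            inter[a] += 1
            inter[b] += 1
    # Clusters without internal edges contribute nothing.
    return sum(2 * mu / (2 * mu + inter[c]) for c, mu in intra.items())


# Hypothetical MDG: two dependency cycles joined by a single edge.
edges = [("M1", "M2"), ("M2", "M3"), ("M3", "M1"),
         ("M4", "M5"), ("M5", "M6"), ("M6", "M4"),
         ("M3", "M4")]

good = {"M1": 0, "M2": 0, "M3": 0, "M4": 1, "M5": 1, "M6": 1}
bad = {"M1": 0, "M4": 0, "M2": 1, "M5": 1, "M3": 2, "M6": 2}

print(mq(edges, good), mq(edges, bad))
```

In this example the good partition keeps each cycle in one cluster, so only one edge crosses a boundary, while the bad partition leaves no edge inside any cluster.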


The Software Clustering Problem: Algorithm Objectives

“Find a good partition of the MDG.”

A partition is the decomposition of a set of elements (i.e., all the nodes of the graph) into mutually disjoint clusters.

A good partition is a partition where:
• highly interdependent nodes are grouped in the same clusters
• independent nodes are assigned to separate clusters

The better the partition the higher the MQ


Bunch Hill Climbing Clustering Algorithm

[Figure: flowchart of the hill-climbing algorithm. Generate a random decomposition of the MDG as the current partition and measure its MQ. Iteration step: generate the next neighbor, measure its MQ, and compare it to the best neighboring partition; if it is better, it becomes the new best neighbor. The best neighboring partition for the iteration becomes the current partition, and the loop continues until convergence.]

Neighbor Partition: a neighbor partition is created by altering the current partition slightly.
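The flowchart can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Bunch's implementation: the objective mq uses the cluster-factor formulation from the authors' other publications (the slides do not give the formula), a neighbor is taken to be the current partition with one module moved to another cluster or to a new cluster, and each iteration moves to the best neighbor found. The names hill_climb and neighbors and the example graph are hypothetical.

```python
import random


def mq(edges, cluster_of):
    """Assumed objective: sum over clusters of 2*mu / (2*mu + eps)."""
    intra, inter = {}, {}
    for src, dst in edges:
        a, b = cluster_of[src], cluster_of[dst]
        if a == b:
            intra[a] = intra.get(a, 0) + 1
        else:
            inter[a] = inter.get(a, 0) + 1
            inter[b] = inter.get(b, 0) + 1
    return sum(2 * mu / (2 * mu + inter.get(c, 0)) for c, mu in intra.items())


def neighbors(cluster_of):
    """Yield partitions that differ from cluster_of by moving a single module."""
    labels = set(cluster_of.values())
    targets = sorted(labels | {max(labels) + 1})  # existing clusters plus a new one
    for module in sorted(cluster_of):
        for target in targets:
            if target != cluster_of[module]:
                moved = dict(cluster_of)
                moved[module] = target
                yield moved


def hill_climb(edges, modules, rng):
    """Start from a random partition and climb until no neighbor improves MQ."""
    current = {m: rng.randrange(len(modules)) for m in modules}
    current_mq = mq(edges, current)
    while True:
        best, best_mq = None, current_mq
        for candidate in neighbors(current):
            candidate_mq = mq(edges, candidate)
            if candidate_mq > best_mq:
                best, best_mq = candidate, candidate_mq
        if best is None:  # convergence: a local optimum of MQ
            return current, current_mq
        current, current_mq = best, best_mq


edges = [("M1", "M2"), ("M2", "M3"), ("M3", "M1"),
         ("M4", "M5"), ("M5", "M6"), ("M6", "M4"),
         ("M3", "M4")]
modules = sorted({m for edge in edges for m in edge})
partition, score = hill_climb(edges, modules, random.Random(0))
print(score, partition)
```

Because every accepted move strictly increases MQ and the number of partitions is finite, the loop always terminates, although only at a local optimum that depends on the random starting point.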


Bunch Hill Climbing Clustering Algorithm

Other Things of Interest

We have implemented a family of hill-climbing algorithms

We also implemented an exhaustive algorithm and a genetic algorithm


Hierarchical Clustering (1): Nested View

[Figure: nested view of the clustering hierarchy in four numbered panels (1 to 4), one of which is marked “Default”.]










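Hierarchical Clustering (2): Consolidated View

[Figure: consolidated view of the clustering hierarchy in four numbered panels (1 to 4), one of which is marked “Default”.]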











Hierarchical Clustering (3): Tree View

[Figure: tree view of the clustering hierarchy.]

Observations

• The number of levels for a given system’s clustering hierarchy is bounded by $O(\log_2 N)$, because Bunch places at least 2 nodes in each cluster.
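The bound follows from a halving argument: if every cluster holds at least 2 nodes, a level with n nodes has at most n/2 clusters, and those clusters are the nodes of the next level. A small sketch of that arithmetic (the function name max_levels is chosen here for illustration):

```python
def max_levels(n_modules: int) -> int:
    """Upper bound on hierarchy depth when every cluster holds at least 2 nodes."""
    levels = 0
    nodes = n_modules
    while nodes >= 2:
        nodes //= 2  # at most floor(nodes / 2) clusters of size >= 2
        levels += 1
    return levels


for n in (8, 100, 1000):
    print(n, max_levels(n))
```

The loop counts how many times n can be halved, which is the integer part of log2(n).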






Evaluating The Software Clustering Results

Over the past few years we have spent a lot of time evaluating Bunch’s software clustering results
• Empirically
• Semi-formally
• Measuring Similarity








What We Know

Given a particular MDG, the results produced by Bunch converge to a family of related solutions

The search space is large, and the probability of finding a good solution by random sampling is infinitesimal








Software Clustering using Graph Partitioning Techniques

Running Bunch multiple times produces a family of related clustering results
• Bunch starts with a random partition of the MDG, and makes random moves to explore the search space










Software Clustering using Graph Partitioning Techniques

How related are these clustering results?










Software Clustering using Graph Partitioning Techniques

Given that there are 27,644,437 distinct partitions of this MDG, there is a lot of agreement…










Software Clustering using Graph Partitioning Techniques

Why Some Modules Don’t Agree…

[Figure: clustering result annotated with three causes of disagreement: Library Modules, Isomorphism, and Omnipresent Module Influences.]


Special Modules

Isomorphic – Modules that are connected to multiple clusters with equal strength

Library – All edges fan-in

Driver – All edges fan-out

Omnipresent – Modules that are strongly connected to many other modules in the system
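These definitions can be turned into simple graph checks. The sketch below is one possible reading, not Bunch's implementation: library and driver modules follow directly from fan-in and fan-out, an omnipresent module is taken to be one whose number of neighbors is at least omnipresent_factor times the average (the factor of 3.0 is a hypothetical threshold), and an isomorphic module is one whose strongest connection is tied between two or more clusters. The function names and the example graph are illustrative assumptions.

```python
from collections import Counter, defaultdict


def classify_modules(edges, omnipresent_factor=3.0):
    """Return (library, driver, omnipresent) module sets for a directed MDG."""
    fan_in, fan_out = Counter(), Counter()
    neighbors = defaultdict(set)
    for src, dst in edges:
        fan_out[src] += 1
        fan_in[dst] += 1
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    modules = set(neighbors)
    average = sum(len(v) for v in neighbors.values()) / len(modules)
    library = {m for m in modules if fan_out[m] == 0}  # all edges fan-in
    driver = {m for m in modules if fan_in[m] == 0}    # all edges fan-out
    omnipresent = {m for m in modules
                   if len(neighbors[m]) >= omnipresent_factor * average}
    return library, driver, omnipresent


def isomorphic_modules(edges, cluster_of):
    """Modules whose strongest connection is shared equally by several clusters."""
    strength = defaultdict(Counter)  # module -> cluster -> connecting edges
    for src, dst in edges:
        strength[src][cluster_of[dst]] += 1
        strength[dst][cluster_of[src]] += 1
    result = set()
    for module, per_cluster in strength.items():
        top = max(per_cluster.values())
        if sum(1 for count in per_cluster.values() if count == top) > 1:
            result.add(module)
    return result


# Hypothetical MDG and partition.
edges = [("main", "a"), ("main", "b"), ("a", "util"), ("b", "util"), ("a", "b")]
cluster_of = {"main": 0, "a": 0, "b": 1, "util": 1}
print(classify_modules(edges))
print(isomorphic_modules(edges, cluster_of))
```

In the example, main only calls other modules and util is only called, so they are classified as driver and library; both are also tied between the two clusters.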




Clustering a System Many Times (1)…

[Figure: four plots each for the RCS and Dot systems, with a legend distinguishing Random and Bunch results. “(Random)” and “(Bunch)” panels: MQ value versus number of clusters. Remaining panels: number of clusters versus sample and MQ versus sample, for samples 0 to 1000.]






Clustering a System Many Times (2)…

[Figure: four plots each for the Swing and Bunch systems, with a legend distinguishing Random and Bunch results. “(Random)” and “(Bunch)” panels: MQ value versus number of clusters. Remaining panels: number of clusters versus sample and MQ versus sample, for samples 0 to 1000.]




Clustering a System Many Times (2)…

Observations

• As the number of clusters increased in the random samples, MQ decreased
• Bunch converged to a consistent “family” of solutions, no matter where the random starting point was generated
• Some solutions were multi-modal
• Random solutions were consistently worse than Bunch’s solutions.




Example - Detailed Results: Bunch System

The search space has some inherent structure, as random clusters constrained to the area where Bunch converged did not produce better MQ values.

[Figure: three plots for the Bunch system. “MQ versus Number of Clusters” (MQ from 0 to 4.5, 0 to 20 clusters), with two regions annotated 23% and 77%; “MQ For Random Clusters (4-8)” and “MQ For Random Clusters (11-16)”, each plotting MQ against sample (0 to 1000).]








Understanding the Search Space

There are characteristics of Bunch’s clustering algorithms that are interesting:
• It seems unusual that the clustering algorithms produce consistent MQ values given the large search space
• Other approaches [spectral methods] to solving the clustering problem using Bunch’s MQ have not produced better clustering results
• The median clustering level is a good tradeoff between cluster size and number of clusters
• Harman et al. examined using a target granularity [GECCO’02] to bias the desired cluster sizes






Investigating the Search Space

Examined multiple systems of different sizes:
• 15 open source systems developed in C, C++, or Java
• 13 randomly generated graphs with different properties that we wanted to investigate

We clustered each MDG 500 times and examined the clustering data to gain some insight into the search space.


Example: Median Clustering Level

[Figure: two plots, swing and Kerberos v.5, with an axis labeled “Cumulative MQ” and series for clustering levels L1 to L7 and the Median level.]


Example: Median Clustering Level

[Figure: two plots, telnetd and php, with an axis labeled “MQ” and series for the clustering levels (L1 to L4) and the Median level.]


Example: Median Clustering Level

[Figure: six plots for bash, mod_ssl, lynx, ping_libc, elm, and mailx. X Axis: MQ Value.]






Example: Median Clustering Level – Random Bipartite Graphs

[Figure: five plots for bip-100-1, bip-100-2, bip-100-5, bip-100-25, and bip-100-75. X Axis: MQ Value.]




Example: Median Clustering Level – Random Graphs

[Figure: five plots for rnd-100-1, rnd-100-2, rnd-100-5, rnd-100-25, and rnd-100-75. X Axis: MQ Value.]






Example: Median Clustering Level – Random “Circle” Graphs

[Figure: three plots for circle-50, circle-100, and circle-150. X Axis: MQ Value.]


MQ versus #Clusters

[Figure: ten plots for krb5, swing, telnetd, php, bash, mod_ssl, ping_libc, elm, lynx, and mailx. X Axis: #Clusters. Y Axis: MQ Value.]

MQ versus #Clusters

[Figure: eleven plots for bip-100-1, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-5, rnd-100-25, rnd-100-75, cir-50, cir-100, and cir-150. X Axis: #Clusters. Y Axis: MQ Value.]

Internal- versus External Edges

[Figure: ten plots for krb5, swing, telnetd, php, bash, mod_ssl, ping_libc, elm, lynx, and mailx. X Axis: External Edges. Y Axis: Internal Edges.]

Internal- versus External Edges

[Figure: eleven plots for bip-100-1, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-5, rnd-100-25, rnd-100-75, cir-50, cir-100, and cir-150. X Axis: External Edges. Y Axis: Internal Edges.]

Real Systems: Similarity of Clustering Results

[Figure: bar chart with two series, IntraEdge Agreement and Isomorphic Nodes. Y Axis: Percentage (0 to 100). X Axis: System (telnetd, crond, mailx, joe, dhcpd, php, elm, inn, bash, bunch, mod_ssl, lynx, swing, ping_libc, krb5).]


Random Systems: Similarity of Clustering Results

[Figure: bar chart with two series, IntraEdge Agreement and Isomorphic Nodes. Y Axis: Percentage (0 to 100). X Axis: System (bip-100-1, bip-100-2, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-2, rnd-100-5, rnd-100-25, rnd-100-75, circle-50, circle-100, circle-150).]




Real Systems: Similarity of Clustering Results

[Figure: bar chart showing IntraEdge Agreement only. Y Axis: Percentage (0 to 100). X Axis: System (telnetd, crond, mailx, joe, dhcpd, php, elm, inn, bash, bunch, mod_ssl, lynx, swing, ping_libc, krb5).]




Random Systems: Similarity of Clustering Results

[Figure: bar chart showing IntraEdge Agreement only. Y Axis: Percentage (0 to 100). X Axis: System (bip-100-1, bip-100-2, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-2, rnd-100-5, rnd-100-25, rnd-100-75, circle-50, circle-100, circle-150).]






What we Learned From Studying the Search Landscape

Not all modules are “equal” - Some modules:
• Are connected to many other modules
• Are connected to few other modules
• Have a large fan-in
• Have a large fan-out
• Are uniformly connected to other system components
• Are not uniformly connected to other system components

Some modules may have a more “natural” home than others with respect to their assigned cluster


What we Learned From Studying the Search Landscape

Bunch tends to converge to a consistent solution with respect to MQ
• There is a very low probability of finding one of these partitions by random selection
• The partitions found by Bunch are a very small subset of the overall search landscape

The degree of isomorphism in the clustering results was larger than expected






What we Learned From Studying the Search Landscape

When examining the median level of the clustering hierarchy we observed that all systems tend to converge to at most 2 levels
• The systems that we studied range from under 100 modules to several thousand modules
• The number of levels in the clustering hierarchy is bounded by $O(\log_2 N)$
• We expect that studying systems with several hundred thousand modules would produce results where the median level converges to more than 2 levels.
• We observed this in very sparse graphs (e.g., rnd-100-1 and bip-100-1)








Conclusions (1)

Understanding the search landscape is important
• A single run of Bunch is helpful, but it does not highlight modules/classes that tend to drift between clusters
• Analysis of many Bunch runs helps build a mental model of the search landscape
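One way to highlight drifting modules from many runs is to count how often each pair of modules lands in the same cluster. The sketch below is a hypothetical analysis, not a feature of Bunch described in the slides: a module is flagged as drifting when it has no partner it joins in at least a fraction high of the runs, yet shares a cluster with some module in more than a fraction low of the runs. The thresholds, the function names, and the example runs are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def co_clustering_frequency(runs):
    """Fraction of runs in which each pair of modules shares a cluster."""
    modules = sorted(runs[0])
    together = Counter()
    for cluster_of in runs:
        for a, b in combinations(modules, 2):
            if cluster_of[a] == cluster_of[b]:
                together[(a, b)] += 1
    return {pair: together[pair] / len(runs)
            for pair in combinations(modules, 2)}


def drifting_modules(runs, low=0.2, high=0.8):
    """Modules with some occasional partners but no consistent partner."""
    has_stable_partner, has_occasional_partner = set(), set()
    for (a, b), frequency in co_clustering_frequency(runs).items():
        if frequency >= high:
            has_stable_partner.update((a, b))
        elif frequency > low:
            has_occasional_partner.update((a, b))
    return has_occasional_partner - has_stable_partner


# Hypothetical results of four clustering runs over five modules.
runs = [
    {"A": 0, "B": 0, "X": 0, "C": 1, "D": 1},
    {"A": 0, "B": 0, "X": 0, "C": 1, "D": 1},
    {"A": 0, "B": 0, "C": 1, "D": 1, "X": 1},
    {"A": 0, "B": 0, "C": 1, "D": 1, "X": 1},
]
print(drifting_modules(runs))
```

In the example, A and B always appear together, as do C and D, while X joins each group in half of the runs and is the only module reported.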






Conclusions (2)

A best practice for program understanding
• Cluster a system many times in order to understand the search landscape
• Identify and separate omnipresent, library and supplier modules
• Identify modules that tend to drift between many subsystems
• Assign them to other clusters manually, or influence the clustering algorithm by adjusting the edge weights
• Bunch supports manual and semi-automatic clustering features to help with this type of analysis






Questions

Special Thanks To:
• AT&T Research
• Sun Microsystems
• DARPA
• NSF
• US Army
• SEMINAL Group









