Stochastic Evolution

Parallelization of Stochastic Evolution for Cell Placement

Khawar S. Khan
Computer Engineering MS Thesis Presentation

Committee Members: Dr. Sadiq M. Sait, Dr. Aiman El-Maleh, Dr. Mohammed Al-Suwaiyel

1

Outline
• Problem Focus
• Brief Overview of Related Concepts
• Motivation & Literature Review
• Parallelization Aspects
• Parallel Strategies: Design and Implementation
• Experiments and Results
• Comparisons
• Contributions
• Conclusion & Future Work
2

Problem Focus
• Real-world combinatorial optimization problems are complex and difficult to navigate, with vast, multi-modal search spaces.
• Iterative heuristics work well for such problems but incur heavy runtimes, e.g., Stochastic Evolution.

3

Problem Focus
• Accelerating performance: reducing runtime at consistent solution quality and/or achieving higher-quality solutions within comparable runtime.
• How? Parallel computing in a distributed environment.

4

VLSI Design Steps
CAD subproblem level         Generic CAD tools
Behavioral/Architectural     Behavioral modeling and simulation tools
Register transfer/logic      Functional and logic minimization, logic fitting and simulation tools
Cell/mask                    Tools for partitioning, placement, routing, etc.

5

Placement Problem
• The problem under investigation is the VLSI standard cell placement problem.
• Given a collection of cells or modules, placement consists of finding a suitable physical location for each cell on the layout.
• The locations must optimize given objective functions (wire-length, power, area, delay, etc.), subject to constraints imposed by the designer, the implementation process, the layout strategy or the design style.

6

Iterative heuristics
• The computational complexity increases as the module count on a chip increases.
• For example, the Alpha 21464 (EV8) "Arana" 64-bit SMT microprocessor has a transistor count of 250 million.
• With a brute-force approach, a combinatorial problem with just 250 modules requires 250 factorial (about 3.23e492) arrangements to be evaluated.
• This is where iterative heuristics come into play.
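
The size of that brute-force search space can be checked directly. The following is a minimal sketch in Python; the helper name `factorial_magnitude` is introduced here only for illustration and is not part of the presented work.

```python
import math


def factorial_magnitude(n: int) -> tuple[float, int]:
    """Return (mantissa, exponent) such that n! ~= mantissa * 10**exponent."""
    # lgamma(n + 1) equals ln(n!), so dividing by ln(10) gives log10(n!).
    log10_fact = math.lgamma(n + 1) / math.log(10)
    exponent = math.floor(log10_fact)
    mantissa = 10 ** (log10_fact - exponent)
    return mantissa, exponent


mantissa, exponent = factorial_magnitude(250)
print(f"250! is about {mantissa:.2f}e{exponent}")
```

Working with the logarithm avoids building the 493-digit integer, although Python's `math.factorial(250)` would also handle it exactly.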

7

Iterative heuristics
• Iterative heuristics have proven remarkably effective for many complex NP-hard optimization problems in industrial applications:
  • Placement
  • Partitioning
  • Network optimization
  • Non-linear systems’ simulation

8

Iterative heuristics

• Characteristics
  • Conceptually simple
  • Robust towards real-life complexities
  • Well suited for complex decision-support applications
  • May be interrupted virtually at any time
  • Storage of generated solutions

9

Iterative heuristics
• The following five dominant algorithms are instances of general iterative non-deterministic algorithms:
  • Genetic Algorithm (GA),
  • Tabu Search (TS),
  • Simulated Annealing (SA),
  • Simulated Evolution (SimE), and
  • Stochastic Evolution (StocE).
• In our research, Stochastic Evolution is used for VLSI standard cell placement.

10

Stochastic Evolution (StocE)
(Historical Background)

• Stochastic Evolution (StocE) is a powerful, general, randomized iterative heuristic for solving combinatorial optimization problems.
• The first paper describing Stochastic Evolution, by Youssef Saab, appeared in 1989.

11

Stochastic Evolution
(Characteristics)

• It is stochastic because the decision to accept a move is probabilistic.
• Good moves are accepted with probability one, and bad moves may also be accepted with a non-zero probability.
• Hill-climbing property: accepting some bad moves lets the search escape local optima.
• Searches for solutions within the constraints while optimizing the objective function.
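
The acceptance behaviour described above can be sketched as follows. This assumes the commonly described form of the StocE rule, in which the gain of a move (cost before minus cost after) is compared with a random integer drawn from [-p, 0]; integer gains and the function names are assumptions made for illustration, not details confirmed by the slides.

```python
import random


def accept_move(gain: int, p: int, rng: random.Random) -> bool:
    """Accept a move if its gain exceeds a random integer drawn from [-p, 0]."""
    threshold = rng.randint(-p, 0)  # assumed form of the random threshold
    return gain > threshold


def acceptance_probability(gain: int, p: int) -> float:
    """Exact probability that accept_move returns True for an integer gain."""
    # The threshold takes each of the p + 1 values -p, ..., 0 with equal probability.
    favourable = sum(1 for threshold in range(-p, 1) if gain > threshold)
    return favourable / (p + 1)


for gain in (1, 0, -2, -4):
    print(gain, acceptance_probability(gain, p=4))
```

Under this assumed rule a move with positive gain is always accepted, while a bad move is accepted only if its loss is smaller than p, with a probability that shrinks as the loss grows.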

12

Stochastic Evolution
(Algorithm)

• Inputs to the algorithm:
  • Initial valid solution,
  • Initial range variable p0, and
  • Termination parameter R.
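
To show how these three inputs interact, here is a minimal sketch of a StocE-style loop in Python. It follows the commonly described structure of the heuristic (a PERTURB sweep, an update of p, and a counter that is rewarded on improvement), not necessarily the exact implementation presented here. The swap move, the rule "increase p by one when the cost stalls", and the toy cost function `line_wirelength` are all assumptions introduced for illustration.

```python
import random
from typing import Callable, List


def line_wirelength(slots: List[int]) -> int:
    """Toy cost: cell i and cell i+1 share a net; slots[i] is the slot of cell i."""
    return sum(abs(a - b) for a, b in zip(slots, slots[1:]))


def stochastic_evolution(initial: List[int], cost: Callable[[List[int]], float],
                         p0: int, R: int, seed: int = 0) -> List[int]:
    rng = random.Random(seed)
    current = list(initial)
    cur_cost = cost(current)
    best, best_cost = list(current), cur_cost
    p, rho = p0, 0
    while rho <= R:                      # stop once the counter exceeds R
        prev_cost = cur_cost
        # PERTURB: try one swap per position, accept by the stochastic rule.
        for i in range(len(current)):
            j = rng.randrange(len(current))
            trial = list(current)
            trial[i], trial[j] = trial[j], trial[i]
            trial_cost = cost(trial)
            gain = cur_cost - trial_cost
            if gain > rng.randint(-p, 0):
                current, cur_cost = trial, trial_cost
        # UPDATE (assumed rule): widen the range if the cost stalled, else reset.
        p = p + 1 if cur_cost == prev_cost else p0
        if cur_cost < best_cost:
            best, best_cost = list(current), cur_cost
            rho -= R                     # reward an improvement with more iterations
        else:
            rho += 1
    return best


start = [3, 0, 4, 1, 5, 2]
result = stochastic_evolution(start, line_wirelength, p0=1, R=5)
print(line_wirelength(start), "->", line_wirelength(result))
```

The loop always terminates: the best cost can only improve a finite number of times, and every non-improving iteration moves the counter towards R.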

13

Stochastic Evolution
(Algorithm)

14

Stochastic Evolution
(Application)
• Applied to the VLSI placement problem.
• Cost function
  • A multi-objective function that calculates the wire-length, power and delay.
  • The three objectives are combined and represented as one objective through fuzzy calculations.

15

Stochastic Evolution
(Cost Functions)
 Wire-length Calculation

16

Stochastic Evolution
(Cost Functions)
 Power Calculation

 Delay Calculation

17

Stochastic Evolution
(Cost Functions)

 Fuzzy Cost Calculation

The fuzzy cost is the membership of solution x in the fuzzy set of acceptable solutions. It is computed from the membership values of x in the fuzzy sets of acceptable power, delay and wire-length (indexed j = p, d, l), together with a constant in the range [0,1].
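
The formula itself did not survive on this slide. One operator often used for this kind of fuzzy aggregation is an "and-like" ordered weighted average, which blends the minimum of the three memberships with their mean. The sketch below assumes that form; the function name, the argument names and the use of `beta` for the constant are hypothetical, and the presented work may combine the memberships differently.

```python
def fuzzy_membership(mu_p: float, mu_d: float, mu_l: float, beta: float) -> float:
    """Assumed and-like aggregation: beta * min + (1 - beta) * mean."""
    memberships = (mu_p, mu_d, mu_l)
    if not all(0.0 <= m <= 1.0 for m in memberships):
        raise ValueError("memberships must lie in [0, 1]")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    mean = sum(memberships) / len(memberships)
    return beta * min(memberships) + (1.0 - beta) * mean


# beta = 1 behaves like a strict AND (minimum); beta = 0 is a plain average.
print(fuzzy_membership(0.2, 0.5, 0.8, beta=1.0))
print(fuzzy_membership(0.2, 0.5, 0.8, beta=0.0))
```

With such an operator a single scalar in [0,1] summarizes the three objectives, which is what lets a single-objective heuristic like StocE be applied unchanged.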

18

Motivation & Literature Review


• With the proliferation of parallel computers, powerful workstations, and fast communication networks, parallel implementations of meta-heuristics appear quite naturally as an alternative to modifying the algorithm itself for speedup.
• Compared with their sequential counterparts, parallel implementations allow larger problem instances to be solved and improved solutions to be found in less time.



19

Motivation & Literature Review
• Advantages of parallelization
  • Faster runtimes,
  • Solving larger problem sizes,
  • Better quality, and
  • Cost-effective technology.

20

Motivation & Literature Review
• Efforts have been made to parallelize certain heuristics for cell placement, and these produced good results, but as far as we know no attempts have been made for Stochastic Evolution.
• Parallelization of Simulated Annealing
  1. E.H.L. Aarts and J. Korst. Simulated annealing and Boltzmann machines. Wiley, 1989.
  2. R. Azencott. Simulated annealing: Parallelization techniques. Wiley, 1992.
  3. D.R. Greening. Parallel simulated annealing techniques, Physica.

21

Motivation & Literature Review


• Parallelization of Simulated Evolution
  1. Erick Cantú-Paz. Markov chain models of parallel genetic algorithms. IEEE Transactions on Evolutionary Computation, 2000.
  2. Johan Berntsson and Maolin Tang. A convergence model for asynchronous parallel genetic algorithms. IEEE, 2003.
  3. Lucas A. Wilson, Michelle D. Moore, Jason P. Picarazzi, and Simon D. San Miquel. Parallel genetic algorithm for search and constrained multi-objective optimization, 2004.



• Parallelization of Tabu Search
  1. S. M. Sait, M. R. Minhas, and J. A. Khan. Performance and low-power driven VLSI standard cell placement using tabu search, 2002.
  2. Hiroyuki Mori. Application of parallel tabu search to distribution network expansion planning with distributed generation, 2003.
  3. Michael Ng. A parallel tabu search heuristic for clustering data sets, 2003.
22

Motivation & Literature Review
• A literature survey of parallel Stochastic Evolution reveals the absence of any research efforts in this direction.
• This lack of research presented both a challenging task, that of implementing any parallelization scheme for Stochastic Evolution, and vast room for experimentation and evaluation.
• Parallel models adopted for other iterative heuristics were also studied and analyzed.

23

Sequential Flow Analysis
• To proceed with StocE parallelization, the sequential implementation of Stochastic Evolution was first analyzed.
• The analysis was carried out by profiling the sequential code with the Linux gprof tool.

24

Sequential Flow Analysis
 The result of sequential code analysis

25

Sequential Flow Analysis
Figure: Sequential code analysis. Percentage of total execution time spent in the time-intensive modules (cost calculation vs. other functions) for circuits s3330, s5378, s9234, s15850, s35932 and s38417.

26

Parallelization Issues
• The three possible domains for parallelizing any heuristic are as follows:
  • Low-level parallelization: the operations within an iteration are parallelized.
  • Domain decomposition: the search space (problem domain) is divided and assigned to different processors.
  • Parallel search: multiple concurrent explorations of the solution space using search threads with various degrees of synchronization or information exchange.


27

Parallelization Issues
• Based on the profiling results, a simple approach would thus be to parallelize the cost functions, i.e., a low-level parallelization strategy.
• In a distributed computation environment, however, communication carries a high cost.

28

Parallelization Issues
• For Simulated Annealing, low-level parallelization gave poor speedups, i.e., a maximum speedup of three with eight processors.
• StocE also invokes the cost function calculations after each swap in the PERTURB function, which makes it computationally intensive like Simulated Annealing.
• Thus the approach may work well for StocE on shared-memory architectures but is not at all well suited to a distributed environment.
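
To put the quoted Simulated Annealing figure in perspective, speedup and parallel efficiency can be computed as in the short sketch below. The definitions are the standard ones; the function names and the runtimes in the first call are hypothetical values chosen only to produce a speedup of three.

```python
def speedup(sequential_time: float, parallel_time: float) -> float:
    """Speedup = sequential runtime divided by parallel runtime."""
    return sequential_time / parallel_time


def efficiency(speedup_value: float, processors: int) -> float:
    """Efficiency = speedup per processor (1.0 would be ideal)."""
    return speedup_value / processors


s = speedup(240.0, 80.0)          # hypothetical runtimes in seconds
print(s, efficiency(s, 8))        # a speedup of 3 on 8 processors
```

A speedup of three on eight processors corresponds to an efficiency of 0.375, i.e., more than 60% of the added computing capacity is lost, mostly to communication in a distributed setting.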

29

Parallelization Issues

• For StocE parallelization, all of the following parallelization categories were evaluated while designing the parallel strategies:
  • Low-Level Parallelization,
  • Domain Decomposition, and
  • Multithreaded or Parallel Search.

30

Parallelization Issues
• Thorough analysis of StocE's sequential flow combined with the profiling results led to the following conclusions.
• Any strategy designed to parallelize StocE in a distributed computing environment should address the following issues:
  • Divide the workload while keeping the algorithm's sequential flow intact,
  • Keep communication overhead minimal,
  • Continuously remedy the errors introduced by parallelization, and
  • Avoid low-level or fine-grained parallelization.

31

Parallel strategies: Design and Implementation

• Broad classification of the designed parallel models:
  • Asynchronous Multiple Markov Chains (AMMC)
  • Row Division
    • Fixed Pattern Row Division
    • Random Row Division

32

Asynchronous Markov Chain (AMC)
• A randomized local search technique that operates on a state space. The search proceeds step by step, moving from a certain configuration (state) Si to its neighbor Sj with a certain probability Prob(Si,Sj), denoted by pij.
• The AMC approach is an example of the Parallel Search strategy.

33

Asynchronous Markov Chain (Working)
• A managing node or server maintains the best cost and placement.
• At periodic intervals, processors query the server. If their current placement is better than that of the server, they export their solution to it; otherwise they import the server's placement.
• This removes the need for expensive synchronization across all processors.
• The managing node can either share the computing load by running its own search process or be restricted to serving queries.
• For a very small number of processors, the server may also take part in the search alongside the clients, but in a scalable design the server is better off servicing queries only.
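
The export/import exchange described above can be sketched as a single server-side operation. The sketch below is plain Python without any message passing; the class name `Server`, the method `exchange`, and the convention that a lower cost means a better placement are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Server:
    """Holds the best cost and placement seen so far by any client."""
    best_cost: float
    best_placement: List[int]

    def exchange(self, cost: float, placement: List[int]) -> Tuple[float, List[int]]:
        # Client is better: it exports its solution to the server.
        if cost < self.best_cost:
            self.best_cost = cost
            self.best_placement = list(placement)
        # In either case the client continues from the server's best placement.
        return self.best_cost, list(self.best_placement)


server = Server(best_cost=10.0, best_placement=[0, 1, 2])
print(server.exchange(8.0, [2, 1, 0]))   # client exports
print(server.exchange(9.0, [1, 0, 2]))   # client imports
```

Because each client talks to the server independently and at its own pace, no barrier across all processors is needed, which is the point of the asynchronous design.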

34

Asynchronous Markov Chain (AMC)
Master Process

35

Asynchronous Markov Chain (AMC)
Slave Process

36

Fixed Pattern Row Division

• A randomized search technique that divides the search work among the candidate processors.
• Row Division is an example of the Domain Decomposition strategy.

37

Fixed Pattern Row Division
• More promising than AMC since it reduces the effective load on each working node.
• Fair distribution of rows among processors.
• Each processor is assigned two sets of rows and is responsible for swapping cells among them.
• The sets of rows alternate in every iteration.
• Little communication overhead, since the processors do not need to exchange information or synchronize during iterations.
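
The alternating row sets can be sketched as follows. This assumes that the two fixed patterns are contiguous blocks of rows in one iteration and interleaved rows in the next, which is one reading of the 12-row, 3-processor figures; the function names and the even/odd alternation are assumptions for illustration.

```python
from typing import List


def contiguous_rows(num_rows: int, num_procs: int) -> List[List[int]]:
    """Split rows 1..num_rows into consecutive blocks, as evenly as possible."""
    base, extra = divmod(num_rows, num_procs)
    sets, start = [], 1
    for proc in range(num_procs):
        size = base + (1 if proc < extra else 0)
        sets.append(list(range(start, start + size)))
        start += size
    return sets


def interleaved_rows(num_rows: int, num_procs: int) -> List[List[int]]:
    """Give processor k the rows k+1, k+1+num_procs, k+1+2*num_procs, ..."""
    return [list(range(proc + 1, num_rows + 1, num_procs)) for proc in range(num_procs)]


def rows_for_iteration(iteration: int, num_rows: int, num_procs: int) -> List[List[int]]:
    """Alternate between the two fixed patterns on successive iterations."""
    if iteration % 2 == 0:
        return contiguous_rows(num_rows, num_procs)
    return interleaved_rows(num_rows, num_procs)


print(rows_for_iteration(0, 12, 3))
print(rows_for_iteration(1, 12, 3))
```

Since both patterns are fixed in advance, every processor can derive its own rows from the iteration number, so no row information has to be communicated.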

38

Fixed Pattern Row Division
Iteration - i Iteration - i+1

39

Fixed Pattern Row Division
Figure: assignment of rows R1 to R12 to processors P1, P2 and P3 in iteration i.
40

Fixed Pattern Row Division
Figure: assignment of rows R1 to R12 to processors P1, P2 and P3 in iteration i+1.
41

Fixed Pattern Row Division
Figure: iteration i+1, with rows interleaved across processors. P1: R1, R4, R7, R10; P2: R2, R5, R8, R11; P3: R3, R6, R9, R12.
42

Fixed Pattern Row Division
Master Process

43

Fixed Pattern Row Division
Slave Process

44

Random Row Division
• Variation of Fixed Pattern Row Division.
• Instead of using two fixed sets of non-overlapping rows, the master processor generates the non-overlapping row sets randomly.
• The row sets are broadcast to the slaves in each iteration.
• The apparent advantage of this scheme over the previous one is the randomness in the rows, which ensures that no row remains with any specific processor throughout the search process.
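
A minimal sketch of how a master might generate such random, non-overlapping row sets is given below. The shuffle-and-deal approach, the function name and the seeded generator are assumptions for illustration; the broadcast to the slaves is not shown.

```python
import random
from typing import List


def random_row_division(num_rows: int, num_procs: int, rng: random.Random) -> List[List[int]]:
    """Shuffle rows 1..num_rows and deal them out to the processors."""
    rows = list(range(1, num_rows + 1))
    rng.shuffle(rows)
    # Every row is dealt exactly once, so the sets cannot overlap.
    return [sorted(rows[proc::num_procs]) for proc in range(num_procs)]


rng = random.Random(42)
for iteration in range(2):
    print(iteration, random_row_division(12, 3, rng))
```

A fresh division is drawn in every iteration, so the set of rows a processor works on keeps changing over the course of the search.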

45

Random Row Division
Master Process

46

Random Row Division
Slave Process

47

Experiments and Results
• The parallel implementation was tested on a range of ISCAS-89 benchmark circuits.
• These benchmark circuits cover a set of circuits of varying sizes, in terms of number of gates and paths.
• Cluster specs:
  • Eight generic workstation nodes,
  • Intel x86 3.20 GHz; 512 MB DDR RAM,
  • Cisco 3550 switch for cluster interconnect,
  • Linux kernel 2.6.9, and
  • MPICH ver 1.7 (MPI implementation from Argonne National Laboratory).
48

Experiments and Results

49

Experiments and Results
• Parallel Asynchronous Markov Chain Strategy

50

Experiments and Results
• Parallel Asynchronous Markov Chain Strategy

51

Discussion
• The AMC approach worked well with Simulated Annealing, but its results with StocE are very limited.
• Reasons:
  • For Simulated Annealing:
    • The acceptance rate is decided by a predictably varying value of temperature.
    • The sequential algorithm itself is not intelligent enough to focus the search in neighborhoods of best solutions.

52

Discussion


• For Stochastic Evolution:
  • The acceptance probability for solutions depends on the parameter p0, which varies unpredictably.
  • The sequential algorithm itself is intelligent enough to focus the search in neighborhoods of best solutions, and thus it does not gain much from collaborative search efforts.

53

Experiments and Results
• Fixed Row Division Strategy

54

Experiments and Results
• Fixed Row Division Strategy

55

StocE - Fixed Row Division (s38417)

56

StocE - Fixed Row Division (s35932)

57

Discussion
• The direct workload distribution was the primary reason behind the favourable speedup trends seen with this strategy.
• Very good speedups were achieved for large circuits with a large number of rows.
• The strategy, however, does not perform well for smaller circuits, as the runtime gains achieved by dividing the computation quickly saturate because of the smaller number of rows.

58

Experiments and Results
• Random Row Division Strategy

59

Experiments and Results
• Random Row Division Strategy

60

Discussion
• Random row division further improved the speedup compared with Fixed Pattern row division.
• The speedup increases with random row division because the probability of a cell moving to any location in the solution becomes non-zero in the first iteration, whereas fixed row division needs two iterations to achieve this non-zero probability. This reduces the overall runtime.

61

Comparisons
• Parallel implementations of StocE were compared with those of SA and SimE.
• Comparisons are made
  • For the best fitness values achieved by StocE,
  • Among the best strategies that were implemented for the individual heuristics, and
  • Among the same parallel strategies implemented.
• Results are specific to the parallel environment and problem instances.

62

Comparisons
• Comparisons have been made across
  • StocE - Random Row Division
  • StocE - Fixed Row Division
  • Simulated Annealing - AMMC Approach
  • Simulated Annealing - Fixed Row Division
  • Simulated Evolution - Random Row Division

63

Comparisons
• Simulated Annealing - AMC Strategy


64

Comparisons
• Simulated Annealing - AMC Strategy
Figure: Speedup vs. number of processors (3 to 7) for circuits s1494, s3330, s5378, s9234 and s15850.

65

Comparisons
• Simulated Annealing – Fixed Row-Division Strategy

66

Comparisons
• Simulated Annealing – Fixed Row Division Strategy

Figure: Speedup vs. number of processors (2 to 6) for circuits s5378, s9234 and s15850.

67

Comparisons


• Simulated Evolution – Random Row Division Strategy

68

Comparisons
• Simulated Evolution – Random Row Division Strategy

Figure: Speedup vs. number of processors (2 to 5) for circuits s1494, s3330, s5378 and s9234.

69

Comparisons (s15850)


70

Comparisons (s9234)


71

Comparisons (s5378)


72

Comparisons (s3330)


73

Comparisons (s1494)


74

Discussion
• StocE - Random Row Division was found to be the best among all the strategies in terms of run-time reduction and useful speedups.
• StocE - Fixed Row Division performs better as the circuit size increases. For smaller circuits, it hits the saturation point much earlier.
• SA - AMC outperforms SA - Fixed Row Division as the circuit size increases.
• SimE - Random Row Division failed to perform for multi-objective optimization, although the same algorithm has been reported to work well for single-objective optimization.
75

Contributions
1. Studied and analyzed all three possible parallelization models for StocE.
  • Low-level parallelization of StocE appeared to be an ineffective approach given the distributed computation environment.
  • A parallel search model was designed and implemented for StocE as the AMC approach, which was found to give very limited speedups.
  • Domain decomposition: two strategies were designed and implemented. Both give excellent performance when compared with the parallel implementations of other heuristics.

76

Contributions
2. The StocE-AMC approach reported runtime gains for very few processors, whereas the same parallelization scheme is reported to work well with Simulated Annealing.
  • Results were analyzed and justified.

3. The row-based division method distributed the workload effectively, allowing very good speedups for large circuits with a large number of rows.

77

Contributions
4. Row-based division, however, does not perform well for smaller circuits, as the runtime gains achieved by dividing the computation quickly saturate.

5. Row-based division was further enhanced by modifying the row distribution method, which further reduced the run-times.

6. The run-times achieved were compared against those of Simulated Annealing and Simulated Evolution and were found to be the lowest among the three heuristics.

78

Conclusion & Future Work
• This research work focused on
  • Improving the run-times of Stochastic Evolution applied to VLSI cell placement
  • Comparing the results with the minimum run-times achieved with other iterative heuristics.
• Three different parallelization schemes were applied to Stochastic Evolution:
  • Asynchronous Markov Chains (AMC),
  • Fixed Row-Division, and
  • Random Row-Division.

79

Conclusion & Future Work
• The following classes of strategies can be the focus of future work:
  • Variants of the ones reported,
  • New strategies, and
  • Hybrid models.
• Combining the characteristics of StocE with Tabu Search’s memory components may very well lead to even further runtime reduction and speedup.

80

Thank You

81

Stochastic Evolution
(Algorithm)

82

Stochastic Evolution
(Algorithm)

83

ISCAS89 Benchmark Circuits

84