Parallelization of Stochastic Evolution for Cell Placement

Khawar S. Khan
Computer Engineering
MS Thesis Presentation

Committee Members:
Dr. Sadiq M. Sait
Dr. Aiman El-Maleh
Dr. Mohammed Al-Suwaiyel


                                                    1
Outline
 Problem Focus
 Brief Overview of Related Concepts
 Motivation & Literature Review
 Parallelization Aspects
 Parallel strategies: Design and Implementation
 Experiments and results
 Comparisons
 Contributions
 Conclusion & Future work

                                                   2
Problem Focus

 Real-world combinatorial optimization problems are complex and
  have vast, multi-modal, difficult-to-navigate search spaces.

 Iterative Heuristics work well for such problems but incur heavy
  runtimes.
                    E.g., Stochastic Evolution




                                                                    3
Problem Focus

 Accelerating performance – reducing runtime while maintaining
  consistent quality, and/or achieving higher-quality solutions
  within comparable runtime.
                        HOW?
      Parallel Computing in a Distributed
                     Environment




                                                 4
  VLSI Design Steps

CAD subproblem level       Generic CAD tools

Behavioral/Architectural   Behavioral modeling and simulation tools

Register transfer/logic    Functional and logic minimization, logic
                           fitting, and simulation tools

Cell/mask                  Tools for partitioning, placement,
                           routing, etc.




                                                         5
Placement Problem

 The problem under investigation is the VLSI standard cell
  placement problem

 Given a collection of cells or modules, the process of placement
  consists of finding suitable physical locations for each cell on the
  entire layout

 Finding locations that optimize given objective functions (wire-
  length, power, area, delay, etc.), subject to certain constraints
  imposed by the designer, the implementation process, the layout
  strategy, or the design style



                                                                     6
Iterative heuristics

 The computational complexity increases as the module count on the
   chip increases.
     For example, the Alpha 21464 (EV8) "Arana" 64-bit SMT
       microprocessor has a transistor count of 250 million.

 Considering brute-force solutions, a combinatorial problem
   with just 250 modules requires 250 factorial (about 3.23e492)
   orderings to be evaluated.

 This is where the iterative heuristics come into the play.



                                                               7
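The factorial growth quoted above is easy to verify directly; a quick sketch using Python's arbitrary-precision integers:

```python
import math

# Number of orderings a brute-force search over 250 modules must consider.
combinations = math.factorial(250)

print(len(str(combinations)))  # → 493 decimal digits, i.e. about 3.23e492
```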
Iterative heuristics

 Iterative heuristics have proven remarkably effective for many
   complex NP-hard optimization problems in industrial
   applications
       Placement
       Partitioning
       Network optimization
       Non-linear systems’ simulation




                                                                   8
Iterative heuristics


 Characteristics
      Conceptually simple
      Robust towards real-life complexities
      Well suited for complex decision-support applications
      May be interrupted virtually at any time
      Generated solutions can be stored and reused




                                                               9
Iterative heuristics
 The following five dominant algorithms are instances of
  general iterative non-deterministic algorithms:
      Genetic Algorithm (GA),
      Tabu Search (TS),
      Simulated Annealing (SA),
      Simulated Evolution (SimE), and
      Stochastic Evolution (StocE).

 In our research, Stochastic Evolution is used for VLSI
  standard cell placement.



                                                                10
Stochastic Evolution (StocE)
(Historical Background)


 Stochastic Evolution (StocE) is a powerful general and
  randomized iterative heuristic for solving combinatorial
  optimization problems.

 The first paper describing Stochastic Evolution, by Youssef
  Saab, appeared in 1989.




                                                                11
Stochastic Evolution
(Characteristics)


 It is stochastic because the decision to accept a move is
    probabilistic:
      Good moves are accepted with probability one, and bad
       moves may also be accepted with a non-zero probability.

 Hill-climbing property.


 Searches for solutions within the constraints while optimizing the
   objective function.




                                                                  12
Stochastic Evolution
(Algorithm)




 Inputs to the algorithm:
    Initial valid solution,
    Initial Range variable po, and
    Termination parameter R.




                                      13
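Given those inputs, the StocE main loop can be sketched as follows. This is a minimal illustration of Saab's scheme: `toy_cost` and `perturb` stand in for the real problem-specific cost model and PERTURB routine, here reduced to ordering a small permutation.

```python
import random

def toy_cost(S):
    # Toy placement cost: number of out-of-order adjacent pairs (0 = sorted).
    return sum(1 for a, b in zip(S, S[1:]) if a > b)

def perturb(S, p):
    # Toy PERTURB: scan adjacent pairs; accept a swap if its gain in cost
    # exceeds a random draw from [-p, 0], so neutral/bad moves can pass.
    S = list(S)
    for i in range(len(S) - 1):
        before = toy_cost(S)
        S[i], S[i + 1] = S[i + 1], S[i]
        if before - toy_cost(S) <= random.randint(-p, 0):
            S[i], S[i + 1] = S[i + 1], S[i]  # reject: undo the move
    return S

def stoc_evol(cost, perturb_fn, S0, p0, R):
    """Skeleton of Stochastic Evolution: initial solution S0,
    initial range variable p0, termination parameter R."""
    S = best = S0
    p, rho = p0, 0
    while rho <= R:
        prev_cost = cost(S)
        S = perturb_fn(S, p)
        cur_cost = cost(S)
        # UPDATE: widen the acceptance range on stagnation, else reset it.
        p = p + 1 if cur_cost == prev_cost else p0
        if cur_cost < cost(best):
            best, rho = S, rho - R   # reward an improvement: search longer
        else:
            rho += 1
    return best

random.seed(1)
result = stoc_evol(toy_cost, perturb, [3, 1, 4, 2, 0], p0=0, R=10)
# result is a permutation of the input whose cost is never worse than S0's
```

The counter `rho` is decremented by R on every improvement, so the search is automatically extended while progress is being made and stops after R consecutive non-improving iterations.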
Stochastic Evolution
(Algorithm)




                       14
Stochastic Evolution
(Application)


 Applied on VLSI placement problem.


 Cost Function
      A multi-objective function that calculates the wire-length,
       power and delay.
      The three objectives are combined and represented as one
       objective through fuzzy calculations.




                                                                     15
Stochastic Evolution
(Cost Functions)


 Wire-length Calculation




                            16
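The wire-length equation on this slide is an image in the original. A common estimate in standard-cell placement is the half-perimeter bounding box (HPWL) summed over all nets; a sketch under that assumption (the net lists and coordinates below are illustrative):

```python
def hpwl(nets, pos):
    """Half-perimeter wire-length estimate.

    nets -- list of nets, each a list of cell names
    pos  -- dict mapping cell name to (x, y) placement coordinates
    """
    total = 0
    for net in nets:
        xs = [pos[c][0] for c in net]
        ys = [pos[c][1] for c in net]
        # Half-perimeter of the net's bounding box.
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

pos = {"a": (0, 0), "b": (3, 1), "c": (1, 4)}
print(hpwl([["a", "b"], ["a", "b", "c"]], pos))  # → 4 + 7 = 11
```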
Stochastic Evolution
(Cost Functions)

 Power Calculation




 Delay Calculation




                       17
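The power and delay equations here are likewise images. Common textbook forms are dynamic power proportional to the sum of switching activity times driven capacitance per net, and delay as the longest source-to-sink path through cell and interconnect delays; a toy sketch under those assumptions (the exact models used in the thesis may differ):

```python
def power_estimate(switching, cap):
    # Dynamic-power proxy: sum of switching activity x capacitance per net.
    return sum(s * c for s, c in zip(switching, cap))

def path_delay(edges, source, sink):
    # Longest-path delay through a DAG of (cell + interconnect) delays.
    # edges: dict mapping node -> list of (successor, delay) pairs.
    memo = {}
    def longest(n):
        if n == sink:
            return 0.0
        if n not in memo:
            memo[n] = max((d + longest(m) for m, d in edges.get(n, [])),
                          default=float("-inf"))
        return memo[n]
    return longest(source)

print(power_estimate([0.5, 0.2], [2.0, 1.0]))  # → 1.2
print(path_delay({"s": [("a", 1.0), ("b", 2.0)],
                  "a": [("t", 3.0)],
                  "b": [("t", 1.0)]}, "s", "t"))  # → 4.0 (via s-a-t)
```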
Stochastic Evolution
(Cost Functions)


 Fuzzy Cost Calculation




        μ(x) is the membership of solution x in the fuzzy set of acceptable solutions.

        μj(x), for j = p, d, l, are the membership values in the fuzzy sets of
        acceptable power, delay, and wire-length, respectively.

        β is a constant in the range [0,1].




                                                                                18
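In related work by Sait et al., the combined membership is often an ordered-weighted-average of the individual memberships, e.g. μ(x) = β·min_j μj(x) + (1−β)·(1/3)·Σj μj(x) for j ∈ {p, d, l}. A sketch assuming that formulation (the exact fuzzy operator used in the thesis may differ):

```python
def fuzzy_cost(mu_p, mu_d, mu_l, beta):
    """OWA-style fuzzy combination of power, delay, and wire-length
    memberships. Each mu_j is in [0, 1]; beta in [0, 1] balances the
    pessimistic min term against the plain average."""
    mus = (mu_p, mu_d, mu_l)
    return beta * min(mus) + (1 - beta) * sum(mus) / 3

print(fuzzy_cost(0.8, 0.6, 0.9, beta=0.7))  # ≈ 0.65
```

Higher membership means a more acceptable solution, so the search maximizes μ(x).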
Motivation & Literature Review

   With the proliferation of parallel computers, powerful
    workstations, and fast communication networks, parallel
    implementations of meta-heuristics arise quite naturally as
    an alternative to modifying the algorithm itself for speedup.

   Parallel implementations allow solving large problem
    instances and finding improved solutions in less time than
    their sequential counterparts.




                                                                       19
Motivation & Literature Review

 Advantages of Parallelization


      Faster Runtimes,
      Solving Large Problem Sizes,
      Better quality, and
      Cost Effective Technology.




                                      20
Motivation & Literature Review

 Efforts to parallelize certain heuristics for cell placement have
   produced good results, but as far as we know no attempt had been
   made for Stochastic Evolution

       Parallelization of Simulated Annealing
         1. E.H.L. Aarts and J. Korst. Simulated annealing and
            Boltzmann machines. Wiley, 1989.
         2. R. Azencott. Simulated annealing: Parallelization
            techniques. Wiley, 1992.
         3. D.R. Greening. Parallel simulated annealing
            techniques, Physica.


                                                                    21
Motivation & Literature Review
     Parallelization of Genetic Algorithms
       1.   Erick Cantú-Paz. Markov chain models of parallel genetic
            algorithms. IEEE Transactions on Evolutionary Computation,
            2000.
       2.   Johan Berntsson and Maolin Tang. A convergence model for
            asynchronous parallel genetic algorithms. IEEE, 2003.
       3.   Lucas A. Wilson, Michelle D. Moore, Jason P. Picarazzi, and
            Simon D. San Miguel. Parallel genetic algorithm for search and
            constrained multi-objective optimization, 2004.

    Parallelization of Tabu Search
      1.   S. M. Sait, M. R. Minhas, and J. A. Khan. Performance and low-
           power driven VLSI standard cell placement using tabu search,
           2002.
      2.   Hiroyuki Mori. Application of parallel tabu search to distribution
           network expansion planning with distributed generation, 2003.
      3.   Michael Ng. A parallel tabu search heuristic for clustering data
           sets, 2003.

                                                                                22
Motivation & Literature Review

 Literature survey of parallel Stochastic Evolution reveals the
   absence of any research efforts in this direction.

 This lack of prior research presented both the challenge of
   implementing a first parallelization scheme for Stochastic
   Evolution and vast room for experimentation and evaluation.

 Parallel models adopted for other iterative heuristics were also
   studied and analyzed.




                                                                     23
Sequential Flow Analysis

 To proceed with StocE parallelization, the sequential
   implementation of Stochastic Evolution was first analyzed.



 The analysis was carried out by profiling the sequential
   implementation with the Linux gprof profiler.




                                                                    24
Sequential Flow Analysis
 The result of sequential code analysis




                                           25
Sequential Flow Analysis
[Bar chart: Sequential Code Analysis — percentage of total execution time
spent in cost calculation vs. other functions, for circuits s3330, s5378,
s9234, s15850, s35932, s38417]
                                                                                                           26
Parallelization Issues
 The three possible domains for any heuristic parallelization are
   as follows:
     Low level parallelization
            The operations within an iteration can be parallelized.
       Domain Decomposition
            The search space (problem domain) is divided and
             assigned to different processors.
       Parallel Search
            Multiple concurrent exploration of the solution space using
             search threads with various degrees of synchronization or
             information exchange.




                                                                           27
Parallelization Issues

 Based on the profiling results, a simple approach would thus be to
  parallelize the cost functions, i.e., a low-level parallelization
  strategy.


 In distributed computation environment, communication carries
  a high cost.




                                                                    28
Parallelization Issues
 For Simulated Annealing, low-level parallelization gave poor
   speedups, i.e., maximum speedup of three with eight
   processors.

 StocE also invokes the cost function calculations after each
   swap in PERTURB function which makes it computationally
   intensive like Simulated Annealing.

 Thus the approach may work well for StocE on shared-memory
   architectures, but it is not well suited to a distributed
   environment.



                                                                 29
Parallelization Issues


 For StocE parallelization, all the following parallelization
   categories were evaluated while designing the parallel
   strategies
       Low-Level Parallelization,
       Domain Decomposition, and
       Multithreaded or Parallel Search.




                                                                 30
Parallelization Issues

 Thorough analysis of StocE's sequential flow combined with the
   profiling results led to the following conclusions.

 Any strategy designed to parallelize StocE in a distributed
   computing environment should address the following issues:
        Divide the workload while keeping the algorithm's sequential flow
         intact,
        Keep communication overhead minimal,
        Continuously remedy the errors introduced by parallelization, and
        Avoid low-level or fine-grained parallelization.




                                                                           31
Parallel strategies: Design and
Implementation


 Broadly classifying the designed parallel models
    Asynchronous Multiple Markov Chains (AMMC)
    Row Division
         Fixed Pattern Row Division
         Random Row Division




                                                     32
Asynchronous Markov Chain (AMC)

 A randomized local search technique that operates on a state
  space. The search proceeds step by step, moving from a
  configuration (state) Si to a neighbor Sj with a certain
  probability Prob(Si, Sj), denoted by pij.



 AMC approach is an example of Parallel Search strategy.




                                                                        33
Asynchronous Markov Chain (Working)

 A managing node or server maintains the best cost and placement.

 At periodic intervals, processors query the server and if their current
   placement is better than that of the server, they export their solution to
   it, otherwise they import the server's placement.

 This removes the need for expensive synchronization across all
   processors.

 The managing node can either share the computing load by running its
   own search process or be restricted to servicing queries; for a very
   small number of processors the former works, but in a scalable design
   the server is better off servicing queries only.


                                                                                34
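The query-and-exchange protocol described above can be simulated without MPI. The sketch below uses hypothetical `Server`/`client_search` names and a toy cost; the 'server' holds the incumbent best, and each 'client' periodically reconciles against it (in the actual implementation this exchange would be MPI messages):

```python
import random

class Server:
    # Managing node: holds the best cost and placement seen so far.
    def __init__(self, placement, cost):
        self.best_placement, self.best_cost = list(placement), cost

    def reconcile(self, placement, cost):
        # Client query: export the client's solution if it is better,
        # otherwise the client imports the server's placement.
        if cost < self.best_cost:
            self.best_placement, self.best_cost = list(placement), cost
        return list(self.best_placement), self.best_cost

def client_search(server, placement, cost_fn, steps, period):
    cost = cost_fn(placement)
    for step in range(1, steps + 1):
        # Stand-in for one StocE move: random adjacent swap, kept if not worse.
        i = random.randrange(len(placement) - 1)
        placement[i], placement[i + 1] = placement[i + 1], placement[i]
        new_cost = cost_fn(placement)
        if new_cost <= cost:
            cost = new_cost
        else:
            placement[i], placement[i + 1] = placement[i + 1], placement[i]
        if step % period == 0:            # periodic, asynchronous exchange
            placement, cost = server.reconcile(placement, cost)
    return cost

def inversions(s):
    # Toy placement cost: pair inversions (0 = fully ordered).
    return sum(1 for i in range(len(s))
               for j in range(i + 1, len(s)) if s[i] > s[j])

random.seed(0)
start = [7, 3, 5, 1, 6, 0, 4, 2]
server = Server(start, inversions(start))
for _ in range(3):                        # three "clients", run serially here
    client_search(server, list(start), inversions, steps=200, period=20)
# server.best_cost is never worse than the initial placement's cost
```

Because each client reconciles on its own schedule, no global synchronization point is needed, which is the property that makes the scheme asynchronous.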
Asynchronous
Markov Chain
   (AMC)

 Master Process




                  35
Asynchronous
Markov Chain
   (AMC)

 Slave Process




                 36
Fixed Pattern Row Division


 A randomized search technique that carries out the search by
  dividing the work among the participating processors.

 Row-Division is an example of Domain Decomposition strategy.




                                                             37
Fixed Pattern Row Division

 More promising than AMC, since it reduces the effective load on
   each working node.

 Fair distribution of rows among processors.

 Each processor is assigned two sets of rows and is responsible
   for swapping cells between them.

 The row sets alternate in every iteration.

 Not much communication overhead, since the processors do not
   need to exchange information or synchronize during iterations.



                                                                    38
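The fixed-pattern assignment can be sketched as a deterministic partition that alternates between two patterns, e.g. contiguous blocks on even iterations and interleaved (round-robin) rows on odd ones, mirroring the R1..R12 / P1..P3 example on the neighboring slides. The exact alternation pattern is the thesis's design choice; this is an illustrative reconstruction:

```python
def fixed_row_partition(num_rows, num_procs, iteration):
    """Assign rows 1..num_rows to processors, alternating the pattern.

    Even iterations: contiguous blocks; odd iterations: round-robin
    interleaving, so cells can eventually move between any pair of rows."""
    rows = list(range(1, num_rows + 1))
    if iteration % 2 == 0:
        size = num_rows // num_procs
        return [rows[p * size:(p + 1) * size] for p in range(num_procs)]
    return [rows[p::num_procs] for p in range(num_procs)]

# 12 rows, 3 processors:
print(fixed_row_partition(12, 3, 0))  # → [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(fixed_row_partition(12, 3, 1))  # → [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
```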
Fixed Pattern Row Division
[Figure: row assignments to processors in iteration 'i' vs. iteration 'i+1']
                                         39
Fixed Pattern Row Division

[Figure: Iteration 'i' — rows R1-R12 assigned to processors P1, P2, P3
in contiguous blocks]
                                   40
Fixed Pattern Row Division

[Figure: Iteration 'i+1' — the contiguous blocks of rows R1-R12 assigned
to P1, P2, P3 are shifted relative to iteration 'i']
                                    41
Fixed Pattern Row Division

[Figure: Iteration 'i+1' — interleaved assignment: P1 ← R1, R4, R7, R10;
P2 ← R2, R5, R8, R11; P3 ← R3, R6, R9, R12]
                              42
Fixed Pattern
Row Division

Master Process




                 43
Fixed Pattern
Row Division

Slave Process




                44
Random Row Division

 Variation of Fixed Pattern Row Division.

 Instead of two fixed sets of non-overlapping rows, the master
   processor generates the non-overlapping row sets randomly.

 The row sets are broadcast to the slaves in each iteration.

 The apparent advantage of this scheme over the previous one is the
   randomness, which ensures that no row remains with any specific
   processor throughout the search process.



                                                                  45
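The master's random generation of non-overlapping row sets can be sketched as a shuffle-then-split; `random_row_partition` is an illustrative name, and the per-iteration broadcast would be an MPI call in the actual implementation:

```python
import random

def random_row_partition(num_rows, num_procs, rng=random):
    """Randomly partition rows 1..num_rows into non-overlapping sets,
    one per processor, regenerated by the master each iteration."""
    rows = list(range(1, num_rows + 1))
    rng.shuffle(rows)
    size = num_rows // num_procs
    return [sorted(rows[p * size:(p + 1) * size]) for p in range(num_procs)]

parts = random_row_partition(12, 3)
# The sets cover all 12 rows exactly once, but membership changes per call.
assert sorted(r for part in parts for r in part) == list(range(1, 13))
```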
Random Row
  Division

Master Process




                 46
Random Row
  Division

Slave Process




                47
Experiments and Results

 The parallel implementations were tested on a range of
  ISCAS-89 benchmark circuits.
 These benchmark circuits cover a variety of sizes, in terms of
  number of gates and paths.
 Cluster specs:
     Eight generic workstation nodes,
     Intel x86 3.20 GHz; 512 MB DDR RAM,
     Cisco 3550 switch for the cluster interconnect,
     Linux kernel 2.6.9, and
     MPICH ver 1.7 (MPI implementation from Argonne National
       Laboratory).

                                                                  48
Experiments and Results




                          49
Experiments and Results
• Parallel Asynchronous Markov Chain Strategy




                                                50
Experiments and Results
• Parallel Asynchronous Markov Chain Strategy




                                                51
Discussion

 The AMC approach worked well with Simulated Annealing but
  its results with StocE are very limited

 Reasons:
      For Simulated Annealing:
          The acceptance rate is decided by a predictably varying
           temperature value.
         The sequential algorithm itself is not intelligent enough to
          focus the search in neighborhoods of best solutions.




                                                                     52
Discussion

    For Stochastic Evolution
         The acceptance probability for solutions is dependent
          on the parameter po, which varies unpredictably.
         The sequential algorithm itself is intelligent enough to
          focus the search in neighborhoods of best solutions
          and thus it does not gain much through collaborative
          search efforts.




                                                                     53
Experiments and Results
 • Fixed Row Division Strategy




                                 54
Experiments and Results
• Fixed Row Division Strategy




                                55
StocE - Fixed Row Division (s38417)




                                      56
StocE - Fixed Row Division (s35932)




                                      57
Discussion

 The direct workload distribution was the primary reason behind
   the favourable speedup trends seen with this strategy.

 Very good speedups achieved for large circuits with large
   number of rows.

 The strategy, however, does not perform well for smaller circuits,
   as the runtime gains achieved by dividing the computation quickly
   saturate due to the small number of rows.




                                                                   58
Experiments and Results
• Random Row Division Strategy




                                 59
Experiments and Results
• Random Row Division Strategy




                                 60
Discussion

 The random row-division further improved the speedup when
  compared to Fixed Pattern row-division.


 Speedup increases with random row-division because the probability
  of a cell moving to any location in the solution becomes non-zero
  in the first iteration, whereas fixed row-division needs two
  iterations to achieve this non-zero probability; this reduces the
  overall runtime.




                                                                     61
Comparisons

 Compared parallel implementations of StocE with SA and SimE.

 Comparison is made
      For fitness values best achieved by StocE,
      Among the best strategies that were implemented for
       individual heuristics, and
      Among the same parallel strategies implemented.

 Results are respective to the parallel environment and problem
  instances.




                                                                   62
Comparisons                                        Contd…




 Comparisons have been made across
      StocE - Random Row Division
      StocE - Fixed Row Division
      Simulated Annealing - AMMC Approach
      Simulated Annealing - Fixed Row Division
      Simulated Evolution - Random Row Division




                                                            63
Comparisons                              Contd…




•   Simulated Annealing - AMC Strategy




                                                  64
    Comparisons                                                  Contd…


•    Simulated Annealing - AMC Strategy
[Line chart: Speedup vs. Number of Processors (3-7) for circuits s1494,
s3330, s5378, s9234, s15850]
                                                                          65
Comparisons                                   Contd…




•   Simulated Annealing – Fixed Row-Division Strategy




                                                        66
 Comparisons                                                              Contd…


• Simulated Annealing – Fixed Row Division Strategy
[Line chart: Speedup vs. Number of Processors (2-6) for circuits s5378,
s9234, s15850]
                                                                                   67
 Comparisons                                 Contd…


• Simulated Evolution – Random Row Division Strategy




                                                       68
 Comparisons                                                       Contd…


• Simulated Evolution – Random Row Division Strategy
[Line chart: Speedup vs. Number of Processors (2-5) for circuits s1494,
s3330, s5378, s9234]
                                                                            69
Comparisons (s15850)   Contd…




                                70
Comparisons (s9234)   Contd…




                               71
Comparisons (s5378)   Contd…




                               72
Comparisons (s3330)   Contd…




                               73
Comparisons (s1494)   Contd…




                               74
Discussion
 StocE - Random Row Division was found to be best among all
  the strategies in terms of run-time reduction and useful
  speedups.

 StocE - Fixed Row Division performs better on increasing the
  circuit size. For smaller circuits, it hits the saturation point much
  earlier.

 SA - AMC outperforms SA - Fixed row division as the circuit size
  increases.

 SimE - Random Row Division failed to perform for multi-objective
  optimization, though the same algorithm has been reported to
  work well for single-objective optimization.

                                                                          75
Contributions

1.    Studied and analyzed all three possible parallelization
      models for StocE.
        Low-level parallelization of StocE appeared to be an
         ineffective approach given the distributed computing
         environment.
        The parallel search model was designed and implemented
         for StocE as the AMC approach and was found to give very
         limited speedups.
        Domain decomposition: designed and implemented two
         strategies, both of which give excellent performance
         compared to the parallel implementations of other
         heuristics.




                                                                    76
Contributions

2.    The StocE-AMC approach yielded runtime gains only for very few
      processors, whereas the same parallelization scheme has been
      reported to work well with Simulated Annealing.
         Results were analyzed and justified.



3.   Row-based division method distributed the workload
     effectively, allowing very good speedups for large circuits with
     large number of rows.




                                                                    77
Contributions

4.   Row-based division, however, does not perform well for smaller
     circuits, as the runtime gains achieved by dividing the
     computation quickly saturate.

5.   Row-based division was further enhanced by modifying the
     row distribution method which further reduced the run-times.

6.   The run-times achieved were compared against those of
     Simulated Annealing and Simulated Evolution and were found to
     be the lowest among the three heuristics.




                                                                    78
Conclusion & Future Work

 This research work was focused on
       Improving the run-times of Stochastic Evolution applied for
        VLSI cell placement
       Comparing the results with the minimum run-times achieved
        with other iterative heuristics.

 Applied three different parallelization schemes to Stochastic
   Evolution
     Asynchronous Markov Chains (AMC),
     Fixed Row-Division, and
     Random Row-Division.




                                                                  79
Conclusion & Future Work

 The following classes of strategies can be the focus of future
   work:
     Variants of the ones reported,
     New strategies, and
     Hybrid models.


 Combining the characteristics of StocE with Tabu Search’s
   memory components may very well lead to even further runtime
   reduction and speedup.




                                                                   80
Thank You




            81
Stochastic Evolution
(Algorithm)




                       82
Stochastic Evolution
(Algorithm)




                       83
ISCAS89 Benchmark Circuits




                             84