pldi97

					      Dynamic Feedback:
    An Effective Technique
   for Adaptive Computing

     Pedro Diniz and Martin Rinard

     Department of Computer Science
  University of California, Santa Barbara
http://www.cs.ucsb.edu/~{pedro,martin}
            Basic Issue:
Efficient Implementation of Atomic
Operations in Object-Based Languages

           Approach:
    Reduce Lock Overhead by
   Coarsening Lock Granularity

            Problem:
   Coarsening Lock Granularity
          May Reduce
     Available Concurrency
         Solution: Dynamic Feedback

• Multiple Lock Coarsening Policies

• Dynamic Feedback
   • Generate Multiple Versions of Code
   • Measure Dynamic Overhead of Each Policy
   • Dynamically Select Best Version


• Context
   • Parallelizing Compiler
      • Irregular Object-Based Programs
      • Pointer-Based Data Structures
   • Commutativity Analysis
                    Talk Outline

• Lock Coarsening

• Dynamic Feedback

• Experimental Results

• Related Work

• Conclusions
              Model of Computation

• Parallel Programs
   • Serial Phases
   • Parallel Phases

• Atomic Operations on Shared Objects
   • Mutual Exclusion Locks
   • Acquire Constructs: L.acquire()
   • Release Constructs: L.release()
   • Mutual Exclusion Region: Code Between L.acquire() and L.release()

[Figure: execution timeline alternating Serial Phase, Parallel Phase, Serial Phase, with the Atomic Operations labeled.]
Problem: Lock Overhead


        L.acquire()


        L.release()


        L.acquire()


        L.release()
              Solution: Lock Coarsening

     Original                         After Lock Coarsening

     L.acquire()                      L.acquire()
       ...                              ...
     L.release()                        ...
       ...                              ...
     L.acquire()                        ...
       ...                              ...
     L.release()                      L.release()

Reference: Diniz and Rinard
        “Synchronization Transformations for Parallel Computing”, POPL97
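
The transformation can be illustrated with a minimal Python sketch. It is not the compiler's output: `CountingLock`, `Body`, and the two `add_force_*` functions are hypothetical names chosen for illustration, and the lock simply counts how many acquires execute so the saving is visible.

```python
import threading

class CountingLock:
    """Mutual exclusion lock that counts executed acquires (illustrative helper)."""
    def __init__(self):
        self._lock = threading.Lock()
        self.acquires = 0

    def acquire(self):
        self._lock.acquire()
        self.acquires += 1          # updated while holding the lock

    def release(self):
        self._lock.release()

class Body:
    """Hypothetical shared object with one lock L, as in the slides."""
    def __init__(self):
        self.L = CountingLock()
        self.fx = 0.0
        self.fy = 0.0

def add_force_original(b, dx, dy):
    # Original: each atomic update has its own acquire/release pair.
    b.L.acquire()
    b.fx += dx
    b.L.release()
    b.L.acquire()
    b.fy += dy
    b.L.release()

def add_force_coarsened(b, dx, dy):
    # After lock coarsening: one acquire/release pair covers both updates.
    b.L.acquire()
    b.fx += dx
    b.fy += dy
    b.L.release()
```

Both versions compute the same result; the coarsened one executes half as many acquire/release pairs but holds the lock across both updates.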
          Lock Coarsening Trade-Off

• Advantage:
   • Reduces Number of Executed Acquires and Releases
   • Reduces Acquire and Release Overhead


• Disadvantage: May Introduce False Exclusion
   • Multiple Processors Attempt to Acquire Same Lock
   • Processor Holding the Lock is Executing Code that was
     Originally Outside Any Mutual Exclusion Region
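                       False Exclusion

[Figure: two processors, Original vs. After Lock Coarsening. In the original code each processor executes separate L.acquire()/L.release() regions. After lock coarsening, the second processor's L.acquire() must wait while the first processor holds L across the enlarged region; this wait is labeled False Exclusion.]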
         Lock Coarsening Policy

                      Goal:
   Limit Potential Severity of False Exclusion

                  Mechanism:
        Multiple Lock Coarsening Policies

• Original:     Never Coarsen Granularity
• Bounded:      Coarsen Granularity Only Within
                Cycle-Free Subgraphs of the Interprocedural
                Control Flow Graph (ICFG)
• Aggressive:   Always Coarsen Granularity
              Choosing Best Policy

• Best Lock Coarsening Policy May Depend On
   • Topology of Data Structures
   • Dynamic Schedule Of Computation


• Information Required to Choose Best Policy
  Unavailable at Compile Time

• Complications
   • Different Phases May Have Different Best Policy
   • In Same Phase, Best Policy May Change Over Time
                 Solution: Dynamic Feedback

• Generated Code Executes
     • Sampling Phases: Measure Performance of Different Policies
     • Production Phases : Use Best Policy From Sampling Phase


• Periodically Resample to Discover Best Policy Changes
[Figure: Overhead vs. Time, annotated with the Code Version in use. Sampling Phase: Original, Bounded, Aggressive. Production Phase: Aggressive. Next Sampling Phase: begins with Original.]
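
The alternation of sampling and production phases can be sketched as a small simulation over a logical clock. This is an illustrative model, not the generated code: `dynamic_feedback`, `toy_overhead`, and the numeric overheads are assumptions made up for the example.

```python
def dynamic_feedback(overhead, policies, sampling, production, total):
    """Simulate dynamic feedback on a logical clock.

    overhead(policy, t) is a hypothetical model returning the overhead
    fraction of a policy at time t. Returns (phase, policy, start, end) tuples.
    """
    t = 0.0
    trace = []
    while t < total:
        sampled = {}
        for p in policies:                       # sampling phase: try each version
            sampled[p] = overhead(p, t)
            trace.append(("sampling", p, t, t + sampling))
            t += sampling
        best = min(policies, key=lambda p: sampled[p])
        trace.append(("production", best, t, t + production))
        t += production                          # production phase: run the best version
    return trace

def toy_overhead(policy, t):
    # Made-up overheads: Aggressive is best at first, then degrades at t >= 5.
    base = {"original": 0.3, "bounded": 0.2, "aggressive": 0.1}
    if policy == "aggressive" and t >= 5.0:
        return 0.9
    return base[policy]

trace = dynamic_feedback(toy_overhead, ["original", "bounded", "aggressive"],
                         sampling=1.0, production=5.0, total=16.0)
chosen = [policy for phase, policy, _, _ in trace if phase == "production"]
```

In this toy run the periodic resampling is what lets the selection move away from Aggressive once its overhead rises.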
        Guaranteed Performance Bounds

• Assumptions:
   • Overhead Changes Bounded by Exponential Decay Functions
• Worst Case Scenario:
   •   No Useful Work During Sampling Phase
   •   Sampled Overheads Are Same For All Versions
   •   Overhead of Selected Version Increases at Maximum Rate
   •   Overhead of Other Versions Decreases at Maximum Rate
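[Figure: Overhead vs. Time in the worst case scenario for the selected version V0, over the interval sequence S, S, S, P (sampling intervals followed by a production interval).]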
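              Guaranteed Performance Bound

Definition 1. Policy $p_i$ is at Most $\epsilon$ Worse Than Policy $p_j$
   over a Time Interval $T$ if

   \[ \mathrm{Work}_j^T - \mathrm{Work}_i^T \le \epsilon T
      \quad\text{where}\quad
      \mathrm{Work}_i^T = \int_0^T (1 - o_i(t))\,dt \]

Definition 2. Dynamic Feedback is at Most $\epsilon$ Worse
   Than the Optimal if

   \[ \mathrm{Work}_{opt}^{P+SN} - \mathrm{Work}_0^P \le \epsilon (P+SN)
      \quad\text{where}\quad
      \mathrm{Work}_{opt}^{P+SN} = \int_0^{P+SN} (1 - o_1(t))\,dt \]

Result 1. To Guarantee this Bound

   \[ (1 - \epsilon) P + \frac{1}{\lambda} e^{-\lambda P} \le (\epsilon - 1) SN + \frac{1}{\lambda} \]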
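       Guaranteed Performance Bounds

[Figure: Constraint Values vs. Production Interval P, plotting the curve $(1 - \epsilon) P + \frac{1}{\lambda} e^{-\lambda P}$ against the constant $(\epsilon - 1) SN + \frac{1}{\lambda}$. The Feasible Region is the range of P between the two intersection points.]

• Production Interval Too Short: Unable to Amortize Sampling Overhead
• Production Interval Too Long: May Execute Suboptimal Policy for Long Time
• Basic Constraint: Decay Rate ($\lambda$) Must be Small Enough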
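
As a rough numerical illustration, the sketch below checks the constraint (1 - ε)P + (1/λ)e^(-λP) ≤ (ε - 1)SN + 1/λ for given parameters. The function names and the sample values of ε, λ, S, and N are assumptions chosen for the example, not figures from the talk.

```python
import math

def bound_holds(P, eps, lam, S, N):
    """True if production interval P satisfies the constraint."""
    lhs = (1.0 - eps) * P + math.exp(-lam * P) / lam
    rhs = (eps - 1.0) * S * N + 1.0 / lam
    return lhs <= rhs

def best_production_interval(eps, lam):
    """P minimizing the left-hand side: solve (1 - eps) - exp(-lam * P) = 0."""
    return -math.log(1.0 - eps) / lam

def feasible_region_exists(eps, lam, S, N):
    # The left-hand side is convex in P, so the constraint can be met
    # for some P exactly when it is met at the minimizer.
    return bound_holds(best_production_interval(eps, lam), eps, lam, S, N)

# Hypothetical parameters: eps = 0.1, S = 10 ms, N = 3 versions.
small_decay = feasible_region_exists(eps=0.1, lam=0.01, S=0.01, N=3)
large_decay = feasible_region_exists(eps=0.1, lam=1.0, S=0.01, N=3)
```

With these made-up values the small decay rate admits a feasible production interval and the large one does not, which matches the basic constraint that the decay rate must be small enough. P = 0 never satisfies the constraint when SN > 0, reflecting a production interval too short to amortize sampling.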
   Dynamic Feedback: Implementation

• Code Generation

• Measuring Policy Overhead

• Interval Selection

• Interval Expiration

• Policy Switch
                   Code Generation

• Statically Generate Different Code Versions for
  Each Policy
   • Alternative: Dynamic Code Generation

• Advantages of Static Code Generation:
   • Simplicity of Implementation
   • Fast Policy Switching

• Potential Drawback of Static Code Generation
   • Code Size (In Practice Not a Problem)
           Measuring Policy Overhead

 • Sources of Overhead
    • Locking Overhead
    • Waiting Overhead


 • Compute Locking Overhead
    • Count Number of Executed Acquire/Release Constructs


 • Estimate Waiting Overhead
    • Count Number of Spins on Locks Waiting to be Released
 \[ \text{Sampled Overhead} = \frac{(\text{Number of Spins} \times \text{Spin Time}) + (\text{Number of Acquire/Release} \times \text{Acquire/Release Execution Time})}{\text{Sampling Time}} \]
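
The sampled overhead computation can be written out directly. This is a minimal sketch: the function name, parameter names, and the counter and timing values in the usage line are hypothetical.

```python
def sampled_overhead(num_spins, spin_time, num_acquire_release,
                     acquire_release_time, sampling_time):
    """Fraction of the sampling interval spent waiting on and executing locks.

    All times are in seconds; the counts come from instrumentation counters.
    """
    waiting = num_spins * spin_time                        # estimated waiting overhead
    locking = num_acquire_release * acquire_release_time   # computed locking overhead
    return (waiting + locking) / sampling_time

# Made-up measurements over a 10 millisecond sampling interval.
overhead = sampled_overhead(num_spins=1000, spin_time=1e-6,
                            num_acquire_release=2000, acquire_release_time=2e-6,
                            sampling_time=0.01)
```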
      Interval Selection and Expiration

• Fixed Interval Values
   • Sampling Interval: 10 milliseconds
   • Production Interval: 10 seconds
   • Good Results for Wide Range of Interval Values


• Polling Code for Expiration Detection
   • Location: Back Edges of Parallel Loop
   • Advantage: Low Overhead
   • Disadvantage: Potential Interaction with
     Iteration Size

[Figure: loop iterations containing Atomic Operations, with Polling Points at the loop back edges.]
                     Policy Switch

• Synchronous
   • Processors Poll Timer to Detect Interval Expiration
   • Barrier At End of Each Interval


• Advantages:
   • Consistent Transitions
   • Clean Overhead Measurements
• Disadvantages:
   • Need to Synchronize All Processors
   • Potential Idle Time At Barrier
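
A synchronous switch of this kind can be sketched with Python threads standing in for processors. This is an assumption-laden illustration, not the implementation from the talk: `run_intervals` and its parameters are hypothetical, and the sketch simply cycles through the versions at each expiration rather than selecting the one with the lowest sampled overhead.

```python
import threading
import time

POLICIES = ["original", "bounded", "aggressive"]   # version names from the slides

def run_intervals(num_threads=4, interval=0.005, num_intervals=3):
    """Run timed intervals, switching the policy synchronously at each expiration."""
    state = {
        "policy": 0,                               # index of the active code version
        "deadline": time.monotonic() + interval,   # end of the current interval
        "completed": 0,                            # intervals finished so far
    }
    history = []                                   # policy used in each interval

    def switch_policy():
        # Runs in exactly one thread, after every thread has reached the
        # barrier and before any of them is released.
        history.append(POLICIES[state["policy"]])
        state["policy"] = (state["policy"] + 1) % len(POLICIES)
        state["completed"] += 1
        state["deadline"] = time.monotonic() + interval

    barrier = threading.Barrier(num_threads, action=switch_policy)

    def worker():
        while state["completed"] < num_intervals:
            # One parallel loop iteration would execute here, using the
            # code version selected by state["policy"].
            if time.monotonic() >= state["deadline"]:   # poll at the loop back edge
                barrier.wait()                          # idle until all threads arrive

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return history
```

Because the switch happens only inside the barrier action, every thread runs the same version throughout an interval, which is what keeps the transitions consistent and the overhead measurements clean. The time a fast thread spends in `barrier.wait()` is the potential idle time listed above.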
               Experimental Results

• Parallelizing Compiler Based on Commutativity
  Analysis [PLDI’96]

• Set of Complete Scientific Applications
   • Barnes-Hut N-Body Solver (1500 lines of C++)
   • Liquid Water Simulation Code (1850 lines of C++)
   • Seismic Modeling String Code (2050 lines of C++)


• Different Lock Coarsening Policies

• Dynamic Feedback

• Performance on Stanford DASH Multiprocessor
                                             Code Sizes

[Figure: three bar charts of Size Text Segment (Kbytes, axis 0 to 60) for the Serial, Original, and Dynamic versions of Barnes-Hut, Water, and String.]
                                              Lock Overhead

                     Percentage of Time that the Single Processor Execution
                        Spends Acquiring and Releasing Mutual Exclusion
                                               Locks

[Figure: three bar charts of Percentage Lock Overhead (axis 0 to 60). Barnes-Hut (16K Particles): Original, Bounded, Aggressive. Water (512 Molecules): Original, Bounded, Aggressive. String (Big Well Model): Original, Aggressive.]
                                           Contention Overhead
                          Percentage of Time that Processors Spend Waiting to
                                 Acquire Locks Held by Other Processors

[Figure: three plots of Contention Percentage (0 to 100) vs. Processors (0 to 16) for the Aggressive, Bounded, and Original policies: Barnes-Hut (16K Particles), Water (512 Molecules), and String (Big Well Model).]
Performance Results: Barnes-Hut

[Figure: Speedup (0 to 16) vs. Number of Processors (0 to 16) for Barnes-Hut on DASH (16K Particles). Curves in legend order: Ideal, Aggressive, Dynamic Feedback, Bounded, Original.]
Performance Results: Water

[Figure: Speedup (0 to 16) vs. Number of Processors (0 to 16) for Water on DASH (512 Molecules). Curves in legend order: Ideal, Bounded, Dynamic Feedback, Original, Aggressive.]
Performance Results: String

[Figure: Speedup (0 to 16) vs. Number of Processors (0 to 16) for String on DASH (Big Well Model). Curves in legend order: Ideal, Original, Dynamic Feedback, Aggressive.]
                    Summary

• Code Size Is Not An Issue

• Lock Coarsening Has Significant Performance Impact

• Best Lock Coarsening Policy Varies With Application

• Dynamic Feedback Delivers Code With Performance
  Comparable to The Best Static Lock Coarsening Policy
                      Related Work

• Adaptive Execution Techniques (Saavedra Park:PACT96)

• Dynamic Dispatch Optimizations (Hölzle Ungar:PLDI94)

• Dynamic Code Generation (Engler:PLDI96)

• Profiling (Brewer:PPoPP95)

• Synchronization Optimizations (Plevyak et al:POPL95)
                      Conclusions

• Dynamic Feedback
   • Generated Code Adapts to Different Execution Environments


• Integration with Parallelizing Compiler
   • Irregular Object-Based Programs
   • Pointer-Based Linked Data Structures
   • Commutativity Analysis


• Evaluation with Three Complete Applications
   • Performance Comparable to Best Hand-Tuned Optimization
BACKUP SLIDES
Performance Results: Barnes-Hut

[Figure: Speedup (0 to 16) vs. Number of Processors (0 to 16) for Barnes-Hut (16K Particles). Curves in legend order: Ideal, Aggressive, Bounded, Original.]
Performance Results: Water

[Figure: Speedup (0 to 16) vs. Number of Processors (0 to 16) for Water (512 Molecules). Curves in legend order: Ideal, Bounded, Original, Aggressive.]
Performance Results: String

[Figure: Speedup (0 to 16) vs. Number of Processors (0 to 16) for String (Big Well Model). Curves in legend order: Ideal, Original, Aggressive.]
           Policy Switch

[Figure: timeline showing execution under Policy 1 until the Timer Expires, followed by execution under Policy 2 until the Timer Expires again.]
                  Motivation

                   Challenges:
• Match Best Implementation to Environment
• Heterogeneous and Mobile Systems

                      Goal:
• Develop Mechanisms to Support Code that Adapts
  to Environment Characteristics

                     Technique:
• Dynamic Feedback
                         Overhead for Barnes-Hut

[Figure: Sampled Overhead (0 to 0.5) vs. Execution Time (0 to 25 Seconds) for the Original, Bounded, and Aggressive versions. Barnes-Hut on DASH (8 Processors), FORCES Loop, Data Set: 16K Particles.]
                              Overhead for Water

[Figure: Sampled Overhead (0 to 0.5) vs. Execution Time (0 to 60 Seconds) for the Original and Bounded versions. Water on DASH (8 Processors), INTERF Loop, Data Set: 512 Molecules.]
                             Overhead for Water

[Figure: Sampled Overhead (0 to 1) vs. Execution Time (0 to 60 Seconds) for the Aggressive and Original versions. Water on DASH (8 Processors), POTENG Loop, Data Set: 512 Molecules.]
                              Overhead for String

[Figure: Sampled Overhead (0 to 1) vs. Execution Time (0 to 500 Seconds) for the Aggressive and Original versions. String on DASH (8 Processors), PROJFWD Loop, Data Set: Big Well.]
                               Dynamic Feedback

[Figure: Overhead vs. Time, annotated with the Code Version in use. Sampling Phase: Aggressive, Bounded, Original. Production Phase: Aggressive. A second Sampling Phase follows.]

				