					The Return of Synthetic Benchmarks
          Ajay M. Joshi (UT Austin)
     Lieven Eeckhout (Ghent University)
          Lizy K. John (UT Austin)
                January 28, 2008

        Laboratory of Computer Architecture
  Department of Electrical & Computer Engineering
         The University of Texas at Austin
                   Outline
 The Need for Synthetic Benchmarks
 BenchMaker Framework for Benchmark Synthesis
 Workload Characteristics Used in Synthesis
 Synthetic Benchmark Construction
 Evaluation of BenchMaker
 Applications
 Summary
                       Benchmark Spectrum
 Benchmark types, from least to most representative:
  – Toy Benchmarks, e.g. Heap sort
  – Microbenchmarks, e.g. STREAM
  – Synthetic Benchmarks, e.g. Dhrystone, Whetstone
  – Kernel Codes, e.g. Livermore Loops
  – Application Suites, e.g. SPEC CPU
  – Complete Application Code

 Toward toy benchmarks                     Toward complete applications
 Less Development Effort                   More Development Effort
 More Scalable                             Less Scalable
 More Maintainable                         Less Maintainable
 Less Representative                       More Representative
         Focus on Simulation Time Reduction
 Benchmark Explosion
  • Benchmark Subsetting
    [Eeckhout et al., PACT’02] [Vandierendonck et al., CAECW’04]
    [Phansalkar et al., ISPASS’05] [Eeckhout et al., IISWC’05]

 Benchmark Run Length
  • Statistical Sampling
    [Conte et al., ICCD’96] [Wunderlich et al., ISCA’03]
  • Representative Sampling
    [Sherwood et al., ASPLOS’02]
  • Reduced Input Set
    [KleinOsowski, CAN’04]
  • Statistical Simulation & Synthetic Workloads
    [Oskin et al., ISCA’00] [Eeckhout et al., ISPASS’00]
    [Nussbaum et al., PACT’01] [Bell et al., ICS’05]

 Microprocessor Complexity
  • Analytical Modeling
    [Noonburg et al., MICRO’94] [Karkhanis et al., ISCA’04]
  • Speedup Simulation
    [Schnarr et al., ASPLOS’98] [Loh et al., SIGMETRICS’01]
      Motivation : Benchmarking Challenges
     Using Real-World Applications as Benchmarks
       Proprietary Nature of Real-World Applications

      Single-Point Performance Characterization
      Application Benchmarks are Rigid

      Applications Evolve Faster than Benchmarks
       Benchmark Suites are Costly to Develop, Maintain, and Upgrade

  Studying Commercial Workload Performance
       Early Design Stage Power/Performance Studies

Usefulness of Synthetic Benchmarks Beyond Simulation Time Reduction
Resurgence of Synthetic Benchmarks ...
 [Figure: IEEE Computer, August 2003]
   Workload Synthesis: Central Idea
 Application Behavior Space, described by just 40 workload characteristics
  – Instruction Mix
  – Program Locality
  – Control Flow Behavior
  – Instruction Level Parallelism

 ‘Knobs’ for Changing Program Characteristics
  feed the Workload Synthesis Algorithm (Workload Synthesizer)

 Synthetic Benchmark
      ADD   R1,  R2,  R3
      LD    R4,  R1,  R6
      MUL   R3,  R6,  R7
      ADD   R3,  R2,  R5
      DIV   R10, R2,  R1
      SUB   R3,  R5,  R6
      STORE R3,  R10, R20
      ADD   R1,  R2,  R3
      LD    R4,  R1,  R6
      MUL   R3,  R6,  R7
      BEQ   R3,  R6,  LOOP
      SUB   R3,  R5,  R6
      STORE R3,  R10, R20
      DIV   R10, R2,  R1
      ...

 Compile and Execute
  – Real Hardware or RTL
  – Execution Driven Simulator
           Modeling Real-World Applications
 1. Microarchitecture-Independent Workload Profiling
  – Real World Proprietary Workload
  – Workload Profiler: Binary Instrumentation OR Simulation

 2. Modeling Workload Attributes into Synthetic Workload
  – Workload Profile = Workload Attributes + Distribution Of Attribute Values
  – Workload Synthesizer
  – Synthetic Benchmark Clone

 3. Experiment Environment
  – Real Hardware
  – Execution Driven Simulator
           Workload Characteristics as ‘Knobs’
Category                Num.   Characteristic
instruction mix         10     percentage of integer short latency
                               percentage of integer long latency
                               percentage of floating-point short latency
                               percentage of floating-point long latency
                               percentage of integer load
                               percentage of integer store
                               percentage of floating-point load
                               percentage of floating-point store
                               percentage of branches



Instruction-level       8      register dependency distance: 8 distributions for register
parallelism                    dependencies. The percentage of dependencies that have a distance
                               of 1 instruction, and the percentage that have a distance of up to
                               2, 4, 6, 8, 16, 32, and greater than 32 instructions.



data locality           1      data footprint
                        10     distribution of local stride values

instruction locality    1      instruction footprint

branch predictability   10     distribution of branch transition rate
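
The "Num." column can be checked against the "just 40 workload characteristics" figure with a few lines of Python. This is only an illustrative tally; the dictionary name and key spellings are assumptions and not part of BenchMaker.

```python
# Number of 'knobs' per row of the table above (key names are illustrative).
KNOB_COUNTS = {
    "instruction mix": 10,
    "instruction-level parallelism": 8,
    "data footprint": 1,
    "distribution of local stride values": 10,
    "instruction footprint": 1,
    "distribution of branch transition rate": 10,
}

total_knobs = sum(KNOB_COUNTS.values())
print(total_knobs)  # 40
```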



     Capturing The Essence of Workloads

 Attributes to capture inherent workload behavior
  – Data Locality: Dominant strides of static Load/Store
  – Control Flow Predictability: Branch transition rate

 Modeling Locality & Control Flow Predictability
  – Data Locality of Integer, Scientific, and Embedded
    Workloads effectively modeled using circular streams
  – Replicating transition-rate of static branches
           Modeling Data Access Pattern
• Identify streams of data references
• A Stream?
  – Sequence of memory addresses in an arithmetic progression
  – Elements of arrays A, B, and C form 3 streams

  for (ii = 0; ii < N; ii++)
        A[ii] = B[ii] + C[ii]

   A: 200, 204, 208 ..     B: 320, 324, 328 ..     C: 404, 408, 412 ...
   Issuing Sequence : 320, 404, 200, 324, 408, 204 ….


• Streams are interleaved and may contain noise
    4, 8, 12, 16, 1, 3, 20, 24, 5, 7, 2, 9, 11, 28 …
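
As a rough sketch of how the three streams interleave, the following Python reproduces the issuing sequence from the loop example. The base addresses and the 4-byte element size are taken from the slide; the function name and the load-B, load-C, store-A ordering per iteration are assumptions that match the sequence shown.

```python
def issuing_sequence(n, a_base=200, b_base=320, c_base=404, elem_size=4):
    """Addresses touched by A[ii] = B[ii] + C[ii] for ii in 0..n-1.

    Assumes each iteration loads B[ii], loads C[ii], then stores A[ii].
    """
    addresses = []
    for ii in range(n):
        offset = ii * elem_size
        addresses.append(b_base + offset)  # load  B[ii]
        addresses.append(c_base + offset)  # load  C[ii]
        addresses.append(a_base + offset)  # store A[ii]
    return addresses

print(issuing_sequence(2))  # [320, 404, 200, 324, 408, 204]
```

Each of the three streams is an arithmetic progression with stride 4; only the interleaving makes the combined sequence look irregular.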
                    Extracting Streams
 Reference pattern of static Load / Store Instructions
  – PC-correlated spatial locality
    - Dependence on address referenced by nearby Ld / St
    - Programs with pointer chasing codes
  – PC-correlated temporal locality
    - Dependence on previous address generated by same Ld / St
    - Programs with multidimensional arrays

 Could static Load / Store instructions be natural
 sources of streams ?

 Profile every static Load / Store instruction
   – Number of different strides with which it accesses data
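
A minimal sketch of such a per-instruction stride profile, assuming the profiler sees a trace of (instruction id, address) pairs. The trace format, the instruction labels, and the function name are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

def profile_strides(trace):
    """Histogram of strides seen by each static load/store.

    trace: iterable of (instruction_id, address) pairs in issue order.
    The stride is the difference between consecutive addresses issued
    by the same static instruction.
    """
    last_address = {}
    strides = defaultdict(Counter)
    for inst, address in trace:
        if inst in last_address:
            strides[inst][address - last_address[inst]] += 1
        last_address[inst] = address
    return strides

# Hypothetical trace for A[ii] = B[ii] + C[ii], three iterations.
trace = [("ld_B", 320), ("ld_C", 404), ("st_A", 200),
         ("ld_B", 324), ("ld_C", 408), ("st_A", 204),
         ("ld_B", 328), ("ld_C", 412), ("st_A", 208)]

profile = profile_strides(trace)
distinct = {inst: len(hist) for inst, hist in profile.items()}
dominant = {inst: hist.most_common(1)[0][0] for inst, hist in profile.items()}
print(distinct)  # {'ld_B': 1, 'ld_C': 1, 'st_A': 1}
print(dominant)  # {'ld_B': 4, 'ld_C': 4, 'st_A': 4}
```

In this trace every static load/store accesses data with a single stride, which is the case where a static instruction is a natural source of one stream.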
Modeling Instruction Level Parallelism
                Dependency Distance
                      ADD R1, R3,R4
                      MUL R5,R3,R2
                      ADD R5,R3,R6
                       LD R4, (R1)
                      SUB R8,R2,R1

         Read After Write Dependency Distance = 3
         (R1 is written by the first ADD and read by LD, three instructions later)

    Measure Distribution of Dependency Distances
 Up to 1, Up to 2, Up to 4, Up to 8, Up to 16, Up to 32, >32
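
The measurement can be sketched in Python as follows. The instruction encoding (destination register plus list of source registers) and the function names are assumptions for illustration. The bucket bounds follow the 8-entry list from the workload characteristics table (1, 2, 4, 6, 8, 16, 32, >32), and each distance is counted once in the smallest bucket that holds it, which is one possible reading of "up to".

```python
import bisect

BUCKET_BOUNDS = [1, 2, 4, 6, 8, 16, 32]  # final bucket is "> 32"

def raw_distances(program):
    """Read-after-write distances, in instructions, to the latest producer."""
    last_writer = {}
    distances = []
    for index, (dest, sources) in enumerate(program):
        for reg in sources:  # read sources before recording the write
            if reg in last_writer:
                distances.append(index - last_writer[reg])
        if dest is not None:
            last_writer[dest] = index
    return distances

def distance_histogram(distances):
    counts = [0] * (len(BUCKET_BOUNDS) + 1)
    for d in distances:
        counts[bisect.bisect_left(BUCKET_BOUNDS, d)] += 1
    return counts

# The five-instruction example from the slide.
program = [
    ("R1", ["R3", "R4"]),  # ADD R1, R3, R4
    ("R5", ["R3", "R2"]),  # MUL R5, R3, R2
    ("R5", ["R3", "R6"]),  # ADD R5, R3, R6
    ("R4", ["R1"]),        # LD  R4, (R1)
    ("R8", ["R2", "R1"]),  # SUB R8, R2, R1
]

distances = raw_distances(program)
print(distances)                      # [3, 4]
print(distance_histogram(distances))  # [0, 0, 2, 0, 0, 0, 0, 0]
```

The distance of 3 is the ADD-to-LD dependency on R1 shown on the slide; the SUB also reads R1, at distance 4.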


      Modeling Control Flow Predictability
 Capture behavior of easy and difficult to predict branches
 Inherent program feature that captures branch behavior
 Transition Rate [ Haungs et al. HPCA’00 ]
   # of Taken-Not Taken transitions / # of times executed
 Branches with low transition-rate (easier to predict)
  TTTTTTTTTN, NNNNNNNNNT
 Branches with high transition-rate (also easier to predict, since the outcome alternates regularly)
  TNTNTNTNTN
 Branches with moderate transition-rate (tougher to predict)
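
The transition-rate formula above can be applied directly to the example outcome strings. The function name and the string encoding of outcomes ('T' for taken, 'N' for not taken) are assumptions made for this sketch.

```python
def transition_rate(outcomes):
    """Number of taken/not-taken transitions divided by times executed."""
    if not outcomes:
        return 0.0
    transitions = sum(
        1 for prev, cur in zip(outcomes, outcomes[1:]) if prev != cur
    )
    return transitions / len(outcomes)

print(transition_rate("TTTTTTTTTN"))  # 0.1
print(transition_rate("NNNNNNNNNT"))  # 0.1
print(transition_rate("TNTNTNTNTN"))  # 0.9
```

Rates near 0 or near 1 correspond to the regular patterns above; rates in between correspond to the branches that are tougher to predict.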
                                      Workload Synthesis (1)
 Workload Profile
  – Instruction Mix
  – Register Dependency Distance
  – Stride Pattern of Load/Store
  – Branch Transition Rate
  – Branch Transition Probabilities, over basic blocks A, B, C, D:
      A: BR to B (0.8) or C (0.2)
      B: BR to D (1.0)
      C: BR to D (1.0)
      D: BR with probabilities 0.9 and 0.1

 1 Big Loop (sequence of basic blocks)
      A B D   A B D   A C D   A B D
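
One way to picture how a basic-block sequence for the big loop could be drawn from the branch transition probabilities is a seeded random walk over the block graph. This is a sketch under stated assumptions, not the BenchMaker algorithm: the targets of block D's branch are not given on the slide, so the walk simply returns to A after D, as in the unrolled sequence shown, and D's 0.9/0.1 probabilities are not modeled. All names are illustrative.

```python
import random

# Successor probabilities for blocks A, B, C from the slide's graph.
# Assumption: D always continues with A, as in the "1 Big Loop" sequence.
SUCCESSORS = {
    "A": [("B", 0.8), ("C", 0.2)],
    "B": [("D", 1.0)],
    "C": [("D", 1.0)],
    "D": [("A", 1.0)],
}

def big_loop(num_blocks, seed=0):
    """Return a sequence of basic-block names, starting at A."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    block = "A"
    sequence = []
    for _ in range(num_blocks):
        sequence.append(block)
        targets, weights = zip(*SUCCESSORS[block])
        block = rng.choices(targets, weights=weights)[0]
    return sequence

loop = big_loop(12)
print(" ".join(loop))
```

Every group of three blocks is either A B D or A C D, with A B D expected about four times out of five.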


                                      Workload Synthesis (2)
 Same workload profile and 1 Big Loop of basic blocks as in step (1)
 Added in this step:
  – Memory Access Model (Strides)
                                      Workload Synthesis (3)
 Same workload profile and 1 Big Loop of basic blocks as in step (1)
  – Memory Access Model (Strides)
 Added in this step:
  – Branching Model – Based on Transition Rate
                                      Workload Synthesis (4)
 Same workload profile and 1 Big Loop of basic blocks as in step (1)
  – Memory Access Model (Strides)
  – Branching Model – Based on Transition Rate
 Added in this step:
  – Register Assignment
  – C code with asm & volatile constructs
                Evaluation of BenchMaker
 SPEC CPU2000, SPECjbb2005, and DBT2 workloads
 Validated Sim-Alpha Performance Model of Alpha 21264
       Benchmark      Input              SimPoint(s)
                              SPEC CPU2000 Integer
       bzip2          graphic            553
       crafty         ref                774
       eon            rushmeier          403
       gcc            166.i              389
       gzip           graphic            389
       mcf            ref                553
       perlbmk        perfect-ref        5
       twolf          ref                1066
       vortex         lendian1           271
       vpr            route              476
       gcc            expr               8, 24, 47, 51, 56, 73, 87, 99
                               SPEC CPU95 Integer
       gcc            expr               0, 3, 5, 6, 7, 8, 9, 10, 12
                                                 Performance Correlation
 [Figure: Instructions-Per-Cycle (0 to 1.8), Original Benchmark vs. Synthetic Benchmark, for
  bzip2, crafty, gcc, gzip, mcf, perlbmk, twolf, vortex, vpr, dbt2, dbms, SPECjbb2005]
 Trade Accuracy for Flexibility – Average Error of 11%
                                       Energy/Power Correlation
 [Figure: Energy-Per-Instruction (0 to 35), Original Benchmark vs. Synthetic Benchmark, for
  bzip2, crafty, gcc, gzip, mcf, perlbmk, twolf, vortex, vpr, dbt2, dbms, SPECjbb2005]
 Average Error of 13%
Altering Individual Program Characteristics
 [Figure: Instructions-Per-Cycle (0 to 1.4) vs. Percentage of References with Stride Value 0 (0 to 100)]
Interaction of Program Characteristics
 [Figure: L1 D-cache Miss-Rate (0 to 0.35) vs. Percentage of References with Stride Value 0 (0 to 100),
  one curve each for Data Footprint 300K, 600K, and 900K]
              Modeling Impact of Benchmark Drift
 Increase in Code Footprint (hypothetical)
 [Figure: Instructions-Per-Cycle (0 to 1.2) vs. Factor by which code size is increased (1 to 8)]

 Increase in Data Footprint from SPEC CPU95 to SPEC CPU2000
 for gcc (Model with 7% accuracy)
                   Summary
 Synthetic Benchmarks to Address Benchmarking
  Challenges

 Constructing Synthetic Benchmarks from
  Hardware-Independent Characteristics

 Applications of Synthetic Benchmarks
  - Altering Program Characteristics
  - Studying Interaction of Program Characteristics
  - Modeling Benchmark Drift

      Questions?
Ajay’s email: ajoshi@ece.utexas.edu



