Thesis Presentation

      THESIS DEFENSE (Barcelona, 14 December 2005)




      Microarchitectural Techniques
           to Exploit Repetitive
        Computations and Values

            Carlos Molina Clemente

      Advisors: Antonio González and Jordi Tubella
UPC
                          Outline

     Motivation & Objectives
     Overview of Proposals
         To improve the memory system
         To speed-up the execution of instructions

     Non Redundant Data Cache
     Trace-Level Speculative Multithreaded Arch.
     Conclusions & Future Work


2
                       Motivation
     General by design
         real-world programs
         operating systems
     Often designed with the following in mind
         future expansion
         code reuse
     Input sets have little variation

     Even with aggressive compilers
               Repetition is relatively common
4
    Types of Repetition

                   Repetition




    Computations                Values




         z = F (x, y)
5
              Repetitive Computations

    [Figure: chart with a 0% to 100% y-axis; the x-axis lists the Spec CPU2000 benchmarks and their arithmetic mean (A_Mean)]
6   Spec CPU2000, 500 million instructions
                                Repetitive Values

    [Figure: chart with a 0% to 100% y-axis; the x-axis lists the Spec CPU2000 benchmarks and their arithmetic mean (A_Mean)]
8   Spec CPU2000, 500 million instructions, analysis of destination value
                       Objectives

     Exploit Value Repetition of Store Instructions
         Redundant store instructions
         Non redundant data cache

                   To improve the memory system

     Exploit Computation Repetition of all Instructions
         Redundant computation buffer (ILR)
         Trace-level reuse (TLR)
         Trace-level speculative multithreaded architecture (TLS)

              To speed-up the execution of instructions
9
           Experimental Framework
      Methodology
          Analysis of benchmarks
          Definition of proposal
          Evaluation of proposal

      Tools
          ATOM
          CACTI 3.0
          SimpleScalar Tool Set

      Benchmarks
          Spec CPU95
          Spec CPU2000
10
     Techniques to Improve Memory


                      Value Repetition




       Redundant Stores           Non Redundant Cache




12
     Redundant Store Instructions
      Do NOT modify memory
          STORE (@i, Value Y) to a memory location @i that holds Value X
          If (Value X == Value Y) then Redundant Store
      Contributions
          Redundant stores
          Analysis of repetition into the same storage location
          Redundant stores applied to reduce memory traffic
      Main results
          15%-25% of store instructions are redundant
          5%-20% reduction in memory traffic
13          Molina, González, Tubella, “Reducing Memory Traffic via Redundant Store Instructions”, HPCN’99
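
The redundant-store test on this slide can be illustrated with a minimal sketch. It is not the mechanism from the cited paper: the function name `count_redundant_stores`, the dictionary that stands in for memory, and the rule that a store to a never-written location counts as non-redundant are all assumptions made for illustration.

```python
def count_redundant_stores(stores, memory=None):
    """Count stores that write the value the location already holds.

    stores: iterable of (address, value) pairs in program order.
    memory: optional initial contents (address -> value); hypothetical model.
    Returns (redundant, total).
    """
    memory = {} if memory is None else dict(memory)
    redundant = 0
    total = 0
    for address, value in stores:
        total += 1
        if address in memory and memory[address] == value:
            # Value X == Value Y: memory is not modified, no traffic needed.
            redundant += 1
        else:
            # Non-redundant store: memory is updated.
            memory[address] = value
    return redundant, total


# Made-up store trace, for illustration only.
trace = [(0x10, 5), (0x14, 7), (0x10, 5), (0x10, 6), (0x14, 7)]
redundant, total = count_redundant_stores(trace)
print(f"{redundant} of {total} stores are redundant")
```

In this toy trace the third and fifth stores rewrite the value already present, so they would not need to reach memory.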
        Non Redundant Data Cache
                                       Data Cache
                                       Tag X:  Value A = 1234, Value B = FFFF
                                       Tag Y:  Value C = 0000, Value D = 1234

      If (Value A == Value D) then Value Repetition
      Contributions
           Analysis of repetition in several storage locations
           Non redundant data cache (NRC)
      Main results
           On average, a value is stored 4 times at any given time
           NRC: -32% area, -13% energy, -25% latency, +5% miss
                    Molina, Aliagas, García, Tubella, González, “Non Redundant Data Cache”, ISLPED’03
14     Aliagas, Molina, García, González, Tubella, “Value Compression to Reduce Power in Data Caches”, EUROPAR’03
     Techniques to Speed Up Instruction Execution

                     Computation Repetition




       Data Value Reuse               Data Value Speculation



          Avoid serialization caused by data dependences
          Determine results of instructions without executing them
          Target is to speed-up the execution of programs


16
     Techniques to Speed Up Instruction Execution

                     Computation Repetition




       Data Value Reuse                Data Value Speculation



          NON SPECULATIVE !!!
          Buffers previous inputs and their corresponding outputs
          Only possible if a computation has been done in the past
          Inputs have to be ready at reuse test time
17
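
A minimal sketch of the reuse idea on this slide: a table that buffers previous inputs with their outputs and is consulted by a reuse test before executing. The class name `ReuseBuffer`, the key made of the instruction address plus its operand values, and the FIFO replacement are assumptions for illustration, not the design evaluated in the thesis.

```python
import operator


class ReuseBuffer:
    """Toy reuse table keyed by (pc, operand values). Hypothetical model."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}  # (pc, operands) -> result, in insertion order
        self.hits = 0
        self.misses = 0

    def execute(self, pc, operands, compute):
        key = (pc, tuple(operands))
        if key in self.entries:
            # Reuse test passed: the result is obtained without executing.
            self.hits += 1
            return self.entries[key]
        # First time these inputs are seen: execute and record the result.
        self.misses += 1
        result = compute(*operands)
        if len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))
            del self.entries[oldest]  # FIFO replacement (an assumption)
        self.entries[key] = result
        return result


rb = ReuseBuffer()
for x, y in [(1, 2), (3, 4), (1, 2), (1, 2)]:
    rb.execute(0x400, (x, y), operator.add)
print(rb.hits, rb.misses)
```

The reuse is non-speculative because a hit requires an exact match on inputs that are already known, which is also why the inputs must be ready at reuse test time.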
     Techniques to Speed Up Instruction Execution

                     Computation Repetition




       Data Value Reuse                Data Value Speculation



          SPECULATIVE !!!
          Predicts values as a function of the past history
          Needs to confirm speculation at a later point
          Solves reuse test but introduces misspeculation penalty
18
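
Speculation can be sketched with the simplest predictor that uses past history: predict that an instruction produces the same value as last time, then confirm once the real value is known. This last-value scheme and the names `LastValuePredictor`, `predict`, and `verify` are assumptions chosen for illustration; the slides do not specify which predictor is used.

```python
class LastValuePredictor:
    """Toy last-value predictor indexed by instruction address (hypothetical)."""

    def __init__(self):
        self.table = {}  # pc -> last observed result
        self.correct = 0
        self.mispredicted = 0
        self.unpredicted = 0

    def predict(self, pc):
        # Returns None when there is no history for this instruction.
        return self.table.get(pc)

    def verify(self, pc, predicted, actual):
        # Confirmation step: compare the speculated value with the real one.
        if predicted is None:
            self.unpredicted += 1
        elif predicted == actual:
            self.correct += 1
        else:
            self.mispredicted += 1  # would trigger a misspeculation penalty
        self.table[pc] = actual


lvp = LastValuePredictor()
for actual in [7, 7, 7, 8, 8]:
    guess = lvp.predict(0x400)
    lvp.verify(0x400, guess, actual)
print(lvp.correct, lvp.mispredicted, lvp.unpredicted)
```

Unlike the reuse test, the prediction does not need the inputs to be ready, but every wrong guess has to be detected and repaired later.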
     Techniques to Speed Up Instruction Execution

                       Computation Repetition




          Data Value Reuse               Data Value Speculation




 Instruction Level   Trace Level     Instruction Level   Trace Level


                     Instruction Level: applied to a SINGLE instruction
19
     Techniques to Speed Up Instruction Execution

                         Computation Repetition




          Data Value Reuse                Data Value Speculation




 Instruction Level     Trace Level    Instruction Level   Trace Level


                     Trace Level: applied to a GROUP of instructions
20
      Instruction Level Reuse (ILR)

          index -> Reuse Table (RCB)

          Fetch -> Decode & Rename -> OOO Execution -> Commit




      Contributions
          Performance potential of ILR
          Redundant Computation Buffer (RCB)
      Main results
          Ideal ILR speed-up of 1.5
          RCB speed-up of 1.1 (outperforms previous proposals)
22             Molina, González, Tubella, “Dynamic Removal of Redundant Computations”, ICS’99
           Trace Level Reuse (TLR)
                         I1, I2, I3, I4, I5, I6 (a sequence of instructions) = TRACE



      Contributions
          Trace Level Reuse
          Performance potential of TLR
          Initial design issues for integrating TLR

      Main results
          Ideal TLR speed-up of 3.6
          4K-entry table: 25% of reuse, average trace size of 6
23                     González, Tubella, Molina, “Trace-Level Reuse”, ICPP’99
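
A rough sketch of how reusing a whole trace could work: each stored trace records the values it read on entry (live inputs), the values it produced (live outputs), and where execution continues. The structure `TraceEntry`, the function `try_reuse`, and the register-name dictionaries are hypothetical; the slide only states that a 4K-entry table was evaluated, not how its entries are laid out.

```python
from dataclasses import dataclass


@dataclass
class TraceEntry:
    live_inputs: dict   # register name -> value required on entry
    live_outputs: dict  # register name -> value produced by the trace
    next_pc: int        # where execution resumes after the trace


def try_reuse(table, pc, registers):
    """Return the next PC if a stored trace starting at pc can be reused,
    updating registers with its live outputs; otherwise return None."""
    for entry in table.get(pc, []):
        if all(registers.get(r) == v for r, v in entry.live_inputs.items()):
            registers.update(entry.live_outputs)
            return entry.next_pc
    return None


# One made-up trace of six 4-byte instructions starting at 0x100.
table = {0x100: [TraceEntry({"r1": 4, "r2": 6}, {"r3": 10, "r4": 24}, 0x118)]}
regs = {"r1": 4, "r2": 6, "r3": 0, "r4": 0}
next_pc = try_reuse(table, 0x100, regs)
print(hex(next_pc), regs)
```

When the test passes, all six instructions are skipped at once; when any live input differs, the trace is executed normally.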
     Trace Level Speculation (TLS)
      Two orthogonal issues
          Control and Data Speculation Techniques: Static Analysis Based on Profiling Info
          Microarchitecture Support for Trace Speculation: TSMA
      Contributions
          Trace Level Speculative Multithreaded Architecture
          Compiler analysis to support TSMA
      Main results
          Speedup of 1.38 with 20% misspeculations
            Molina, González, Tubella, “Trace-Level Speculative Multithreaded Architecture (TSMA)”, ICCD’02
                        Molina, González, Tubella, “Compiler Analysis for TSMA”, INTERACT’05

24                 Molina, Tubella, González, “Reducing Misspeculation Penalty in TSMA”, ISHPC’05
            Objectives & Proposals

      To improve the memory system
          Redundant store instructions
          Non redundant data cache



      To speed-up the execution of instructions
          Redundant computation buffer (ILR)
          Trace-level reuse buffer (TLR)
          Trace-level speculative multithreaded architecture (TLS)

25
                           Motivation

      Caches occupy close to 50% of total die area

                         L1 Dcache   L1 Icache   L2 Cache   Total Area
          Pentium 4         2%          3%         20%         25%
          Mips R20k        23%         26%         none        54%
          Crusoe 5400      10%          9%         27%         46%
          Power 4           2%          1%         50%         53%
          Alpha 21364       4%          3%         36%         43%

      Caches are responsible for a significant part
       of the total power dissipated by a processor

27
              Data Value Repetition

      [Figure: plot with both axes running from 0 to 100; x-axis: percentage of time]



28   Spec CPU2000, 1 billion instructions, 256KB data cache
            Conventional Cache



                Tag X:  Value A = 1234, Value B = FFFF
                Tag Y:  Value C = 0000, Value D = 1234

      If (Value A == Value D) then Value Repetition
29
     Non Redundant Data Cache

       Pointer Table                  Value Table

       Tag X:  1234, FFFF             0000
       Tag Y:  0000, 1234             1234
                                      FFFF

                 Die Area Reduction

30
     Non Redundant Data Cache

       Pointer Table                          Value Table

       Tag X:  pointers to 1234 and FFFF      0000
       Tag Y:  pointers to 0000 and 1234      1234
                                              FFFF

           Additional Hardware: Pointers

31
     Non Redundant Data Cache

       Pointer Table                          Value Table (value, counter)

       Tag X:  pointers to 1234 and FFFF      0000      1
       Tag Y:  pointers to 0000 and 1234      1234      2
                                              FFFF      1

          Additional Hardware: Counters

32
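
The pointer table, value table, and counters from the preceding slides can be modelled in a few lines. This is a behavioural sketch only: the class `NonRedundantCache`, the dictionary `index_of` that stands in for the hardware value lookup, and the unbounded table sizes are assumptions, and nothing here reflects the real structure sizes or replacement policy.

```python
class NonRedundantCache:
    """Toy model: each distinct value is stored once in the value table,
    and cache words hold pointers to it. Hypothetical sketch."""

    def __init__(self):
        self.pointer_table = {}  # address -> index into value_table
        self.value_table = {}    # index -> [value, counter]
        self.index_of = {}       # value -> index (stands in for value lookup)
        self.next_index = 0

    def write(self, address, value):
        if address in self.pointer_table:
            self._release(self.pointer_table[address])
        index = self.index_of.get(value)
        if index is None:
            # Value not present yet: allocate a value-table entry.
            index = self.next_index
            self.next_index += 1
            self.value_table[index] = [value, 0]
            self.index_of[value] = index
        self.value_table[index][1] += 1  # one more pointer to this value
        self.pointer_table[address] = index

    def _release(self, index):
        entry = self.value_table[index]
        entry[1] -= 1
        if entry[1] == 0:
            # No pointers left: the value-table entry can be freed.
            del self.index_of[entry[0]]
            del self.value_table[index]

    def read(self, address):
        return self.value_table[self.pointer_table[address]][0]

    def counters(self):
        return {value: count for value, count in self.value_table.values()}


# Same contents as the slide: Tag X holds 1234 and FFFF, Tag Y holds 0000 and 1234.
nrc = NonRedundantCache()
for address, value in [("X0", 0x1234), ("X1", 0xFFFF), ("Y0", 0x0000), ("Y1", 0x1234)]:
    nrc.write(address, value)
print(nrc.counters())
```

Four cache words end up sharing three value-table entries, and the counter for 1234 is 2, which matches the counters shown on the slide.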
                Data Value Inlining

      Some values can be represented with a small
       number of bits (Narrow Values)
      Narrow values can be inlined into pointer area
      Simple sign extension is applied
      Benefits
          enlarges effective capacity of VT
          reduces latency
          reduces power dissipation


33
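
The narrow-value test and the sign extension can be sketched as follows. The widths are assumptions picked to match the four-hex-digit values drawn in the slides (16-bit words, 4-bit inlined field); the actual widths used in the thesis are not given here, and the helper names are hypothetical.

```python
WORD_BITS = 16   # assumption: matches the 4-hex-digit values in the slides
NARROW_BITS = 4  # assumption: width of the field inlined in the pointer area


def to_signed(value, bits):
    """Interpret the low `bits` of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def is_narrow(value):
    """True if the word can be rebuilt from NARROW_BITS by sign extension."""
    signed = to_signed(value, WORD_BITS)
    return -(1 << (NARROW_BITS - 1)) <= signed < (1 << (NARROW_BITS - 1))


def inline(value):
    """Keep only the low NARROW_BITS of a narrow value."""
    return value & ((1 << NARROW_BITS) - 1)


def expand(field):
    """Sign-extend an inlined field back to a full word."""
    return to_signed(field, NARROW_BITS) & ((1 << WORD_BITS) - 1)


for word in (0x0000, 0xFFFF, 0x1234):
    if is_narrow(word):
        print(f"{word:04X} inlined as {inline(word):X}, expands to {expand(inline(word)):04X}")
    else:
        print(f"{word:04X} needs a value table entry")
```

Under these assumed widths, 0000 and FFFF fit in the pointer area while 1234 still needs a value table entry.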
     Non Redundant Data Cache

       Pointer Table                                Value Table (value, counter)

       Tag X:  pointer to 1234, inlined F           0000      1
       Tag Y:  inlined 0, pointer to 1234           1234      2
                                                    FFFF      1

                  Data Value Inlining

34
                    Miss Rate vs Die Area
      [Figure: miss ratio (y-axis, 15% to 50%) versus die area in cm2 (x-axis) for L2 caches of 256KB, 512KB, 1MB, 2MB and 4MB]
      Configurations: CONV (100%), VT50 (50%), VT30 (30%), VT20 (20%)
35     Spec CPU2000, 1 billion instructions
                                 Results

      Caches ranging from 256 KB to 4 MB



                                           VT50   VT30   VT20
        Die Area Reduction                 32%    47%    55%
        Energy Consumption Reduction       14%    21%    27%
        Access Time Reduction              25%    33%    37%
        Number of Misses Increment          5%    12%    18%

        All values are arithmetic means (AMEAN)




36
          Trace Level Speculation

      Avoids serialization caused by data dependences
      Skips multiple instructions in a row
      Predicts values based on the past
      Solves live-input test
      Introduces penalties due to misspeculations




38
            Trace Level Speculation
      Two orthogonal issues
          microarchitecture support for trace speculation
          control and data speculation techniques
            – prediction of initial and final points
            – prediction of live output values

      Trace Level Speculative Multithreaded
       Architecture (TSMA)
          does not introduce significant misspeculation penalties

      Compiler Analysis
          based on static analysis that uses profiling data
39
     Trace Level Speculation with Live Output Test

      [Figure: instruction flow timeline of the two threads, ST and NST]
          Events: Live Output Update & Trace Speculation;
                  Miss Trace Speculation Detection & Recovery Actions
          Legend: INSTRUCTION EXECUTION, INSTRUCTION VALIDATION, INSTRUCTION SPECULATION
40
                 TSMA Block Diagram
      [Figure: block diagram with the following components]
          I Cache, Fetch Engine, Branch Predictor, Trace Speculation, Decode & Rename
          ST I Window, NST I Window
          ST Ld/St Queue, NST Ld/St Queue
          ST Reorder Buffer, NST Reorder Buffer
          ST Arch. Register File, NST Arch. Register File
          Functional Units, Look Ahead Buffer, Verification Engine
          L1SDC, L1NSDC, L2NSDC, Data Cache
41
                 Compiler Analysis

      Focuses on
          developing effective trace selection schemes for TSMA
          based on static analysis that uses profiling data

      Trace Selection
          Graph Construction (CFG & DDG)

          Graph Analysis




42
                     Graph Analysis

      Two important issues
          initial and final point of a trace
            – maximize trace length & minimize misspeculations

          predictability of live output values
            – prediction accuracy and utilization degree

      Three basic heuristics
          Procedure Trace Heuristic

          Loop Trace Heuristic

          Instruction Chaining Trace Heuristic
43
           Trace Speculation Engine

      Traces are communicated to the hardware
          at program loading time
          filling a special hardware structure (trace table)

      Each entry of the trace table contains
          initial PC
          final PC
          live-output values information
          branch history
          frequency counter

44
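
The entry fields listed on this slide map naturally onto a small record type. The sketch below also uses the 128-entry, 4-way set-associative organization given in the simulation parameters. Everything else is an assumption for illustration: the class and method names, indexing by the initial PC with 4-byte instructions, using the frequency counter to pick a victim, and the types of each field.

```python
from dataclasses import dataclass, field


@dataclass
class TraceTableEntry:
    initial_pc: int
    final_pc: int
    live_outputs: dict = field(default_factory=dict)  # e.g. register -> value
    branch_history: int = 0
    frequency: int = 0


class TraceTable:
    """Toy set-associative trace table (hypothetical organization)."""

    def __init__(self, entries=128, ways=4):
        self.ways = ways
        self.num_sets = entries // ways
        self.sets = [[] for _ in range(self.num_sets)]

    def _set_for(self, pc):
        # Assumes 4-byte instructions, so the two low PC bits are dropped.
        return self.sets[(pc >> 2) % self.num_sets]

    def insert(self, entry):
        ways = self._set_for(entry.initial_pc)
        if len(ways) >= self.ways:
            # Evict the least frequently used trace (an assumption).
            ways.remove(min(ways, key=lambda e: e.frequency))
        ways.append(entry)

    def lookup(self, pc):
        for entry in self._set_for(pc):
            if entry.initial_pc == pc:
                entry.frequency += 1
                return entry
        return None


table = TraceTable()
table.insert(TraceTableEntry(initial_pc=0x1000, final_pc=0x1040, live_outputs={"r3": 10}))
hit = table.lookup(0x1000)
miss = table.lookup(0x2000)
print(hit, miss)
```

A lookup with the current fetch address either returns a candidate trace to speculate on or nothing, in which case execution proceeds normally.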
             Simulation Parameters
      Base microarchitecture
          out of order machine, 4 instructions per cycle
          I cache: 16KB, D cache: 16KB, L2 shared: 256KB
          bimodal predictor
          64-entry ROB, FUs: 4 int, 2 div, 2 mul, 4 fps

      TSMA additional structures
          each thread: I window, reorder buffer, register file
          speculative data cache: 1KB
          trace table: 128 entries, 4-way set associative
          look ahead buffer: 128 entries
          verification engine: up to 8 instructions per cycle
45
                                       Speedup

     [Figure: chart of speedup (y-axis, 1.00 to 1.45); the x-axis lists the Spec CPU2000 benchmarks and their arithmetic mean (A_Mean)]
46   Spec CPU2000, 250 million instructions
                           Misspeculations

     [Figure: chart of misspeculations (y-axis, 0 to 100); the x-axis lists the Spec CPU2000 benchmarks and their arithmetic mean (A_Mean)]
47   Spec CPU2000, 250 million instructions
                      Conclusions
      Repetition is very common in programs
      Can be applied
          to improve the memory system
          to speed-up the execution of instructions

      Investigated several alternatives
          Novel cache organizations
          Instruction level reuse approach
          Trace level reuse concept
          Trace level speculation architecture

49
                  Future Work

      Value repetition in instruction caches
      Profiling to support data value reuse schemes
      Traces starting at different PCs
      Value prediction in TSMA
      Multiple speculations in TSMA
      Multiple threads in TSMA



50
                            Publications
      Value Repetition in Cache Organizations
           Reducing Memory Traffic Via Redundant Store Instructions, HPCN'99

           Non Redundant Data Cache, ISLPED'03

           Value Compression to Reduce Power in Data Caches, EUROPAR'03


      Instruction & Trace Level Reuse
           The Performance Potential of Data Value Reuse, TR-UPC-DAC'98

           Dynamic Removal of Redundant Computations, ICS'99

           Trace Level Reuse, ICPP'99


      Trace Level Speculation
           Trace-Level Speculative Multithreaded Architecture, ICCD'02

           Compiler Analysis for TSMA, INTERACT'05

           Reducing Misspeculation Penalty in TSMA, ISHPC'05
51

				