					 Register Bank Assignment For
Spatially Partitioned Processors

  Behnam Robatmili, Katherine E. Coons,
  Kathryn S. McKinley, and Doug Burger
                               Motivation
 • Spatially partitioned processors
     –   Technology scalable substrate
     –   Challenging compilation target
 • Partitioned register files
     –   Spill code
     –   Operand routing latency
     –   Bank and network link contention
 • Conflicting goals
     –   Reduce communication distances
     –   Avoid contention
     –   Avoid spills
                   Traditionally, spill costs take priority
            Now, spatial locality and contention are important
6/18/2013                                                        LCPC 2008
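                         Bank Allocation Example
 [Figure: variables v0-v3 are assigned to register banks B0-B2 (v0 and v1 in B0, v2 in B1, v3 in B2),
  and instructions i0 and i1 are placed on execution tiles among E0-E2. Numbered network links carry
  the flow of data between banks and tiles.]
 Legend: variables map to register banks; instructions map to execution tiles; data flows over network links.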
                           Outline
 • Motivation
 • Background
     – TRIPS
     – Compiling for TRIPS
     – Baseline Register Allocator

 • Bank Allocation Algorithm
 • Customizing for TRIPS
 • Results
 • Conclusions

            Register Allocation for EDGE ISAs
 • Block atomic execution
     –   Instruction groups fetch, execute, and commit atomically

 • Direct instruction communication
     –   Explicitly encode dataflow graph by specifying targets

 [Figure: RISC, with a centralized register file, compared with EDGE (labels B0, B1, B2).]
                        TRIPS Microarchitecture
 • TRIPS ISA
     –   Up to 128 instructions/block
     –   Instructions can be placed anywhere
 • TRIPS microarchitecture
     –   Up to 8 blocks in flight
     –   1 cycle latency per hop
 • TRIPS block constraints
     –   Max 128 instructions
     –   32 load and store instructions
     –   32 register reads or writes
     –   8 register reads/writes per bank

 Tile layout (single cycle communication latency between adjacent tiles):

              G     R0     R1     R2     R3      Register File
              D0    E0     E1     E2     E3
              D1    E4     E5     E6     E7
              D2    E8     E9    E10    E11
              D3   E12    E13    E14    E15
              (D0-D3: Data Cache)
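
As a rough illustration of why bank choice affects operand routing latency, the sketch below counts hops between tiles in the layout on this slide. The coordinate helpers (`register_tile`, `execution_tile`, `data_tile`) and the assumption that an operand follows a shortest Manhattan path, at one cycle per hop, are illustrative and not taken from the slides.

```python
# Tile coordinates are (row, column) in the 5x5 grid from the slide:
# row 0 holds G, R0..R3; column 0 of rows 1..4 holds D0..D3;
# the remaining 4x4 area holds E0..E15 in row-major order.

def register_tile(bank):
    """Position of register bank R<bank> (hypothetical helper)."""
    return (0, bank + 1)

def execution_tile(index):
    """Position of execution tile E<index> (hypothetical helper)."""
    return (1 + index // 4, 1 + index % 4)

def data_tile(bank):
    """Position of data cache bank D<bank> (hypothetical helper)."""
    return (1 + bank, 0)

def hops(src, dst):
    """Hop count on a shortest Manhattan path (assumed routing model)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

# Reading a value from R0 costs 1 hop to reach E0 but 7 hops to reach E15,
# while R3 reaches E15 in 4 hops.
near = hops(register_tile(0), execution_tile(0))
far = hops(register_tile(0), execution_tile(15))
better = hops(register_tile(3), execution_tile(15))
```

Under these assumptions, the same consumer instruction sees a latency difference of several cycles depending only on which bank holds its operand.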


              Compiling for TRIPS
 [Figure: the compilation flow from source code to a control flow graph (blocks B1-B4), to a dataflow
  graph for a block (register reads and writes around mul and add instructions), to the execution
  substrate, where static instruction placement maps each instruction to a tile.]
             TRIPS Compiler Back End
 • TRIPS block formation
     –   If-conversion
     –   Loop peeling
     –   While loop unrolling
     –   Instruction merging
     –   Predicate optimizations
 • Resource allocation
     –   Register allocation
     –   Reverse if-conversion & split
     –   Load/store ID assignment
     –   SSA for constant outputs
 • Scheduling
     –   Fanout insertion
     –   Instruction placement
     –   Target form generation
 • Constraints
     –   128 instructions
     –   32 load/store IDs
     –   32 reg. read/writes (4 banks, 8 per bank)
 • Output: TRIPS Assembly Language
                Baseline Register Allocator
 • Linear scan register allocator
 • Traverse variables using standard priority function (Chow &
   Hennessy ‘90):



 • For each variable, find all available architectural registers
 • For each candidate architectural register
     –   Check for live range conflicts
     –   Check max reads/writes per block constraint
 • Spill variable if no candidate meets criteria
 • If spill code invalidates blocks, split invalidated blocks and reallocate
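
The allocation loop in the list above can be sketched roughly as follows. This is a simplified illustration, not the TRIPS compiler's code: the `Variable` fields, the closed-interval live ranges, the precomputed `priority` value, and the mapping of consecutive register numbers to the same bank are all assumptions, and block splitting after spills is left out.

```python
from collections import defaultdict
from dataclasses import dataclass

MAX_ACCESSES_PER_BANK = 8  # per block, from the TRIPS block constraints

@dataclass(frozen=True)
class Variable:
    name: str
    priority: float                       # assumed to be precomputed
    start: int                            # live range as a closed interval
    end: int
    read_blocks: frozenset = frozenset()  # blocks that read the variable
    write_blocks: frozenset = frozenset() # blocks that write the variable

def overlaps(a, b):
    """True if the two live ranges conflict."""
    return a.start <= b.end and b.start <= a.end

def allocate(variables, num_registers=128, regs_per_bank=32):
    """Map each variable name to a register number, or None if spilled."""
    assignment = {}
    occupants = defaultdict(list)  # register -> variables placed in it
    reads = defaultdict(int)       # (block, bank) -> register reads so far
    writes = defaultdict(int)      # (block, bank) -> register writes so far
    for var in sorted(variables, key=lambda v: v.priority, reverse=True):
        assignment[var.name] = None  # spilled unless a candidate fits
        for reg in range(num_registers):
            bank = reg // regs_per_bank  # assumed register-to-bank mapping
            if any(overlaps(var, other) for other in occupants[reg]):
                continue  # live range conflict
            if any(reads[(blk, bank)] >= MAX_ACCESSES_PER_BANK
                   for blk in var.read_blocks):
                continue  # bank read limit reached in some block
            if any(writes[(blk, bank)] >= MAX_ACCESSES_PER_BANK
                   for blk in var.write_blocks):
                continue  # bank write limit reached in some block
            assignment[var.name] = reg
            occupants[reg].append(var)
            for blk in var.read_blocks:
                reads[(blk, bank)] += 1
            for blk in var.write_blocks:
                writes[(blk, bank)] += 1
            break
    return assignment

demo = [Variable("a", 2.0, 0, 10), Variable("b", 1.0, 5, 8),
        Variable("c", 0.5, 11, 12)]
result = allocate(demo, num_registers=1)  # {"a": 0, "b": None, "c": 0}
```

With a single register, `b` conflicts with the higher-priority `a` and is spilled, while `c` reuses the register once `a` is dead.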



               Register Dependence Graph
 •       First introduced by Hiser et al. (HCSB ‘00)
 •       Nodes represent variables
 •       Edge weights indicate affinity between variables
 •       Use RDG to optimize the critical path
     –      Use ideal schedule to estimate execution time
     –      Estimate arrival time of instruction inputs
     –      Set edge weights based on differences between arrival times
            to instructions in critical path




             Register Dependence Graph

 Intermediate representation:

        mul t0,vr0,vr1
        not t1,t0
        not t2,vr2
        add t3,vr1,t2
        sub t4,t1,t3

 [Figure: the dataflow dependence graph for this code, annotated with the ideal schedule
  (inputs vr0, vr1, vr2 arrive at time 1; t0 at 4, t2 at 2, t1 at 5, t3 at 3, t4 at 6),
  and the resulting register dependence graph over vr0, vr1, and vr2 (edge weights 0 and 2 shown).]

               Bank Assignment Algorithm
 • Traverse variables in priority order:



 • For every variable
     –   Find cost for placing it in each bank
     –   Choose bank with minimum cost
     –   Allocate variable to a register in that bank
 • Bank cost
     –   Number of variables already allocated to that bank
     –   Weights of edges in the RDG
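
A minimal sketch of this greedy loop, assuming the variables arrive already sorted by priority and that the cost function is supplied by the caller. The names `assign_banks` and `occupancy_cost` are illustrative; choosing a register inside the bank and enforcing bank capacity are omitted, and ties here simply go to the lowest-numbered bank.

```python
def assign_banks(variables, num_banks, bank_cost):
    """Greedy bank assignment.

    variables: names in priority order (highest priority first).
    bank_cost: callable (vr, bank, assignment) -> number; lower is better.
    Returns a dict mapping each variable to its chosen bank.
    """
    assignment = {}
    for vr in variables:
        costs = [bank_cost(vr, bank, assignment) for bank in range(num_banks)]
        assignment[vr] = costs.index(min(costs))  # first minimum wins ties
    return assignment

def occupancy_cost(vr, bank, assignment):
    """Example cost: only the number of variables already in the bank."""
    return sum(1 for b in assignment.values() if b == bank)

banks = assign_banks(["v0", "v1", "v2", "v3", "v4"], 4, occupancy_cost)
# With occupancy as the only cost, variables spread evenly:
# v0 -> 0, v1 -> 1, v2 -> 2, v3 -> 3, v4 -> 0
```

Passing the cost function as a parameter keeps the loop unchanged when RDG edge weights are added to the cost.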




                    Bank Score Evaluation
 • Evaluation function
     – Bank utilization
     – Dependencies among variables


     CalculateBankCost (vr, bank)
         return CalculateDependenceCost(vr, bank) + bank.numAssignedVR

     CalculateDependenceCost (vr, bank)
         cost = 0
         for each nvr RDG neighbor of vr assigned to NeighborBankSet(bank)
             cost = cost + RDGWeight(vr, nvr)
         return cost
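
The pseudocode above translates fairly directly into Python. In this sketch the RDG is a nested dict of edge weights and the current assignment is a dict from variable to bank; both representations are assumptions. The slides do not define `NeighborBankSet`, so `adjacent_banks` below is a hypothetical stand-in that returns the banks directly to the left and right, and the edge weights in the example are illustrative.

```python
def adjacent_banks(bank, num_banks=4):
    """Hypothetical NeighborBankSet: the banks next to `bank`."""
    return {b for b in (bank - 1, bank + 1) if 0 <= b < num_banks}

def calculate_dependence_cost(vr, bank, rdg, assignment,
                              neighbor_bank_set=adjacent_banks):
    """Sum RDG weights to neighbors already placed in the neighbor bank set."""
    cost = 0
    nearby = neighbor_bank_set(bank)
    for nvr, weight in rdg.get(vr, {}).items():
        if assignment.get(nvr) in nearby:
            cost += weight
    return cost

def calculate_bank_cost(vr, bank, rdg, assignment,
                        neighbor_bank_set=adjacent_banks):
    """Dependence cost plus the number of variables already in `bank`."""
    num_assigned = sum(1 for b in assignment.values() if b == bank)
    return (calculate_dependence_cost(vr, bank, rdg, assignment,
                                      neighbor_bank_set) + num_assigned)

# Illustrative RDG: vr1 has edges to vr0 (weight 0) and vr2 (weight 2).
rdg = {"vr1": {"vr0": 0, "vr2": 2}}
placed = {"vr0": 0, "vr2": 2}
costs = [calculate_bank_cost("vr1", bank, rdg, placed) for bank in range(4)]
# costs == [1, 2, 1, 2]
```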



                     Customizing for TRIPS
 • Fewer register/data cache banks than execution tiles
     –   Heavy traffic between registers and execution tiles
     –   Heavy traffic between data cache and execution tiles
 • Cost function should separate data cache traffic

      TieBreaker (vr, bank1, bank2)
          if (vr.affectedCriticalLoads + vr.affectedCriticalStores > 0)
              return min(bank1, bank2)
          else
              return max(bank1, bank2)

 [Figure: register file banks B0-B3 and the data cache banks on the tile array.]
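
A small sketch of how this tie-breaker might plug into minimum-cost bank selection. `tie_breaker` follows the pseudocode above; the `VirtualRegister` class and the `pick_bank` helper are hypothetical and only show one plausible way to apply it when several banks share the lowest cost.

```python
from dataclasses import dataclass

@dataclass
class VirtualRegister:
    name: str
    affected_critical_loads: int = 0
    affected_critical_stores: int = 0

def tie_breaker(vr, bank1, bank2):
    """Prefer the lower-numbered bank for variables that feed critical memory ops."""
    if vr.affected_critical_loads + vr.affected_critical_stores > 0:
        return min(bank1, bank2)
    return max(bank1, bank2)

def pick_bank(vr, costs):
    """Return the minimum-cost bank, resolving equal costs with tie_breaker.

    `costs[i]` is the cost of placing `vr` in bank i (hypothetical helper).
    """
    best = 0
    for bank in range(1, len(costs)):
        if costs[bank] < costs[best]:
            best = bank
        elif costs[bank] == costs[best]:
            best = tie_breaker(vr, best, bank)
    return best

memory_vr = VirtualRegister("v0", affected_critical_loads=1)
compute_vr = VirtualRegister("v1")
memory_choice = pick_bank(memory_vr, [3, 1, 1, 2])    # banks 1 and 2 tie -> 1
compute_choice = pick_bank(compute_vr, [3, 1, 1, 2])  # banks 1 and 2 tie -> 2
```

The effect is that, among equally good banks, variables tied to critical loads or stores drift toward one end of the register file and the rest toward the other, which keeps the two kinds of traffic apart.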




                    Implemented Allocator
 • Bank Oblivious
     –   Always assign the next available register
     –   Fills each bank before switching to the next bank
 • Round Robin
     –   Selects banks in a round robin fashion
 • HCSB
     –   Places dependent variables close together
     –   No ideal schedule
 • Spatial
     –   Uses ideal schedule to reason about critical path
     –   Customized bank assignment algorithm for TRIPS
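
The two simplest policies differ only in the order in which they hand out registers. The sketch below models a register as a hypothetical `(bank, slot)` pair and assumes 4 banks of 32 registers each (128 registers total); it ignores live ranges and register reuse, so it shows only the bank pattern each policy produces.

```python
def bank_oblivious_order(num_banks=4, regs_per_bank=32):
    """Next available register: fills one bank before moving to the next."""
    return [(bank, slot)
            for bank in range(num_banks)
            for slot in range(regs_per_bank)]

def round_robin_order(num_banks=4, regs_per_bank=32):
    """Cycles through the banks, taking one register from each in turn."""
    return [(bank, slot)
            for slot in range(regs_per_bank)
            for bank in range(num_banks)]

# Banks used by the first five allocations under each policy.
oblivious_banks = [bank for bank, _ in bank_oblivious_order()[:5]]  # [0, 0, 0, 0, 0]
round_robin_banks = [bank for bank, _ in round_robin_order()[:5]]   # [0, 1, 2, 3, 0]
```

Neither order looks at which instructions consume a variable, which is the gap the HCSB and Spatial allocators address.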



                          Spill Code Size

     Program    Benchmark suite    Bank oblivious    Round robin    HCSB    Spatial
     a2time     EEMBC              111               111            30      31
     applu      SPEC               528               514            365     382
     apsi       SPEC               328               220            183     183
     equake     SPEC               30                30             10      10
     mgrid      SPEC               44                21             8       12


 • Remaining benchmarks never spill
     –   TRIPS has 128 registers
     –   Register communication converted to intra-block temporaries



                   EEMBC Results
 [Chart not reproduced; surviving data labels: 1.33, 1.39]
 • Average 5% improvement




                 Sample Spatial Allocations
 [Figure: bank allocations chosen by Spatial and by HCSB for fbital; variables v0, v1, and v2
  feed a store (st) and an add (+).]
 Separate memory traffic
            SPEC Results
 [Chart not reproduced; surviving data labels: 1.22, 1.22, 1.23]
 • Average 5% improvement




                              Conclusions
 • Spatial locality among registers matters
 • Register dependence graph can help
     –   Avoids spilling critical registers
     –   Flexible tool to incorporate locality information
 • Modeling the topology is important
     –   Non-uniform distribution of registers/L1 cache banks
     –   Separate different types of traffic
 • EDGE ISA eases burden on register allocator
     –   Spills are rare
     –   Spatial locality and contention become first-order constraints




            Questions?



