Microarchitectural Wire Management for Performance and Power in Partitioned Architectures




                                         Rajeev Balasubramonian
                                          Naveen Muralimanohar
                                                  Karthik Ramani
                                  Venkatanand Venkatachalapathy



University of Utah
February 14th 2005




Overview/Motivation
 Wire delays are costly for performance and power
      Latencies of 30 cycles to reach ends of a chip
      50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)
 Abundant number of metal layers




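Wire Characteristics
 Wire resistance and capacitance per unit length

$$R_{wire} = \frac{\rho}{(\text{thickness} - \text{barrier})\,(\text{width} - 2 \cdot \text{barrier})}$$

$$C_{wire} = \epsilon_0 \left( 2 K \epsilon_{horiz} \frac{\text{thickness}}{\text{spacing}} + 2 \epsilon_{vert} \frac{\text{width}}{\text{layerspacing}} \right) + \text{fringe}(\epsilon_{horiz}, \epsilon_{vert})$$

 Increasing width and spacing lowers delay (as delay ∝ RC) but also lowers bandwidth, since fewer wires fit in the same metal area
 [Figure: wire cross-section with width and spacing marked; annotations for resistance, capacitance, and bandwidth]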
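
The two per-unit-length formulas on this slide can be evaluated numerically. The following is a minimal sketch, not the authors' model: every dimension and material constant below is a hypothetical value chosen for illustration, and the fringe term is passed in as a plain number because the slide does not define it.

```python
EPS0 = 8.854e-12  # vacuum permittivity in F/m


def wire_resistance(rho, thickness, width, barrier):
    """R_wire per unit length: rho / ((thickness - barrier) * (width - 2*barrier))."""
    return rho / ((thickness - barrier) * (width - 2 * barrier))


def wire_capacitance(k, eps_horiz, eps_vert, thickness, width,
                     spacing, layer_spacing, fringe=0.0):
    """C_wire per unit length; the fringe term is supplied by the caller."""
    side = 2 * k * eps_horiz * thickness / spacing
    vertical = 2 * eps_vert * width / layer_spacing
    return EPS0 * (side + vertical) + fringe


def rc_product(thickness, width, spacing):
    # All constants here are hypothetical, for illustration only.
    r = wire_resistance(rho=2.2e-8, thickness=thickness,
                        width=width, barrier=0.01e-6)
    c = wire_capacitance(k=1.5, eps_horiz=2.7, eps_vert=3.9,
                         thickness=thickness, width=width,
                         spacing=spacing, layer_spacing=0.5e-6)
    return r * c


# Hypothetical geometries in metres: a narrow wire and one with
# doubled width and spacing.
rc_base = rc_product(thickness=0.5e-6, width=0.2e-6, spacing=0.2e-6)
rc_wide = rc_product(thickness=0.5e-6, width=0.4e-6, spacing=0.4e-6)
print(rc_wide < rc_base)
```

With these assumed numbers the wider, more widely spaced wire has the smaller RC product, which is the trend the slide describes.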




Design Space Exploration
 Tuning wire width and spacing
 [Figure: wires drawn at spacings d and 2d; labels: B wires, L wires, resistance, capacitance, bandwidth]





Transmission Lines
 Allow extremely low delay
   High implementation complexity and overhead!
       Large width
       Large spacing between wires
       Design of sensing circuit
       Shielding power and ground lines adjacent to each line
 Implemented in test CMOS chips
 Not employed in this study




Design Space Exploration
 Tuning repeater size and spacing
      Traditional wires: large repeaters, optimum spacing
      Power-optimal wires: smaller repeaters, increased spacing
 [Figure: delay and power for the two repeater configurations]




Design Space Exploration
 B wires: base case
 W wires: bandwidth optimized
 P wires: power optimized
 PW wires: power and bandwidth optimized
 L wires: fast, low bandwidth







Outline
 Overview
 Wire Design Space Exploration
 Employing L wires for Performance
 PW wires: The Power Optimizers
 Results
 Conclusions





Evaluation Platform
 Centralized front-end
      I-Cache & D-Cache
      LSQ
      Branch Predictor
 Clustered back-end
 [Figure: clusters and the L1 D cache]






Cache Pipeline
 [Figure: load path from the functional unit through the LSQ to the L1 D cache; recovered labels follow]
 Without L wires
      Eff. address transfer: 10c
      Mem. dep. resolution: 5c
      Cache access: 5c
      Data return at 20c
 With L wires
      8-bit transfer: 5c
      Partial mem. dep. resolution: 3c
      Cache access: 5c
      Data return at 14c






L wires: Accelerating cache access
 Transmit the least significant bits (LSBs) of the effective address through L wires
      Faster memory disambiguation
           Partial comparison of loads and stores in LSQ
           Introduces false dependences (< 9%)
      Indexing data and tag RAM arrays
           LSBs can prefetch data out of L1$
 Reduce access latency of loads
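
The partial comparison in the LSQ can be illustrated with a small sketch. It assumes 8 low-order bits arrive early (the Cache Pipeline slide mentions an 8-bit transfer) and ignores access size and alignment; the function names are hypothetical, not taken from the authors' simulator.

```python
def partial_match(load_addr, store_addr, lsb_bits=8):
    """Compare only the low-order bits that arrive early on the L wires."""
    mask = (1 << lsb_bits) - 1
    return (load_addr & mask) == (store_addr & mask)


def classify(load_addr, store_addr, lsb_bits=8):
    """Return 'independent', 'true dependence', or 'false dependence'."""
    if not partial_match(load_addr, store_addr, lsb_bits):
        # Differing LSBs prove the full addresses differ, so the load
        # can be disambiguated before the full address arrives.
        return "independent"
    if load_addr == store_addr:
        return "true dependence"
    # LSBs match but the upper bits differ: the partial check was
    # conservative and reported a dependence that does not exist.
    return "false dependence"


print(classify(0x1234, 0x5678))
print(classify(0x1234, 0x5634))
print(classify(0x1234, 0x1234))
```

A mismatch on the early bits is always safe to act on, while a match must wait for the full address; the false dependences reported on the slide correspond to the second call above.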




L wires: Narrow Bit Width Operands
 PowerPC: data bit-width determines FU latency
 Transfer of 10-bit integers on L wires
      Can introduce scheduling difficulties
      A predictor table of saturating counters
          Accuracy of 98%
 Reduction in branch mispredict penalty
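
A predictor table of saturating counters for narrow results might look like the sketch below. The slide gives no table size, counter width, or indexing scheme, so the 1024 entries, 2-bit counters, PC-modulo indexing, and the treatment of "10-bit" as the unsigned range 0 to 1023 are all assumptions made for illustration.

```python
class NarrowWidthPredictor:
    """Table of 2-bit saturating counters indexed by instruction PC.

    Table size, counter width, and indexing are assumptions, not
    parameters taken from the slides.
    """

    def __init__(self, entries=1024, narrow_bits=10):
        self.entries = entries
        self.limit = 1 << narrow_bits      # values in [0, limit) count as narrow
        self.counters = [0] * entries      # each counter saturates at 0 and 3

    def _index(self, pc):
        return pc % self.entries

    def predict_narrow(self, pc):
        # Predict "narrow" when the counter is in its upper half.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, result):
        i = self._index(pc)
        if 0 <= result < self.limit:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = NarrowWidthPredictor()
predictor.update(0x40, 5)
predictor.update(0x40, 7)
print(predictor.predict_narrow(0x40))
```

A prediction lets the scheduler plan for the faster L-wire transfer ahead of time; a misprediction has to be handled by whatever recovery the pipeline provides, which this sketch does not model.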




Power Efficient Wires
 Idea: steer non-critical data through the energy-efficient PW interconnect
 [Figure: base case B wires versus power- and bandwidth-optimized PW wires]





PW wires: Power/Bandwidth Efficient
 Ready register operands
      Transfer of data at instruction dispatch
      Transfer of input operands to remote register file
      Covered by long dispatch-to-issue latency
 Store data
      Could stall commit process
      Delay dependent loads
 [Figure: Rename & Dispatch feeding four clusters, each with Regfile, IQ, and FU; operand is ready at cycle 90, consumer instruction dispatched at cycle 100]
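
The steering idea can be summarised as a small decision function. This is an illustrative policy under stated assumptions, not the authors' hardware logic: the `Transfer` type, its field names, and the rule ordering are hypothetical, and the caveats the slide lists for store data (possible commit stalls, delayed dependent loads) are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    kind: str                        # "operand", "store_data", or "address_lsb"
    ready_at_dispatch: bool = False  # operand already available when consumer dispatches


def choose_wires(t):
    """Pick a wire class for an inter-cluster transfer (illustrative only)."""
    if t.kind == "address_lsb":
        return "L"    # few bits, latency-critical
    if t.kind == "operand" and t.ready_at_dispatch:
        return "PW"   # extra latency hidden by the dispatch-to-issue delay
    if t.kind == "store_data":
        return "PW"   # treated as non-critical in this sketch
    return "B"        # everything else stays on the baseline wires


# Operand ready at cycle 90, consumer dispatched at cycle 100, as in the figure.
print(choose_wires(Transfer("operand", ready_at_dispatch=90 <= 100)))
```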









Evaluation Methodology
 SimpleScalar-3.0 augmented to simulate a dynamically scheduled 4-cluster model
 Crossbar interconnects (L, B and PW wires)
      B wires (2 cycles)
      L wires (1 cycle)
      PW wires (3 cycles)
 [Figure: clusters and the L1 D cache]







Heterogeneous Interconnects
 Inter-cluster global interconnect
      72 B wires (64 data bits and 8 control bits)
           Repeaters sized and spaced for optimum delay
      18 L wires
           Wide wires and large spacing
           Occupy more area
           Low latency
      144 PW wires
           Poor delay
           High bandwidth
           Low power







Analytical Model
$$C = C_a + W_s C_b + \frac{C_c}{S}$$

 Term 1 ($C_a$): fringing capacitance
 Term 2 ($W_s C_b$): capacitance between adjacent metal layers
 Term 3 ($C_c / S$): capacitance between adjacent wires
 [Figure: RC model of the wire]

 Total Power = Short-Circuit Power + Switching Power + Leakage Power
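
The three-term capacitance model and the power sum can be written as two short functions. This sketch assumes that W_s is the wire width and S the spacing between adjacent wires, which matches the term descriptions but is not stated on the slide; the coefficient values in the example are hypothetical and unitless.

```python
def model_capacitance(c_a, c_b, c_c, width, spacing):
    """C = Ca + Ws*Cb + Cc/S, with Ws read as width and S as spacing (assumed)."""
    return c_a + width * c_b + c_c / spacing


def total_power(short_circuit, switching, leakage):
    """Total power as the sum of the three components named on the slide."""
    return short_circuit + switching + leakage


# Hypothetical coefficients: doubling the spacing halves only the third term.
c_tight = model_capacitance(1.0, 2.0, 3.0, width=1.0, spacing=1.0)
c_loose = model_capacitance(1.0, 2.0, 3.0, width=1.0, spacing=2.0)
print(c_tight, c_loose)
```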





Evaluation Methodology
 SimpleScalar-3.0 augmented to simulate a dynamically scheduled 16-cluster model
 Ring latencies
      B wires (4 cycles)
      PW wires (6 cycles)
      L wires (2 cycles)
 [Figure: 16-cluster organization; labels: I-Cache, D-cache, LSQ, Cluster, crossbar, ring interconnect]






IPC improvements: L wires




 L wires improve performance by 4.2% on a four-cluster system and 7.1% on a sixteen-cluster system




Four Cluster System: ED2 Improvements
Link            Relative     IPC     Relative processor   Relative ED2   Relative ED2
                metal area           energy (10%)         (10%)          (20%)
144 B           1.0          0.95    100                  100            100
288 PW          1.0          0.92    97                   103.4          100.2
144 PW, 36 L    1.5          0.96    97                   95.0           92.1
288 B           2.0          0.98    103                  96.6           99.2
288 PW, 36 L    2.0          0.97    99                   94.4           93.2
144 B, 36 L     2.0          0.99    101                  93.3           94.5
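
The relative metal area column can be reproduced from the wire counts if each wire class is given a fixed area cost. The per-wire ratios below (one L wire costing as much as four B wires, one PW wire as much as half a B wire) are an assumption chosen because they match the column; the slide does not state them.

```python
# Area per wire relative to one B wire (assumed, not stated on the slide).
AREA_PER_WIRE = {"B": 1.0, "PW": 0.5, "L": 4.0}
BASELINE_AREA = 144 * AREA_PER_WIRE["B"]  # the 144 B row is the reference


def relative_metal_area(link):
    """link maps wire type to wire count, e.g. {"PW": 144, "L": 36}."""
    area = sum(count * AREA_PER_WIRE[kind] for kind, count in link.items())
    return area / BASELINE_AREA


links = [
    {"B": 144},
    {"PW": 288},
    {"PW": 144, "L": 36},
    {"B": 288},
    {"PW": 288, "L": 36},
    {"B": 144, "L": 36},
]
areas = [relative_metal_area(link) for link in links]
print(areas)  # [1.0, 1.0, 1.5, 2.0, 2.0, 2.0]
```

Under these ratios, 72 B wires, 18 L wires, and 144 PW wires all cost the same area, so the configurations in the table trade wire classes against each other within a metal budget of 1.0, 1.5, or 2.0.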




Sixteen Cluster System: ED2 gains
Link            IPC     Relative processor   Relative ED2
                        energy (20%)         (20%)
144 B           1.11    100                  100
144 PW, 36 L    1.05    94                   105.3
288 B           1.18    105                  93.1
144 B, 36 L     1.19    102                  88.7
288 B, 36 L     1.22    107                  88.7
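
The relative ED2 column can be approximately recomputed from the other two columns. The sketch assumes ED2 means energy times delay squared and that delay scales as 1/IPC; because the IPC and energy values in the table are rounded, the results agree with the printed column only to within a few tenths.

```python
def relative_ed2(ipc, energy, base_ipc, base_energy=100.0):
    """Relative ED^2 in percent, taking delay as 1/IPC (an assumption)."""
    return energy * (base_ipc / ipc) ** 2 * 100.0 / base_energy


# (IPC, relative processor energy) from the sixteen-cluster table;
# the 144 B row (IPC 1.11, energy 100) is the baseline.
rows = {
    "144 PW, 36 L": (1.05, 94),
    "288 B": (1.18, 105),
    "144 B, 36 L": (1.19, 102),
    "288 B, 36 L": (1.22, 107),
}
for link, (ipc, energy) in rows.items():
    print(link, round(relative_ed2(ipc, energy, base_ipc=1.11), 1))
```

The quadratic weight on delay is why the 144 PW, 36 L configuration ends up above 100 despite the lowest energy in the table: its lower IPC outweighs the energy saving.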




Conclusions
 Exposing the wire design space to the architecture
 A case for microarchitectural wire management!
 A low-latency, low-bandwidth network alone helps improve performance by up to 7%
 ED2 improvements of about 11% compared to a baseline processor with homogeneous interconnect
 Entails hardware complexity





Future work
 3-D wire model for the interconnects
 Design of heterogeneous clusters
 Interconnects for cache coherence and L2$








Questions and Comments?



  Thank you!



