Microarchitectural Techniques to Reduce Interconnect Power in

Microarchitectural Wire Management for Performance and Power in Partitioned Architectures

Rajeev Balasubramonian
Naveen Muralimanohar
Karthik Ramani
Venkatanand Venkatachalapathy

University of Utah
February 14th 2005




Overview/Motivation
• Wire delays are costly for performance and power
    - Latencies of 30 cycles to reach the ends of a chip
    - 50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)
• Abundant number of metal layers




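Wire Characteristics
• Wire resistance and capacitance per unit length

$$R_{wire} = \frac{\rho}{(\text{thickness} - \text{barrier})\,(\text{width} - 2\,\text{barrier})}$$

$$C_{wire} = \epsilon_0 \left( 2K\,\epsilon_{horiz}\,\frac{\text{thickness}}{\text{spacing}} + 2\,\epsilon_{vert}\,\frac{\text{width}}{\text{layerspacing}} \right) + \text{fringe}(\epsilon_{horiz}, \epsilon_{vert})$$

• Width ↑: resistance ↓, capacitance ↑, bandwidth ↓
• Spacing ↑: capacitance ↓, bandwidth ↓
• (Width & spacing) ↑ ⇒ delay ↓ (as delay ∝ RC), bandwidth ↓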
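
The two per-unit-length expressions on this slide can be checked numerically with a minimal Python sketch. Every numeric value here (resistivity, dimensions, dielectric constants, the K factor) is an illustrative assumption, not a figure from the talk. The fringe term is passed in as a precomputed constant rather than evaluated as a function of the dielectrics.

```python
EPS0 = 8.854e-12  # vacuum permittivity, F/m

def wire_resistance(rho, thickness, width, barrier):
    """Resistance per unit length: rho / ((thickness - barrier) * (width - 2*barrier))."""
    return rho / ((thickness - barrier) * (width - 2.0 * barrier))

def wire_capacitance(k, eps_horiz, eps_vert, thickness, width,
                     spacing, layer_spacing, fringe):
    """Capacitance per unit length: sidewall term + inter-layer term + fringe."""
    horiz = 2.0 * k * eps_horiz * thickness / spacing    # to neighbouring wires
    vert = 2.0 * eps_vert * width / layer_spacing        # to adjacent metal layers
    return EPS0 * (horiz + vert) + fringe

# Hypothetical process parameters, chosen only to exercise the formulas.
RHO = 2.2e-8        # ohm * m
BARRIER = 10e-9     # m
PARAMS = dict(k=1.5, eps_horiz=2.7, eps_vert=2.7,
              thickness=300e-9, layer_spacing=300e-9, fringe=4e-11)

def rc_product(width, spacing):
    """R*C per unit length squared; wire delay is taken as proportional to this."""
    r = wire_resistance(RHO, PARAMS["thickness"], width, BARRIER)
    c = wire_capacitance(width=width, spacing=spacing, **PARAMS)
    return r * c

base = rc_product(width=150e-9, spacing=150e-9)
wide = rc_product(width=300e-9, spacing=300e-9)
print(f"RC ratio (2x width and spacing / base): {wide / base:.2f}")
```

With these assumed values, doubling both width and spacing lowers the RC product, which matches the slide's claim that wider, more widely spaced wires trade bandwidth for lower delay.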




Design Space Exploration
• Tuning wire width and spacing

[Figure: B wires (dimension d) versus L wires (dimension 2d), compared on resistance, capacitance, and bandwidth]





Transmission Lines
• Allow extremely low delay
    - High implementation complexity and overhead!
        * Large width
        * Large spacing between wires
        * Design of sensing circuit
        * Shielding power and ground lines adjacent to each line
• Implemented in test CMOS chips
• Not employed in this study




Design Space Exploration
• Tuning repeater size and spacing
    - Traditional wires: large repeaters, optimum spacing
    - Power-optimal wires: smaller repeaters, increased spacing

[Figure: delay versus power for the two repeater configurations]




Design Space Exploration

Wire type   Description
B wires     Base case
W wires     Bandwidth optimized
P wires     Power optimized
PW wires    Power and B/W optimized
L wires     Fast, low bandwidth







Outline
• Overview
• Wire Design Space Exploration
• Employing L wires for Performance
• PW wires: The Power Optimizers
• Results
• Conclusions





Evaluation Platform
• Centralized front-end
    - I-Cache & D-Cache
    - LSQ
    - Branch Predictor
• Clustered back-end

[Figure: L1 D cache and clusters]






Cache Pipeline

[Figure: load pipeline from the functional unit through the LSQ to the L1 D cache]
• Baseline: effective address transfer 10c, memory dependence resolution 5c, cache access 5c, data return at 20c
• With L wires: 8-bit transfer 5c, partial memory dependence resolution 3c, cache access 5c, data return at 14c






L wires: Accelerating cache access
• Transmit the least significant bits (LSBs) of the effective address through L wires
    - Faster memory disambiguation
        * Partial comparison of loads and stores in LSQ
        * Introduces false dependences (< 9%)
    - Indexing data and tag RAM arrays
        * LSBs can prefetch data out of L1$
• Reduce access latency of loads
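
The partial comparison in the LSQ can be sketched as follows. This is an illustration, not the authors' hardware logic: the 8-bit width is borrowed from the "8-bit Transfer" label on the Cache Pipeline slide, the function names are hypothetical, and access sizes and alignment are ignored.

```python
LSB_BITS = 8  # assumed number of low-order address bits sent on L wires

def partial_match(load_addr, store_addr, lsb_bits=LSB_BITS):
    """Compare only the low-order bits that arrive early on L wires."""
    mask = (1 << lsb_bits) - 1
    return (load_addr & mask) == (store_addr & mask)

def classify(load_addr, store_addr, lsb_bits=LSB_BITS):
    """Label a load/store pair by what the partial and full comparisons say."""
    if not partial_match(load_addr, store_addr, lsb_bits):
        # Differing low bits guarantee different addresses: the load can proceed early.
        return "independent"
    if load_addr == store_addr:
        return "true dependence"
    # Low bits match but the full addresses differ: the load waits needlessly.
    return "false dependence"

for store in (0x1044, 0x1040, 0x2040):
    print(hex(store), classify(0x1040, store))
```

A mismatch in the low bits is always safe to act on. Only a match can be wrong, which is where the false dependences come from.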




L wires: Narrow Bit Width Operands
• PowerPC: data bit-width determines FU latency
• Transfer of 10-bit integers on L wires
    - Can introduce scheduling difficulties
    - A predictor table of saturating counters
        * Accuracy of 98%
• Reduction in branch mispredict penalty
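
A predictor table of saturating counters for narrow results might look like the sketch below. The table size, the 2-bit counter width, the PC-based indexing, and the unsigned reading of "10-bit integer" are all assumptions for illustration; the slides give only the 10-bit width and the 98% accuracy.

```python
class NarrowWidthPredictor:
    """PC-indexed table of saturating counters predicting narrow results."""

    def __init__(self, entries=1024, counter_bits=2, narrow_bits=10):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1
        self.threshold = (self.max_count + 1) // 2   # predict narrow at or above this
        self.narrow_limit = 1 << narrow_bits         # values below this fit in narrow_bits
        self.table = [0] * entries

    def _index(self, pc):
        # Drop the two low bits, assuming 4-byte aligned instructions.
        return (pc >> 2) % self.entries

    def predict_narrow(self, pc):
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc, result):
        i = self._index(pc)
        if 0 <= result < self.narrow_limit:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)

predictor = NarrowWidthPredictor()
for value in (3, 17, 900):
    predictor.update(0x400, value)
print(predictor.predict_narrow(0x400))
```

A prediction of "narrow" would let the scheduler plan for an L-wire transfer before the result value is known, which addresses the scheduling difficulty noted on the slide.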




Power Efficient Wires
• Idea: steer non-critical data through the energy-efficient PW interconnect

[Figure: base-case B wires versus power and B/W optimized PW wires]





PW wires: Power/Bandwidth Efficient
• Ready register operands
    - Transfer of data at instruction dispatch
    - Transfer of input operands to remote register file
    - Covered by long dispatch-to-issue latency
• Store data
    - Could stall commit process
    - Delay dependent loads

[Figure: Rename & Dispatch feeding four clusters, each with Regfile, IQ, and FU; example: operand is ready at cycle 90, consumer instruction dispatched at cycle 100]
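
One way to read the steering idea as code is sketched below. The function name, the transfer categories, and the fallback to B wires are assumptions for illustration; the slides name the two candidate data types but do not specify the selection logic.

```python
def choose_wires(kind, ready_at_dispatch=False):
    """Pick a wire class for an inter-cluster transfer (hypothetical policy).

    kind: "operand" or "store_data"
    ready_at_dispatch: True if the operand value already exists when the
        consumer instruction is dispatched.
    """
    if kind == "store_data":
        # Off the critical path in the common case, though the slide notes it
        # could stall commit or delay dependent loads.
        return "PW"
    if kind == "operand" and ready_at_dispatch:
        # The extra PW latency is assumed to hide under the dispatch-to-issue delay.
        return "PW"
    return "B"

print(choose_wires("operand", ready_at_dispatch=True))
print(choose_wires("operand", ready_at_dispatch=False))
print(choose_wires("store_data"))
```

In the figure's example the operand is ready at cycle 90 and the consumer is dispatched at cycle 100, so the value already exists at dispatch and would take the PW path under this policy.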




Outline
• Overview
• Wire Design Space Exploration
• Employing L wires for Performance
• PW wires: The Power Optimizers
• Results
• Conclusions





Evaluation Methodology
• Simplescalar-3.0 augmented to simulate a dynamically scheduled 4-cluster model
• Crossbar interconnects (L, B and PW wires)
    - B wires (2 cycles)
    - L wires (1 cycle)
    - PW wires (3 cycles)

[Figure: L1 D cache and clusters]







Heterogeneous Interconnects
• Intercluster global interconnect
    - 72 B wires (64 data bits and 8 control bits)
        * Repeaters sized and spaced for optimum delay
    - 18 L wires
        * Wide wires and large spacing
        * Occupy more area
        * Low latencies
    - 144 PW wires
        * Poor delay
        * High bandwidth
        * Low power







Analytical Model
RC model of the wire:

C = Ca + WsCb + Cc/S

1. Ca: fringing capacitance
2. WsCb: capacitance between adjacent metal layers
3. Cc/S: capacitance between adjacent wires

Total Power = Short-Circuit Power + Switching Power + Leakage Power





Evaluation Methodology
• Simplescalar-3.0 augmented to simulate a dynamically scheduled 16-cluster model
• Ring latencies
    - B wires (4 cycles)
    - PW wires (6 cycles)
    - L wires (2 cycles)

[Figure: I-Cache, D-cache, and LSQ connected to the clusters through a crossbar and a ring interconnect]






IPC improvements: L wires

• L wires improve performance by 4.2% on a four-cluster system and 7.1% on a sixteen-cluster system




Four Cluster System: ED2 Improvements

Link            Relative     IPC    Relative processor   Relative ED2   Relative ED2
                metal area          energy (10%)         (10%)          (20%)
144 B           1.0          0.95   100                  100            100
288 PW          1.0          0.92   97                   103.4          100.2
144 PW, 36 L    1.5          0.96   97                   95.0           92.1
288 B           2.0          0.98   103                  96.6           99.2
288 PW, 36 L    2.0          0.97   99                   94.4           93.2
144 B, 36 L     2.0          0.99   101                  93.3           94.5
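
The relative metal area column is consistent with a simple per-wire cost model: taking a B wire as 1 unit, a PW wire as 0.5, and an L wire as 4 reproduces every row. These per-wire factors are inferred from the table, not stated in the slides, and the function name is hypothetical.

```python
# Per-wire area in units of one B wire (inferred from the table, an assumption).
AREA_PER_WIRE = {"B": 1.0, "PW": 0.5, "L": 4.0}
BASELINE_LINK = {"B": 144}

def relative_metal_area(link, baseline=BASELINE_LINK):
    """Metal area of a link configuration, normalized to the 144 B-wire baseline."""
    def area(cfg):
        return sum(AREA_PER_WIRE[kind] * count for kind, count in cfg.items())
    return area(link) / area(baseline)

configs = [
    {"B": 144},
    {"PW": 288},
    {"PW": 144, "L": 36},
    {"B": 288},
    {"PW": 288, "L": 36},
    {"B": 144, "L": 36},
]
for cfg in configs:
    print(cfg, relative_metal_area(cfg))
```

Under this reading, 36 L wires cost as much metal as the entire 144 B-wire baseline, which is why each configuration that adds them grows by one full unit of area.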




Sixteen Cluster System: ED2 Gains

Link            IPC     Relative processor   Relative ED2
                        energy (20%)         (20%)
144 B           1.11    100                  100
144 PW, 36 L    1.05    94                   105.3
288 B           1.18    105                  93.1
144 B, 36 L     1.19    102                  88.7
288 B, 36 L     1.22    107                  88.7
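
The ED2 column can be approximated from the other two columns. ED2 is energy times delay squared, and for a fixed instruction count delay scales as 1/IPC, so relative ED2 is relative energy times (baseline IPC / IPC) squared. The sketch below applies this to the sixteen-cluster rows; the function name is hypothetical. It matches the tabulated values to within a few tenths, and the gap is presumably rounding in the published inputs.

```python
BASELINE_IPC = 1.11  # 144 B row of the sixteen-cluster table

def relative_ed2(relative_energy, ipc, baseline_ipc=BASELINE_IPC):
    """Relative energy * delay^2, with delay taken as proportional to 1/IPC."""
    return relative_energy * (baseline_ipc / ipc) ** 2

# (link, IPC, relative processor energy, tabulated relative ED2)
rows = [
    ("144 PW, 36 L", 1.05, 94, 105.3),
    ("288 B",        1.18, 105, 93.1),
    ("144 B, 36 L",  1.19, 102, 88.7),
    ("288 B, 36 L",  1.22, 107, 88.7),
]
for link, ipc, energy, tabulated in rows:
    print(f"{link:14s} computed {relative_ed2(energy, ipc):6.1f}  tabulated {tabulated}")
```

Because delay enters squared, the 144 B, 36 L configuration comes out ahead despite slightly higher energy, while the lower-energy 144 PW, 36 L configuration loses on ED2 because of its IPC drop.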




Conclusions
• Exposing the wire design space to the architecture
• A case for micro-architectural wire management!
• A low-latency, low-bandwidth network alone helps improve performance by up to 7%
• ED2 improvements of about 11% compared to a baseline processor with homogeneous interconnect
• Entails hardware complexity





Future work
• 3-D wire model for the interconnects
• Design of heterogeneous clusters
• Interconnects for cache coherence and L2$








Questions and Comments?



  Thank you!



