
Prediction Router:
Yet another low-latency on-chip router architecture

Hiroki Matsutani    (Keio Univ., Japan)
Michihiro Koibuchi  (NII, Japan)
Hideharu Amano      (Keio Univ., Japan)
Tsutomu Yoshinaga   (UEC, Japan)
Why is a low-latency router needed?
• Tile architecture
   – Many cores (e.g., processors & caches)
   – On-chip interconnection network [Dally, DAC’01]

[Figure: 16-core tile architecture — each core is attached to a router, and the routers form a packet-switched network.]
On-chip router affects the performance and cost of the chip
Why is a low-latency router needed?
System              Topology            Routing                  Switching            Flow ctrl
MIT RAW             2-D mesh (32bit)    XY DOR                   WH, no VC            Credit
UPMC SPIN           Fat tree (32bit)    Up*/down*                WH, no VC            Credit
QuickSilver ACM     H-Tree (32bit)      Up*/down*                1-flit, no VC        Credit
UMass Amherst aSOC  2-D mesh            Shortest-path            Pipelined CS, no VC  Timeslot
Sun T1              Crossbar (128bit)   -                        -                    Handshake
Cell BE EIB         Ring (128bit)       Shortest-path            Pipelined CS, no VC  Credit
TRIPS (operand)     2-D mesh (109bit)   YX DOR                   1-flit, no VC        On/off
TRIPS (on-chip)     2-D mesh (128bit)   YX DOR                   WH, 4 VCs            Credit
Intel SCC           2-D torus (32bit)   XY,YX DOR, odd-even TM   WH, no VC            Stall/go

The number of cores increases (e.g., 64 cores or more?)
→ the number of hops increases
→ their communication latency becomes a crucial problem.
Low-latency router architecture has been extensively studied
Outline: Prediction router for low-latency NoC
• Existing low-latency routers
   – Speculative router
   – Look-ahead router
   – Bypassing router
• Prediction router
   – Architecture and the prediction algorithms
• Hit rate analysis
• Evaluations
   –   Hit rate, gate count, and energy consumption
   –   Case study 1: 2-D mesh (small core size)
   –   Case study 2: 2-D mesh (large core size)
   –   Case study 3: Fat tree network
Wormhole router: Hardware structure

[Figure: 5-input / 5-output wormhole router — each input port (X+, X-, Y+, Y-, CORE) has a FIFO channel buffer; the arbiter issues GRANT signals and flits traverse a 5x5 crossbar to the output ports.]
 1) Selecting an output channel
 2) Arbitration for the selected output channel
 3) Sending the packet to the output channel
Routing, arbitration, & switch traversal are performed in a pipeline manner
Speculative router: VA & SA in parallel (3-cycle pipeline) [Peh, HPCA'01]
• At least 3 cycles for traversing a router
  – RC  (Routing computation)
  – VSA (Virtual channel & switch allocations)
  – ST  (Switch traversal)
  VA & SA are speculatively performed in parallel

• A packet transfer from router A to router C

  [Figure: pipeline chart — the head flit goes through RC, VSA, and ST at each of routers A, B, and C; each data flit follows with SA and ST per hop. Elapsed time: 12 cycles.]

To perform RC and VSA in parallel, look-ahead routing is used.
At least 12 cycles are needed to transfer a packet from router A to router C.
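As a quick arithmetic check of the 12-cycle figure (not from the slides, just the zero-load pipeline model they imply): with P pipeline stages per hop, H hops, and an L-flit packet,

$$T_{\text{packet}} = P \cdot H + (L - 1) = 3 \times 3 + (4 - 1) = 12 \text{ cycles},$$

since the head flit needs P cycles at each of the H routers and the remaining L-1 flits stream out one per cycle behind it.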
Look-ahead router: NRC & VSA in parallel
• At least 3 cycles for traversing a router
  – NRC (Next routing computation)
  – VSA (Virtual channel & switch allocations)
  – ST  (Switch traversal)
  VSA can be performed without waiting for NRC

  Routing computation for the next hop:
  the output port of router i+1 is selected by router i

  [Figure: pipeline chart — the head flit goes through NRC, VSA, and ST at each of routers A, B, and C; data flits follow with SA and ST. Elapsed time: 12 cycles.]
Look-ahead router: NRC & VSA in parallel
• At least 2 cycles for traversing a router
  – NRC + VSA (Next routing computation / arbitrations)
  – ST        (Switch traversal)

  No dependency between NRC & VSA → NRC & VSA in parallel [Dally's book, 2004]

  [Figure: pipeline chart — a typical 2-cycle router; at each of routers A, B, and C the head flit takes NRC+VSA and then ST, with data flits following. Elapsed time: 9 cycles.]

Packing NRC, VSA, and ST into a single stage would harm the operating frequency.
At least 9 cycles are needed to transfer a packet from router A to router C.
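Substituting P = 2 into the same zero-load model gives the 9-cycle figure, and it also shows what a 1-cycle router would buy:

$$T_{\text{packet}} = P \cdot H + (L - 1) = 2 \times 3 + 3 = 9 \text{ cycles}, \qquad P = 1 \;\Rightarrow\; 1 \times 3 + 3 = 6 \text{ cycles}.$$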
Bypassing router: skip some stages
• Bypassing between intermediate nodes
  – E.g., Express VCs [Kumar, ISCA'07]

  [Figure: virtual bypassing paths — between SRC and DST, bypassed intermediate routers take 1 cycle instead of 3.]

• Pipeline bypassing utilizing the regularity of DOR
  – E.g., Mad postman [Izu, PDP'94]
• Pipeline stages on frequently used paths are skipped
  – E.g., Dynamic fast path [Park, HOTI'07]
• Pipeline stages on user-specified paths are skipped
  – E.g., Preferred path [Michelogiannakis, NOCS'07]
  – E.g., DBP [Koibuchi, NOCS'08]
We propose a low-latency router based on multiple predictors
Outline: Prediction router for low-latency NoC
• Existing low-latency routers
   – Speculative router
   – Look-ahead router
   – Bypassing router
• Prediction router
   – Architecture and the prediction algorithms
• Hit rate analysis
• Evaluations
   –   Hit rate, gate count, and energy consumption
   –   Case study 1: 2-D mesh (small core size)
   –   Case study 2: 2-D mesh (large core size)
   –   Case study 3: Fat tree network
Prediction router: for 1-cycle transfer [Yoshinaga, IWIA'06][Yoshinaga, IWIA'07]
• Each input channel has predictors
• When an input channel is idle,
  – Predict an output port to be used (RC pre-execution)
  – Arbitrate for the predicted port (SA pre-execution)
  RC & VSA are skipped if the prediction hits → 1-cycle transfer

  [Figure: pipeline chart — the head flit goes through RC, VSA, and ST at routers A, B, and C, while data flits follow with ST only.]

E.g., we can expect a 1.6-cycle transfer if 70% of the predictions hit.
Prediction router: for 1-cycle transfer (continued)
  RC & VSA are skipped if the prediction hits → 1-cycle transfer

  [Figure: pipeline chart — the prediction misses at router A (the head flit takes RC, VSA, and ST) but hits at routers B and C (the head flit takes only ST); data flits stream behind, so the packet arrives by cycle 8 instead of 12.]

E.g., we can expect a 1.6-cycle transfer if 70% of the predictions hit.
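The 1.6-cycle figure is just the expected per-hop cost under the stated 1-cycle-hit / 3-cycle-miss behaviour:

$$E[\text{cycles per hop}] = p_{\text{hit}} \cdot 1 + (1 - p_{\text{hit}}) \cdot 3 = 3 - 2\,p_{\text{hit}} = 3 - 2 \times 0.7 = 1.6.$$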
Prediction router: Prediction algorithms [Yoshinaga, IWIA'06][Yoshinaga, IWIA'07]
• An efficient predictor is key
  A single predictor isn't enough for applications with different traffic patterns
• Prediction router
  – Multiple predictors for each input channel (predictors A, B, C, ...)
  – Select one of them in response to a given network environment

The six prediction algorithms:
 1. Random
 2. Static Straight (SS): an output channel on the same dimension is selected (exploiting the regularity of DOR)
 3. Custom: the user can specify which output channel is accelerated
 4. Latest Port (LP): the previously used output channel is selected
 5. Finite Context Method (FCM) [Burtscher, TC'02]: the most frequently appearing pattern of an n-context sequence (n = 0, 1, 2, ...)
 6. Sampled Pattern Match (SPM) [Jacquet, TIT'02]: pattern matching using a record table
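The slides only name the algorithms; the sketch below is a minimal Python illustration (not the authors' RTL) of how a Latest Port predictor and a simple FCM-style predictor could track the output-port history of one input channel. The class names and the toy traffic are hypothetical.

```python
from collections import Counter, defaultdict

class LatestPortPredictor:
    """LP: predict the output port that this input channel used last."""
    def __init__(self):
        self.last_port = None

    def predict(self):
        return self.last_port          # None until the first packet arrives

    def update(self, actual_port):
        self.last_port = actual_port


class FCMPredictor:
    """FCM(n): predict the port that most frequently followed the
    current n-port context in the history seen so far."""
    def __init__(self, n=1):
        self.n = n
        self.context = ()                      # last n output ports
        self.freq = defaultdict(Counter)       # context -> port counts

    def predict(self):
        counts = self.freq.get(self.context)
        if not counts:
            return None
        return counts.most_common(1)[0][0]

    def update(self, actual_port):
        self.freq[self.context][actual_port] += 1
        self.context = (self.context + (actual_port,))[-self.n:]


# Toy usage: alternating X+ / Y+ traffic on one input channel.
lp, fcm = LatestPortPredictor(), FCMPredictor(n=1)
hits = {"LP": 0, "FCM": 0}
for actual in ["X+", "Y+"] * 50:
    if lp.predict() == actual:
        hits["LP"] += 1
    if fcm.predict() == actual:
        hits["FCM"] += 1
    lp.update(actual)
    fcm.update(actual)
print(hits)   # FCM(1) learns the alternation; LP always misses on this pattern
```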
Basic operation @ correct prediction
  Idle state: output port X+ is selected and reserved
  1st cycle:  the incoming flit is transferred to X+ without RC and VSA
  1st cycle:  RC is performed → the prediction is correct!
  2nd cycle:  the next flit is transferred to X+ without RC and VSA

  [Figure: router datapath — predictors A, B, and C sit on the X+ input channel; the arbiter has reserved the crossbar port toward X+, which turns out to be the correct output.]
1-cycle transfer using the reserved crossbar-port when prediction hits
Basic operation @ misprediction
  Idle state: output port X+ is selected and reserved
  1st cycle:  the incoming flit is transferred to X+ without RC and VSA
  1st cycle:  RC is performed → the prediction is wrong! (X- is correct)
              A kill signal to X+ is asserted
  2nd/3rd cycle: the dead flit is removed; the flit is retransmitted to the correct port

  [Figure: router datapath — the dead flit sent to X+ is killed and the flit is retransmitted to the correct output X-; the retransmission consumes extra energy.]

Even on a misprediction, a flit is transferred in 3 cycles, the same as in the original router.
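Putting the last two slides together, the per-packet control decision can be summarized behaviourally as below. This is a minimal sketch, not the RTL; `flit`, `predictor`, `rc`, and `crossbar` are hypothetical objects, and the cycle counts simply restate the 1-cycle-hit / 3-cycle-miss behaviour.

```python
def forward_head_flit(flit, predictor, rc, crossbar):
    """Behavioural sketch of the prediction router's head-flit handling.

    While the input channel is idle, an output port has already been
    predicted and the crossbar port tentatively reserved.  When the head
    flit arrives it is sent toward the predicted port immediately; RC runs
    in the same cycle and either confirms the prediction (hit, 1 cycle)
    or kills the dead flit and retransmits (miss, 3 cycles as in the
    original router)."""
    predicted = predictor.predict()      # done while idle (RC/SA pre-execution)
    correct = rc(flit.destination)       # real routing computation

    if predicted == correct:
        crossbar.send(flit, correct)     # flit is already on its way
        cycles = 1                       # RC & VSA were skipped
    else:
        crossbar.kill(predicted)         # assert KILL, remove the dead flit
        crossbar.send(flit, correct)     # retransmit to the correct port
        cycles = 3                       # same latency as the original router

    predictor.update(correct)            # train the predictor for the next packet
    return cycles
```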
Outline: Prediction router for low-latency NoC
• Existing low-latency routers
   – Speculative router
   – Look-ahead router
   – Bypassing router
• Prediction router
   – Architecture and the prediction algorithms
• Hit rate analysis
• Evaluations
   –   Hit rate, gate count, and energy consumption
   –   Case study 1: 2-D mesh (small core size)
   –   Case study 2: 2-D mesh (large core size)
   –   Case study 3: Fat tree network
   Prediction hit rate analysis
• Formulas to calculate the prediction hit rates on
  – 2-D torus (Random, LP, SS, FCM, and SPM)
  – 2-D mesh (Random, LP, SS, FCM, and SPM)
  – Fat tree (Random and LRU)

  – To forecast which prediction algorithm is suited for a
    given network environment w/o simulations


• Accuracy of the analytical model is confirmed
  through simulations
    Derivation of the formulas is omitted in this talk
      (See “Section 4” of our paper for more detail)
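As one concrete illustration of the kind of closed form such an analysis yields (this is not the paper's Section 4 derivation, only the simplest case): a predictor that picks uniformly among the k legal output ports of an input channel hits with probability 1/k regardless of the traffic pattern, so

$$P_{\text{hit}}^{\text{random}} = \frac{1}{k}, \qquad E[\text{cycles per hop}] = 3 - 2\,P_{\text{hit}} = 3 - \frac{2}{k}.$$

For a 5-port mesh router that never predicts the incoming port (k = 4), this gives a 25% hit rate and 2.5 cycles per hop.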
Outline: Prediction router for low-latency NoC
• Existing low-latency routers
   – Speculative router
   – Look-ahead router
   – Bypassing router
• Prediction router
   – Architecture and the prediction algorithms
• Hit rate analysis
• Evaluations
   –   Hit rate, gate count, and energy consumption
   –   Case study 1: 2-D mesh (small core size)
   –   Case study 2: 2-D mesh (large core size)
   –   Case study 3: Fat tree network
Evaluation items

[Figure: evaluation flow — flit-level network simulation gives the hit rate and communication latency; Verilog designs are synthesized with Design Compiler (Fujitsu 65nm library) and placed & routed with Astro; NC-Verilog simulation produces SAIF/SDF for Power Compiler, giving the area (gate count) and energy consumption [pJ/bit].]

Table 1: Router & network parameters
  Packet length         4 flits (1 flit = 64 bits)
  Switching technique   Wormhole
  Channel buffer size   4 flits / VC
  Number of VCs         1 or 2 VCs
  Cycles / hop (miss)   3 stages
  Cycles / hop (hit)    1 stage
  * Topology and traffic are mentioned later

Table 2: Process library
  CMOS process   65nm
  Core voltage   1.20V
  Temperature    25°C

Table 3: CAD tools used
  Design Compiler   2006.06
  Astro             2007.03
3 case studies of the prediction router

[Figure: the same evaluation flow as above, applied to two target networks: the 2-D mesh (case studies 1 & 2) and the fat tree (case study 3).]

• 2-D mesh network (case studies 1 & 2)
  – The most popular network topology: MIT's RAW [Taylor, ISCA'04], Intel's 80-core [Vangal, ISSCC'07]
  – Dimension-order routing (XY routing)
  – Here, we show the results of case studies 1 and 2 together
• Fat tree network (case study 3)
Case study 1: Zero-load communication latency
• Original router
• Pred router (SS)
• Pred router (100% hit)
(*) 1-cycle transfer for a correct prediction, 3 cycles for a wrong prediction
Uniform random traffic on 4x4 to 16x16 meshes

[Figure: communication latency [cycles] vs. network size (k-ary 2-mesh, up to k=16); simulation results, and the analytical model shows the same result. Latency is reduced by 35.8% for 8x8 cores and by 48.2% for 16x16 cores.]

More latency is saved (48% for the 16x16 mesh) as the network size increases.
Case study 2: Hit rate @ 8x8 mesh
• SS: go straight → efficient for long straight communication
• LP: the last one → efficient for short repeated communication
• FCM: frequently used pattern → an all-rounder!

[Figure: prediction hit rate [%] for 7 NAS Parallel Benchmark programs and 4 synthetic traffic patterns.]

• Existing bypassing routers use
  – Only a static or a single bypassing policy
    However, the effective bypassing policy depends on the traffic pattern...
• The prediction router supports
  – Multiple predictors, which can be switched in a cycle
  – To accelerate a wider range of applications
Case study 2: Area & Energy
• Area (gate count)
  – Original router
  – Pred router (SS + LP): light-weight (small overhead)
  – Pred router (SS + LP + FCM): FCM is an all-rounder, but requires counters
  – Verilog-HDL designs synthesized with the 65nm library
• Energy consumption
  – Original router
  – Pred router (70% hit)
  – Pred router (100% hit)
  – This estimation is pessimistic:
    1. More energy is consumed in the links → the effect of the router energy overhead is reduced
    2. The application will finish earlier → more energy is saved

[Figure: router area [kilo gates] and flit switching energy [pJ/bit] for each configuration.]

Area: increased by 6.4-15.9%, depending on the type and number of predictors.
Energy: a misprediction consumes extra power; 9.5% increase if the hit rate is 70%.
Latency is reduced by 35.8%-48.2% with reasonable area/energy overheads.
3 case studies of the prediction router (continued)

[Figure: the same evaluation flow, now applied to the fat tree network (case study 3).]
Case study 3: Fat tree network

[Figure: fat tree topology — packets travel up toward a common ancestor and then down to the destination.]

1. LRU algorithm: the least recently used output port is selected for the upward transfer
2. LRU + LP algorithm: in addition, LP is used for the downward transfer
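A minimal sketch of the LRU idea for upward transfers, assuming any up-link is a legal choice in the fat tree; the class and port names are hypothetical, not from the paper.

```python
class LRUUpPortPredictor:
    """LRU policy for upward transfers in a fat tree: among the up-links
    of this router, predict (and pre-reserve) the one used least recently."""
    def __init__(self, up_ports):
        # The least recently used port is kept at the front of the list.
        self.order = list(up_ports)

    def predict(self):
        return self.order[0]             # least recently used up-link

    def update(self, used_port):
        # Move the port that was actually used to the most-recent end.
        self.order.remove(used_port)
        self.order.append(used_port)

# Example: a router with four up-links.
pred = LRUUpPortPredictor(["up0", "up1", "up2", "up3"])
pred.update("up0")
print(pred.predict())   # -> "up1" (now the least recently used)
```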
Case study 3: Fat tree network
• Communication latency @ uniform traffic
  – Original router
  – Pred router (LRU)
  – Pred router (LRU + LP)

[Figure: communication latency [cycles] vs. network size (# of cores).]

Latency is reduced by 30.7% at 256 cores, with a small area overhead (7.8%).
Summary of the prediction router
• Prediction router for low-latency NoCs
  – Multiple predictors, which can be switched in a cycle
  – Architecture and six prediction algorithms
  – Analytical model of the prediction hit rates
• Evaluations of the prediction router
  – Case study 1: 2-D mesh (small core size)
  – Case study 2: 2-D mesh (large core size)
  – Case study 3: Fat tree network
  From the three case studies: latency reduction up to 48% (case studies 1 & 2), area overhead 6.4% (SS+LP), energy overhead 9.5% (worst case)
• Results
  1. The prediction router can be applied to various NoCs
  2. Communication latency is reduced with small overheads
  3. The prediction router with multiple predictors can accelerate a wider range of applications
Thank you for your attention.

It would be very helpful if you would speak slowly. Thank you in advance.
Prediction router: New modifications
• Predictors for each input channel
• A kill mechanism to remove dead flits
• A two-level arbiter
  – "Reservation" → higher priority
  – "Tentative reservation" by the pre-execution of VSA

[Figure: router datapath with per-input-channel predictors (A, B, C), KILL signals, the arbiter, and the 5x5 crossbar; currently, the critical path is related to the arbiter.]
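A minimal sketch of one way such a two-level priority could be resolved, assuming only what the slide states: confirmed reservations (real packets that finished RC) outrank tentative reservations made by the predictors' pre-executed VSA. The function and field names are hypothetical.

```python
def arbitrate(requests, free_ports):
    """Two-level arbitration pass (sketch).

    `requests` is a list of (input_channel, output_port, level) tuples,
    where level is "reserve" (confirmed) or "tentative" (predicted)."""
    grants = {}
    # First level: confirmed reservations get the output ports they ask for.
    for in_ch, out_port, level in requests:
        if level == "reserve" and out_port in free_ports and out_port not in grants:
            grants[out_port] = in_ch
    # Second level: tentative (predicted) reservations take whatever is left.
    for in_ch, out_port, level in requests:
        if level == "tentative" and out_port in free_ports and out_port not in grants:
            grants[out_port] = in_ch
    return grants   # output_port -> granted input channel

# Example: a tentative reservation of X+ loses to a confirmed one.
print(arbitrate([("Y-", "X+", "tentative"), ("CORE", "X+", "reserve")],
                free_ports={"X+", "X-", "Y+", "Y-"}))
# -> {'X+': 'CORE'}
```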
Prediction router: Predictor selection
• Static scheme
  – A predictor is selected by the user per application
  – Configuration table: e.g., Application 1 → Predictor B, Application 2 → Predictor A, Application 3 → Predictor C
  – Simple, but pre-analysis is needed
• Dynamic scheme
  – A predictor is adaptively selected
  – Each predictor's counter is incremented when it hits (e.g., Predictor A: 100, Predictor B: 80, Predictor C: 120)
  – A predictor is selected every n cycles (e.g., n = 10,000)
  – Flexible, but consumes more energy
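A sketch of the dynamic scheme as described on this slide: run all predictors in the background, count how often each would have hit, and switch to the best one every n cycles. The class and method names are illustrative, not from the paper, and the predictor objects are assumed to expose the `predict()`/`update()` interface sketched earlier.

```python
class DynamicPredictorSelector:
    """Counter-based dynamic predictor selection (sketch)."""
    def __init__(self, predictors, interval=10_000):
        self.predictors = predictors            # e.g., {"A": pA, "B": pB, "C": pC}
        self.hits = {name: 0 for name in predictors}
        self.interval = interval
        self.cycle = 0
        self.selected = next(iter(predictors))  # start with an arbitrary predictor

    def predict(self):
        # Only the selected predictor's output is used to pre-reserve the crossbar.
        return self.predictors[self.selected].predict()

    def on_packet(self, actual_port):
        # Every predictor is trained and scored, even while not selected.
        for name, p in self.predictors.items():
            if p.predict() == actual_port:
                self.hits[name] += 1
            p.update(actual_port)

    def tick(self):
        self.cycle += 1
        if self.cycle % self.interval == 0:
            self.selected = max(self.hits, key=self.hits.get)
            self.hits = {name: 0 for name in self.hits}   # restart the sampling window
```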
Case study 1: Router critical path
• RC: Routing computation
• VSA: Arbitration
• ST: Switch traversal
  ST can occur in any of these stages of the prediction router

[Figure: stage delay [FO4s] of the original router vs. the prediction router (SS); the critical path delay increases by 6.2% compared with the original router.]
Case study 2: Hit rate @ 8x8 mesh (with the Custom predictor)
• SS: go straight → efficient for long straight communication
• LP: the last one → efficient for short repeated communication
• FCM: frequently used pattern → an all-rounder!
• Custom: user-specified path → efficient for simple communication

[Figure: prediction hit rate [%] for 7 NAS Parallel Benchmark programs and 4 synthetic traffic patterns.]
Case study 4: Spidergon network
• Spidergon topology
  – Ring + across links [Coppola, ISSOC'04]
  – Each router has 3 ports
  – Mesh-like 2-D layout
  – Across-first routing
• Hit rate @ uniform traffic
  – SS: go straight
  – LP: the last used one
  – FCM: the frequently used one

[Figure: prediction hit rate [%] vs. network size (# of cores); the hit rates of SS and FCM are almost the same.]

A high hit rate is achieved (80% for 64 cores; 94% for 256 cores).
4 case studies of the prediction router

[Figure: the same evaluation flow, applied to the 2-D mesh (case studies 1 & 2), the fat tree (case study 3), and the Spidergon (case study 4) networks.]

								