High-Throughput Asynchronous Pipelines for Fine-Grain Dynamic

A Classic Asynchronous Dynamic Pipeline

    Williams and Horowitz’s PS0 pipeline:
     Structure
     Operation
     Performance




   A Classic Approach: PS0 Pipeline
Williams/Horowitz (Stanford U.) [1986-91]:
    successfully used in fabricated chips [Stanford ’87] [HAL ’90s]


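[Figure: three-stage pipeline (Stage 1, Stage 2, Stage 3) from “Data in” to “Data out”. Each stage contains a processing block and a completion detector; data flows forward and “ack” flows back to the preceding stage.]

Implemented using “dynamic logic”
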
PS0 Pipeline Stage

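A PS0 stage consists of dynamic gates and a completion detector:

[Figure: PS0 stage. The processing block is a dynamic gate with a precharge control input (PC), a pull-down network driven by the data inputs, and a “keeper” on the output. The data outputs feed the completion detector, which generates ack.]
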
Dual-Rail Completion Detector
 Combines dual-rail signals
 Indicates when all bits are valid (or reset)
 OR together 2 rails per bit
 Merge results using “C-element”

 C-element:
    if all inputs = 1, output → 1
    if all inputs = 0, output → 0
    else, maintain output value

[Figure: the two rails of each bit (bit0, bit1, ..., bitn) feed an OR gate; the OR outputs feed a C-element (C) that produces Done.]

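The detector's behavior can be sketched in software. The Python model below is illustrative only: the names CElement and completion_detect are hypothetical, each bit is assumed to be a (true_rail, false_rail) pair, and gate delays are ignored.

```python
class CElement:
    """Muller C-element: output follows the inputs when they all agree, else holds."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, inputs):
        if all(v == 1 for v in inputs):
            self.output = 1
        elif all(v == 0 for v in inputs):
            self.output = 0
        # Mixed inputs: keep the previous output value.
        return self.output


def completion_detect(c_element, bits):
    """Return Done for a list of dual-rail bits given as (true_rail, false_rail)."""
    per_bit_valid = [t | f for (t, f) in bits]  # OR together the 2 rails of each bit
    return c_element.update(per_bit_valid)      # merge the results with the C-element


c = CElement()
trace = [
    completion_detect(c, [(0, 0), (0, 0)]),  # all bits reset
    completion_detect(c, [(1, 0), (0, 0)]),  # only one bit valid: output holds
    completion_detect(c, [(1, 0), (0, 1)]),  # all bits valid
    completion_detect(c, [(0, 0), (0, 1)]),  # only one bit reset: output holds
    completion_detect(c, [(0, 0), (0, 0)]),  # all bits reset again
]
```

Done rises only once every bit is valid and falls only once every bit is reset, which is why a single signal can report both "evaluation complete" and "precharge complete".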
PS0 Protocol
  PRECHARGE N: when N+1 completes evaluation
     delete data: after next stage has copied it
  EVALUATE N: when N+1 completes precharging
     accept new data: after next stage is emptied

[Figure: stages N, N+1, N+2 with numbered events:
   1  N evaluates
   2  N+1 evaluates
   3  N+2 evaluates
   4  N+2's completion detector indicates “done”
   5  N+1 precharges
   6  N+1's completion detector indicates “done”]

  Complete cycle: 6 events
     Evaluate → Precharge: 3 events
     Precharge → Evaluate: another 3 events

PS0 Performance
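[Figure: the six events of one cycle, numbered 1 to 6: three evaluations (1, 2, 3), a completion detection (4), a precharge (5), and a second completion detection (6).]

   Cycle Time = 3 T_EVAL + T_PRECH + 2 T_DETECT

          T_EVAL   = Evaluation Time
          T_PRECH  = Precharge Time
          T_DETECT = Completion Detection Time
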
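The cycle time can be checked with a small timing model of the two protocol rules (precharge N when N+1 completes evaluation; evaluate N when N+1 completes precharging). This is a rough sketch, not the authors' analysis. The function name simulate_ps0 is hypothetical, and it assumes identical delays in every stage, a pipeline that starts empty, a source that always has the next item ready, and a sink that copies the last stage's output instantly.

```python
def simulate_ps0(num_stages, num_items, t_eval, t_prech, t_detect):
    """Return eval_done[n][k]: time at which stage n finishes evaluating item k."""
    eval_done = [[0] * num_items for _ in range(num_stages)]
    prech_done = [[0] * num_items for _ in range(num_stages)]
    last = num_stages - 1
    for k in range(num_items):
        for n in range(num_stages):
            # Inputs are valid once the previous stage has evaluated item k
            # (assumption: the source has every item ready at time 0).
            data_ready = eval_done[n - 1][k] if n > 0 else 0
            if k == 0:
                enabled = 0  # pipeline starts empty, all stages precharged
            else:
                # EVALUATE N: when N+1's detector reports precharge complete
                # (assumption: the sink after the last stage is always ready).
                enabled = prech_done[n + 1][k - 1] + t_detect if n < last else 0
                # A stage must also have finished its own previous precharge.
                enabled = max(enabled, prech_done[n][k - 1])
            eval_done[n][k] = max(data_ready, enabled) + t_eval
        for n in range(num_stages):
            # PRECHARGE N: when N+1's detector reports evaluation complete
            # (assumption: the ideal sink copies the last stage's data instantly).
            next_eval = eval_done[n + 1][k] if n < last else eval_done[n][k]
            prech_done[n][k] = next_eval + t_detect + t_prech
    return eval_done


times = simulate_ps0(num_stages=6, num_items=5, t_eval=2, t_prech=2, t_detect=1)
measured_cycle = times[0][1] - times[0][0]
formula_cycle = 3 * 2 + 2 + 2 * 1
```

With these example delays, successive items leave the first stage 10 time units apart, which matches 3 T_EVAL + T_PRECH + 2 T_DETECT.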
Summary: PS0 Pipelining
Datapaths are latch-free:
    dynamic gates themselves provide implicit latches
      +: chip area savings
      +: extremely low latency

Data items kept separate by control
    stage deletes data: only after next stage has copied it
    stage accepts new data: only if next stage is empty
    distinct data items always separated by “spacers”


Control is extremely simple: each controller = single wire
    completion detector directly controls previous stage
      +: chip area savings
      +: low control overhead
Comparison to a Clocked Pipeline
How would you design the pipeline if you actually had a clock?
1. Replace handshaking with “magic clocking”
    each stage gets its own clock
    successive clocks are slightly skewed
          essentially, a clocked simulation of asynchronous handshaking!
    need multiple clock phases!

2. Use a single clock, but insert latches between stages
    latches are simple, level-sensitive
    consecutive stages receive complementary clock signals

[Figure: a latch between stages; consecutive stages are clocked by Ck and Ck’.]

Comparison … (contd.)
Cycle Times?




Drawbacks of PS0 Pipelining
1. Poor throughput:
    long cycle time: 6 events per cycle
    data “tokens” are forced far apart in time

2. Limited storage capacity:
    at most 50% of stages can hold distinct tokens
    data tokens must be separated by at least one spacer
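
The storage limit can be illustrated with an untimed model of the PS0 control rule. The sketch below is a simplification, not a circuit-level model. The names step and fill_stalled_pipeline are hypothetical; every stage is assumed to update in lockstep with unit delay, the source offers a new item whenever the first stage is empty, and the sink is stalled (it never acknowledges).

```python
def step(stages, source):
    """Advance all stages by one lockstep update (1 = holds data, 0 = precharged)."""
    n = len(stages)
    new = []
    for i in range(n):
        # Assumption: the stalled sink never reports "done" to the last stage.
        succ = stages[i + 1] if i + 1 < n else 0
        pred = stages[i - 1] if i > 0 else source
        if succ:
            new.append(0)                 # PRECHARGE: next stage has copied the data
        else:
            new.append(stages[i] | pred)  # EVALUATE: hold own data or accept new data
    # Assumption: the source resets once stage 0 has copied its item, else offers data.
    new_source = 0 if stages[0] else 1
    return new, new_source


def fill_stalled_pipeline(num_stages, steps=50):
    """Feed an initially empty pipeline whose output is blocked; return the final state."""
    stages, source = [0] * num_stages, 1
    for _ in range(steps):
        stages, source = step(stages, source)
    return stages


final = fill_stalled_pipeline(6)
```

In this model a six-stage pipeline settles at [0, 1, 0, 1, 0, 1]: three tokens, each separated from the next by a precharged spacer stage.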


Our Research Goals: address both issues
        still maintain very low latency





				