					                                 Pipelining

                       “Between 411 problem sets, I haven’t had a minute to do laundry.”

                       “Now that’s what I call dirty laundry.”




                                                                             Read Chapter 4.5-4.6


Comp 411 – Fall 2009                       11/16/2009                                      L16 – Pipelining 1
         Forget 411… Let’s Solve a “Relevant Problem”


                       INPUT: dirty laundry
                       OUTPUT: 4 more weeks

                       Device: Washer
                       Function: Fill, Agitate, Spin
                       WasherPD = 30 mins

                       Device: Dryer
                       Function: Heat, Spin
                       DryerPD = 60 mins




                          One Load at a Time
          Everyone knows that the real reason that UNC students put off doing
          laundry so long is *not* because they procrastinate, are lazy, or even
          have better things to do.

          The fact is, doing laundry one load at a time is not smart.

          (Sorry Mom, but you were wrong about this one!)

          [Figure: Step 1, Step 2]

          Total = WasherPD + DryerPD
                = 90 mins


                       Doing N Loads of Laundry
          Here’s how they do laundry at Duke, the “combinational” way.

          (Actually, this is just an urban legend. No one at Duke actually does
          laundry. The butlers all arrive on Wednesday morning, pick up the dirty
          laundry, and return it all pressed and starched by dinner.)

          [Figure: Step 1, Step 2, Step 3, Step 4, …]

          Total = N*(WasherPD + DryerPD)
                = N*90 mins


                       Doing N Loads… the UNC way
          UNC students “pipeline” the laundry process.

          That’s why we wait!

          [Figure: Step 1, Step 2, Step 3, …]

          Total = N * Max(WasherPD, DryerPD)
                = N*60 mins

          Actually, it’s more like N*60 + 30 if we account for the startup
          transient correctly. When doing pipeline analysis, we’re mostly
          interested in the “steady state” where we assume we have an infinite
          supply of inputs.
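
The two totals (one load at a time versus pipelined) can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the pipelined version includes the startup transient rather than the steady-state N*60 figure.

```python
WASHER_PD = 30  # minutes, from the slides
DRYER_PD = 60   # minutes, from the slides

def one_at_a_time(n_loads):
    """Each load is fully washed and dried before the next one starts."""
    return n_loads * (WASHER_PD + DRYER_PD)

def pipelined(n_loads):
    """Wash the next load while the current one dries.

    The first load pays the full washer + dryer time; every later load
    finishes one bottleneck interval after the one before it.
    """
    if n_loads == 0:
        return 0
    bottleneck = max(WASHER_PD, DRYER_PD)
    return WASHER_PD + DRYER_PD + (n_loads - 1) * bottleneck

for n in (1, 4, 10):
    print(n, one_at_a_time(n), pipelined(n))
# 1 90 90
# 4 360 270
# 10 900 630
```

For N = 4 the pipelined total is 4*60 + 30 = 270 mins, which matches the corrected formula on the slide.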




            Recall Our Performance Measures
          Latency:
                The delay from when an input is established until the
                output associated with that input becomes valid.

                (Duke Laundry = 90 mins)
                ( UNC Laundry = 120 mins)

                Assuming that the wash is started as soon as possible and waits
                (wet) in the washer until the dryer is available.

          Throughput:
                The rate at which inputs or outputs are processed.

                (Duke Laundry = 1/90 outputs/min)
                ( UNC Laundry = 1/60 outputs/min)

                Even though we increase latency, it takes less time per load.
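
Where does the 120-minute UNC latency come from? A small simulation makes it visible. This is a sketch under the slide's stated assumption (one washer, one dryer, and a washed load sits in the washer until the dryer is free); the function and variable names are illustrative.

```python
WASHER_PD = 30  # minutes
DRYER_PD = 60   # minutes

def simulate(n_loads):
    """Return a (wash_start, dry_done) pair, in minutes, for each load."""
    schedule = []
    washer_free = 0
    dryer_free = 0
    for _ in range(n_loads):
        wash_start = washer_free                 # start as soon as possible
        wash_done = wash_start + WASHER_PD
        dry_start = max(wash_done, dryer_free)   # wait (wet) for the dryer
        dry_done = dry_start + DRYER_PD
        washer_free = dry_start                  # washer is emptied only now
        dryer_free = dry_done
        schedule.append((wash_start, dry_done))
    return schedule

loads = simulate(5)
latencies = [done - start for start, done in loads]
gaps = [later[1] - earlier[1] for earlier, later in zip(loads, loads[1:])]
print(latencies)  # [90, 120, 120, 120, 120]
print(gaps)       # [60, 60, 60, 60]
```

Only the first load sees the 90-minute latency. After that, every load waits 30 minutes for the dryer, so steady-state latency is 120 mins while one load finishes every 60 mins.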

                       Okay, Back to Circuits…

          [Figure: input X feeds blocks F and G; their outputs feed H, which
          produces P(X)]

          For combinational logic:
                latency = tPD,
                throughput = 1/tPD.

          We can’t get the answer faster, but are we making effective use of
          our hardware at all times?

          [Timing diagram: X, F(X), G(X), P(X)]

          F & G are “idle”, just holding their outputs stable while H performs
          its computation.

                                  Pipelined Circuits
          Use registers to hold H’s input stable!

          [Figure: X feeds F (15) and G (20); registers hold their outputs
          stable for H (25), which produces P(X)]

          Now F & G can be working on input Xi+1 while H is performing its
          computation on Xi. We’ve created a 2-stage pipeline: if we have a
          valid input X during clock cycle j, P(X) is valid during clock j+2.

          Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are
          using ideal zero-delay registers (ts = 0, tpd = 0):

                                     latency (ns)    throughput (1/ns)
               unpipelined               45               1/45
               2-stage pipeline          50               1/25
                                       (worse)          (better)

          Pipelining uses registers to improve the throughput of combinational
          circuits.
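
The table entries follow from two rules: the clock period is set by the slowest stage, and latency is the clock period times the number of stages. A minimal sketch, assuming ideal zero-delay registers as on the slide (the function name is illustrative):

```python
def pipeline_metrics(stage_delays_ns):
    """Return (latency_ns, throughput_per_ns) for a pipeline.

    Each entry of stage_delays_ns is the propagation delay of one stage,
    meaning the slowest path through that stage.
    """
    clock_period = max(stage_delays_ns)          # slowest stage sets the clock
    latency = clock_period * len(stage_delays_ns)
    throughput = 1 / clock_period
    return latency, throughput

F, G, H = 15, 20, 25  # ns, from the slide

# Unpipelined: one "stage" whose delay is the longest path, max(F, G) then H.
unpipelined = pipeline_metrics([max(F, G) + H])

# 2-stage: F and G side by side in stage 1, H alone in stage 2.
two_stage = pipeline_metrics([max(F, G), H])

print(unpipelined)  # latency 45 ns, throughput 1/45 per ns
print(two_stage)    # latency 50 ns, throughput 1/25 per ns
```

Stage 1 only needs 20 ns but is clocked at 25 ns to match H, which is why latency grows from 45 to 50 ns.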

                                               Pipeline Diagrams
          [Figure: X feeds F (15) and G (20), which feed H (25), producing P(X)]

          This is an example of parallelism. At any instant we are computing
          2 results.

          Pipeline stages by clock cycle:

               Clock cycle    i      i+1      i+2        i+3        …
               Input          Xi     Xi+1     Xi+2       Xi+3       …
               F Reg                 F(Xi)    F(Xi+1)    F(Xi+2)    …
               G Reg                 G(Xi)    G(Xi+1)    G(Xi+2)    …
               H Reg                          H(Xi)      H(Xi+1)    H(Xi+2)

          The results associated with a particular set of input data move
          diagonally through the diagram, progressing through one pipeline
          stage each clock cycle.
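
A pipeline diagram like this one can be generated mechanically. The sketch below is illustrative only: inputs are numbered X0, X1, … instead of Xi, Xi+1, …, and each register is described by a hypothetical "depth", the number of clock cycles after an input arrives at which that register holds its result.

```python
def pipeline_diagram(stages, n_inputs):
    """Build the rows of a pipeline diagram as lists of strings.

    stages is a list of (register_name, depth) pairs. Column 0 of each row
    is the row label; column c + 1 is the content during clock cycle c.
    """
    n_cycles = n_inputs + max(depth for _, depth in stages)
    inputs = [f"X{k}" if k < n_inputs else "" for k in range(n_cycles)]
    rows = [["Input"] + inputs]
    for name, depth in stages:
        cells = []
        for cycle in range(n_cycles):
            k = cycle - depth  # which input this register holds, if any
            cells.append(f"{name}(X{k})" if 0 <= k < n_inputs else "")
        rows.append([f"{name} Reg"] + cells)
    return rows

# F and G register their results one cycle after the input, H two cycles after.
rows = pipeline_diagram([("F", 1), ("G", 1), ("H", 2)], n_inputs=4)
for row in rows:
    print("".join(cell.ljust(8) for cell in row))
```

Reading down and to the right from any input follows that input's results along the diagonal.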

                              Pipelining Summary
          Advantages:
                  – Higher throughput than combinational system
                  – Different parts of the logic work on different parts of the
                    problem…


          Disadvantages:
                  – Generally, increases latency
                  – Only as good as the *weakest* link
                    (often called the pipeline’s BOTTLENECK)

          This bottleneck is the only problem.




          Isn’t there a way around this “weak link” problem?


            How do UNC students REALLY do Laundry?

          They work around the bottleneck. First, they find a place with twice
          as many dryers as washers.

          [Figure: Step 1, Step 2, Step 3, Step 4]

          Throughput = 1/30 loads/min
          Latency = 90 mins/load




                        Better Yet… Parallelism

          We can combine interleaving and pipelining with parallelism.

          [Figure: Step 1, Step 2, Step 3, Step 4, Step 5]

          Throughput = 2/30 = 1/15 load/min
          Latency = 90 min
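
The throughput figures for the interleaved and parallel laundromats follow from one rule: each stage can sustain (number of machines) / (time per load), and the slowest stage limits the whole system. A sketch of that rule; the function name is illustrative, and the machine counts for the parallel case (2 washers, 4 dryers) are an assumption chosen to match the 2/30 figure.

```python
from fractions import Fraction

def throughput(stages):
    """Steady-state loads per minute.

    stages is a list of (minutes_per_load, machine_count) pairs, one per
    stage. Each stage sustains machine_count / minutes_per_load; the
    slowest stage is the bottleneck.
    """
    return min(Fraction(count, minutes) for minutes, count in stages)

one_dryer = throughput([(30, 1), (60, 1)])    # plain pipelining
interleaved = throughput([(30, 1), (60, 2)])  # twice as many dryers
parallel = throughput([(30, 2), (60, 4)])     # assumed: 2 washers, 4 dryers

print(one_dryer, interleaved, parallel)  # 1/60 1/30 1/15

# With the stages balanced, no load waits for a machine, so latency is
# simply the washer time plus the dryer time.
latency = 30 + 60
print(latency)  # 90
```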




                         “Classroom Computer”
              There are lots of problem sets to grade, each with six problems.
              Students in Row 1 grade Problem 1 and then hand it back to Row 2
              for grading Problem 2, and so on… Assuming we want to pipeline
              the grading, how do we time the passing of papers between rows?




       Psets in → Row 1 → Row 2 → Row 3 → Row 4 → Row 5 → Row 6


           Controls for “Classroom Computer”
          Globally Timed, Synchronous:
                Teacher picks time interval long enough for worst-case student
                to grade toughest problem. Everyone passes psets at end of
                interval.

          Globally Timed, Asynchronous:
                Teacher picks variable time interval long enough for current
                students to grade current set of problems. Everyone passes
                psets at end of interval.

          Locally Timed, Synchronous:
                Students raise hands when they finish grading current problem.
                Teacher checks every 10 secs; when all hands are raised,
                everyone passes psets to the row behind. Variant: students can
                pass when all students in a “column” have hands raised.

          Locally Timed, Asynchronous:
                Students grade current problem, wait for student in next row to
                be free, and then pass the pset back.

                       Control Structure Taxonomy
          Globally Timed, Synchronous:
                Centralized clocked FSM generates all control signals.
                Easy to design but fixed-sized interval can be wasteful (no
                data-dependencies in timing).

          Globally Timed, Asynchronous:
                Central control unit tailors current time slice to current
                tasks.
                Large systems lead to very complicated timing generators…
                just say no!

          Locally Timed, Synchronous:
                Start and Finish signals generated by each major subsystem,
                synchronously with global clock.
                The best way to build large systems that have independent
                components.

          Locally Timed, Asynchronous:
                Each subsystem takes asynchronous Start, generates asynchronous
                Finish (perhaps using local clock).
                The “next big idea” for the last several decades: a lot of
                design work to do in general, but extra work is worth it in
                special cases.

                        Review of CPU Performance

                           MIPS = Freq / CPI

                           MIPS = Millions of Instructions/Second
                           Freq = Clock Frequency, MHz
                           CPI  = Clocks per Instruction

             To Increase MIPS:
                       1. DECREASE CPI.
                        - RISC simplicity reduces CPI to 1.0.
                        - CPI below 1.0? State-of-the-art multiple instruction issue
                       2. INCREASE Freq.
                        - Freq limited by delay along longest combinational path; hence
                        - PIPELINING is the key to improving performance.
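
As a quick numeric check of MIPS = Freq / CPI, here is a sketch with illustrative numbers (the clock periods below are examples, not measurements; the helper names are made up):

```python
def freq_mhz_from_period_ns(period_ns):
    """Clock frequency in MHz for a clock period given in nanoseconds."""
    return 1000 / period_ns

def mips(freq_mhz, cpi):
    """Millions of instructions per second."""
    return freq_mhz / cpi

slow_clock = mips(freq_mhz_from_period_ns(20), cpi=1.0)  # 50 MHz, CPI 1.0
fast_clock = mips(freq_mhz_from_period_ns(10), cpi=1.0)  # 100 MHz, CPI 1.0
fast_but_2cpi = mips(freq_mhz_from_period_ns(10), cpi=2.0)

print(slow_clock, fast_clock, fast_but_2cpi)  # 50.0 100.0 50.0
```

Halving the clock period doubles MIPS only if CPI stays put. That is the point of pipelining: shorten the longest combinational path while keeping CPI near 1.0.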


                       miniMIPS Timing

          [Figure: miniMIPS data flow between two CLK edges: New PC, PC+4,
          Fetch Inst., Control Logic, Read Regs, Sign Extend, +OFFSET,
          ASEL mux, BSEL mux, ALU, Fetch data, PCSEL mux, WASEL mux,
          WDSEL mux, PC setup, RF setup, Mem setup]

          The diagram illustrates the Data Flow of miniMIPS.
          Wanted: longest path

          Complications:
                • some apparent paths aren’t “possible”
                • functional units have variable execution times (e.g., ALU)
                • time axis is not to scale (e.g., tPD,MEM is very big!)


                       Where Are the Bottlenecks?
          [Figure: miniMIPS datapath: PC with +4 adder; PCSEL mux (inputs
          include 0x80000000, 0x80000040, 0x80000080, PC<31:29>:J<25:0>:00,
          JT, BT); Instruction Memory; Register File with WASEL mux; Control
          Logic; SEXT; ASEL and BSEL muxes; ALU; Data Memory; WDSEL mux]

          Pipelining goal:
                Break LONG combinational paths
                memories, ALU in separate stages




                Ultimate Goal: 5-Stage Pipeline

                  GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to
                   barely include slowest components (mems, regfile, ALU)
                  APPROACH: structure processor as 5-stage pipeline:

               IF:     Instruction Fetch stage: Maintains PC, fetches one
                       instruction per cycle and passes it to
               ID/RF:  Instruction Decode/Register File stage: Decode control
                       lines and select source operands
               ALU:    ALU stage: Performs specified operation, passes result to
               MEM:    Memory stage: If it’s a lw, use ALU result as an address,
                       pass mem data (or ALU result if not lw) to
               WB:     Write-Back stage: writes result back into register file.

                                   miniMIPS Timing
          Different instructions use various parts of the data path.

          [Figure: timing of the program below, in execution order, against CLK]

               Instruction          Time
               add $4, $5, $6       14 nS
               beq $1, $2, 40       14 nS
               lw $3, 30($0)        20 nS
               jal 20000             9 nS
               sw $2, 20($4)        19 nS

          Legend:
               6 nS   Instruction Fetch
               2 nS   Instruction Decode
               2 nS   Register Prop Delay
               5 nS   ALU Operation
               4 nS   Branch Target
               6 nS   Data Access
               1 nS   Register Setup

          This is an example of an “Asynchronous Globally-Timed” control
          strategy (see Lecture 18). Such a system would vary the clock period
          based on the instruction being executed. This leads to complicated
          timing generation, and, in the end, slower systems, since it is not
          very compatible with pipelining!

                              Uniform miniMIPS Timing
          With a fixed clock period, we have to allow for the worst case.

          [Figure: timing of the program below, in execution order, against
          CLK: 1 instr EVERY 20 nS]

               add $4, $5, $6
               beq $1, $2, 40
               lw $3, 30($0)
               jal 20000
               sw $2, 20($4)

          Legend:
               6 nS   Instruction Fetch
               2 nS   Instruction Decode
               2 nS   Register Prop Delay
               5 nS   ALU Operation
               4 nS   Branch Target
               6 nS   Data Access
               1 nS   Register Setup

          By accounting for the “worst case” path (i.e. allowing time for each
          possible combination of operations) we can implement a “Synchronous
          Globally-Timed” control strategy. This simplifies timing generation,
          enforces a uniform processing order, and allows for pipelining!

          Isn’t the net effect just a slower CPU?
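
The per-instruction times (14, 14, 20, 9, 19 nS) and the 20 nS fixed period can be rebuilt from the legend. The sketch below rests on assumptions the slides do not spell out: instruction decode overlaps the register read, the branch-target adder runs alongside the ALU, and the path composition for each instruction is chosen so the totals match the slide.

```python
# Component delays in nS, copied from the legend.
FETCH, DECODE, REG_PROP, ALU = 6, 2, 2, 5
BRANCH_TARGET, DATA_ACCESS, REG_SETUP = 4, 6, 1

# Assumptions: decode and register read happen in parallel, and so do the
# ALU operation and the branch-target computation.
DECODE_READ = max(DECODE, REG_PROP)
ALU_OR_BRANCH = max(ALU, BRANCH_TARGET)

PATH_NS = {
    "add": FETCH + DECODE_READ + ALU + REG_SETUP,
    "beq": FETCH + DECODE_READ + ALU_OR_BRANCH + REG_SETUP,
    "lw": FETCH + DECODE_READ + ALU + DATA_ACCESS + REG_SETUP,
    "jal": FETCH + DECODE_READ + REG_SETUP,
    "sw": FETCH + DECODE_READ + ALU + DATA_ACCESS,
}

program = ["add", "beq", "lw", "jal", "sw"]
variable_total = sum(PATH_NS[op] for op in program)  # clock varies per instr
fixed_period = max(PATH_NS.values())                 # worst case sets the clock
fixed_total = fixed_period * len(program)

print(PATH_NS)         # {'add': 14, 'beq': 14, 'lw': 20, 'jal': 9, 'sw': 19}
print(variable_total)  # 76
print(fixed_period)    # 20
print(fixed_total)     # 100
```

For this five-instruction program the fixed clock costs 100 nS against 76 nS, which is the "slower CPU" worry on the slide. The payoff is that a uniform clock makes pipelining possible.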


                           Step 1: A 2-Stage Pipeline
          [Figure: miniMIPS datapath split into two stages, IF and EXE.
          Pipeline registers PC^EXE and IR^EXE sit between the Instruction
          Memory (IF) and the Register File, Control Logic, ALU, and Data
          Memory (EXE).]

          IR stands for “Instruction Register”. The superscript “EXE” denotes
          the pipeline stage in which the PC and IR are used.




                                    2-Stage Pipe Timing
          Improves performance by increasing instruction throughput.
             Ideal speedup is the number of stages in the pipeline.

      [Timing diagram: program execution order vs. time (CLK) for the instructions below.]
         add $4, $5, $6
         beq $1, $2, 40
         lw $3, 30($0)
         jal 20000
         sw $2, 20($4)
      During this, and all subsequent clocks, two instructions are in various stages of execution.

      Component delays:
         6 nS   Instruction Fetch
         2 nS   Instruction Decode
         2 nS   Register Prop Delay
         5 nS   ALU Operation
         4 nS   Branch Target
         6 nS   Data Access
         1 nS   Register Setup

      By partitioning each instruction cycle into a “fetch” stage and an “execute” stage, we
      get a simple pipeline. Why not include the Instruction-Decode/Register-Access time
      with the Instruction Fetch? You could. But this partitioning allows for a useful variant
      with 2-cycle loads and stores.

      Latency?     2 clock periods = 2*14 nS
      Throughput?  1 instr per 14 nS
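
The latency and throughput figures above can be checked with a few lines of Python. This is only a sketch: the stage count and the 14 nS clock period come from the slide, while the names and the helper `time_to_finish_ns` are hypothetical and assume an ideal pipeline with no stalls.

```python
# Values taken from the slide: 2 pipeline stages, 14 nS clock period.
STAGES = 2
CLOCK_PERIOD_NS = 14

# Latency: one instruction passes through every stage, one clock per stage.
latency_ns = STAGES * CLOCK_PERIOD_NS

# Throughput: once the pipe is full, one instruction completes per clock.
throughput_per_ns = 1 / CLOCK_PERIOD_NS


def time_to_finish_ns(n_instructions):
    """Ideal time for n instructions: fill the pipe, then one finishes per clock.

    Assumes no stalls (hypothetical helper, not from the slides).
    """
    return (STAGES + n_instructions - 1) * CLOCK_PERIOD_NS


print(latency_ns)            # 2 * 14 = 28
print(time_to_finish_ns(5))  # (2 + 5 - 1) * 14 = 84
```

For long instruction sequences the fill time becomes negligible, which is why throughput rather than latency determines the speedup.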


         2-Stage w/2-Cycle Loads & Stores
         Further improves performance, with slight increase in control
              complexity. Some 1st generation (pre-cache) RISC processors used
              this approach.

      [Timing diagram: program execution order vs. time (CLK) for the instructions below.]
         add $4, $5, $6
         beq $1, $2, 40
         lw $3, 30($0)
         jal 20000
         sw $2, 20($4)

      This design is very similar to the multicycle CPU described in section 5.5 of
      the text, but with pipelining.

      Component delays:
         6 nS   Instruction Fetch
         2 nS   Instruction Decode
         2 nS   Register Prop Delay
         5 nS   ALU Operation
         4 nS   Branch Target
         6 nS   Data Access
         1 nS   Register Setup

      Clock: 8 nS!

      The clock rate of this variant is over twice that of our original design. Does that
      mean it is that much faster?
      Not likely. In practice, as many as 30% of instructions access memory. Thus, the
      effective speed up is:

      \[
      \text{speedup} = \frac{\text{old clock period}}{\text{new clock period} \times (0.7 + 2 \times 0.3)}
                     = \frac{20}{8 \times 1.3} \approx 1.923
      \]
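
The effective-speedup arithmetic can be reproduced as a small Python sketch. The 20 nS old clock period, the 8 nS new clock period, and the 30% memory-access fraction are the values used on the slide. The function name and parameters are hypothetical, and the model assumes every load or store takes exactly two cycles and every other instruction takes one.

```python
def effective_speedup(old_period_ns, new_period_ns, mem_fraction, mem_cycles=2):
    """Speedup of the 2-cycle load/store variant over the original design.

    Average cycles per instruction = (1 - mem_fraction) * 1 + mem_fraction * mem_cycles.
    """
    avg_cycles = (1 - mem_fraction) + mem_cycles * mem_fraction
    return old_period_ns / (new_period_ns * avg_cycles)


# Slide values: 20 nS original clock, 8 nS new clock, 30% loads/stores.
speedup = effective_speedup(old_period_ns=20, new_period_ns=8, mem_fraction=0.3)
print(round(speedup, 3))  # 20 / (8 * 1.3), approximately 1.923
```

Setting `mem_fraction` to 0 gives the best case of 20/8 = 2.5, so the memory-access mix is what pulls the speedup below the raw clock-rate ratio.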
                          2-Stage Pipelined Operation

         Consider a sequence of instructions:
                                                           ...
                                                           addi    $t2,$t1,1
                                                           xor     $t2,$t1,$t2
                                                           sltiu   $t3,$t2,1
                                                           srl     $t2,$t2,1
                                                           ...
         Recall “Pipeline Diagrams” from an earlier slide.

         Executed on our 2-stage pipeline:

                                  TIME (cycles)
         Pipeline    i       i+1     i+2     i+3     i+4     i+5     i+6
            IF       addi    xor     sltiu   srl     ...
            EXE              addi    xor     sltiu   srl     ...

         It can’t be this easy!?
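
The pipeline diagram above follows a simple rule: the instruction fetched in cycle k reaches stage number d in cycle k + d. A minimal Python sketch of that rule is shown here. The function `pipeline_diagram` is a hypothetical helper, and it assumes an ideal pipeline with no stalls and only the four listed instructions (the surrounding "..." instructions are omitted).

```python
def pipeline_diagram(instructions, stages):
    """Return {stage: [instruction or None for each cycle]} for an ideal pipeline."""
    n_cycles = len(instructions) + len(stages) - 1
    rows = {}
    for depth, stage in enumerate(stages):
        row = [None] * n_cycles
        for k, instr in enumerate(instructions):
            # Instruction k enters stage `depth` exactly `depth` cycles after fetch.
            row[k + depth] = instr
        rows[stage] = row
    return rows


program = ["addi", "xor", "sltiu", "srl"]
diagram = pipeline_diagram(program, ["IF", "EXE"])
for stage, row in diagram.items():
    print(f"{stage:>4}", " ".join(f"{(op or ''):<6}" for op in row))
```

Each row is the previous row shifted right by one cycle, which is the staircase pattern in the slide.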



                      Step 2: 4-Stage miniMIPS

   [Figure: 4-stage miniMIPS datapath. Labels recovered from the diagram, by stage:
      Instruction Fetch: PCSEL mux (inputs 6..0, including 0x80000000, 0x80000040, 0x80000080,
         PC<31:29>:J<25:0>:00, JT, BT); PC; +4; Instruction Memory (A, D); registers PCREG and IRREG.
      Register File: Register File read ports (RA1 = Rs:<25:21>, RA2 = Rt:<20:16>, RD1, RD2);
         J:<25:0>; Imm:<15:0> with SEXT; x4 and + producing BT; JT; = producing BZ;
         ASEL and BSEL muxes (shamt:<10:6>, “16”); registers PCALU, IRALU, A, B, WDALU.
      ALU: ALU (inputs A and B, ALUFN, flags N V C Z); registers PCMEM, IRMEM, Y, WDMEM.
      Write Back: Data Memory (Adr, WD, RD, R/W, Wr); WDSEL mux (inputs 0..2, including PC+4);
         WASEL mux (inputs 0..3: Rt:<20:16>, Rd:<15:11>, “31”, “27”);
         Register File write port (WA, WD, WE driven by WERF). NB: SAME RF AS ABOVE!]

   Treats register file as two separate devices: combinational READ, clocked WRITE at end of pipe.

   What other information do we have to pass down pipeline?
      PC (return addresses)
      instruction fields (decoding)

   What sort of improvement should we expect in cycle time?

                       4-Stage miniMIPS Operation

         Consider a sequence of instructions:
                                                            ...
                                                            addi    $t0,$t0,1
                                                            sll     $t1,$t1,2
                                                            andi    $t2,$t2,15
                                                            sub     $t3,$0,$t3
                                                            ...
         Executed on our 4-stage pipeline:

                                  TIME (cycles)
         Pipeline    i       i+1     i+2     i+3     i+4     i+5     i+6
            IF       addi    sll     andi    sub     ...
            RF               addi    sll     andi    sub     ...
            ALU                      addi    sll     andi    sub     ...
            WB                               addi    sll     andi    sub
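
Reading the diagram column by column shows what each stage holds in a given cycle. The sketch below computes one such column in Python. The names `STAGES`, `PROGRAM`, and `stage_contents` are hypothetical, and the model assumes no stalls and only the four listed instructions (the surrounding "..." instructions are omitted).

```python
STAGES = ["IF", "RF", "ALU", "WB"]
PROGRAM = ["addi", "sll", "andi", "sub"]


def stage_contents(cycle):
    """Map each stage to the instruction it holds `cycle` cycles after cycle i."""
    contents = {}
    for depth, stage in enumerate(STAGES):
        # A stage at depth d holds the instruction fetched d cycles earlier.
        index = cycle - depth
        contents[stage] = PROGRAM[index] if 0 <= index < len(PROGRAM) else None
    return contents


# Cycles needed to drain the pipe: fill (4 stages) plus one per extra instruction.
total_cycles = len(PROGRAM) + len(STAGES) - 1
for cycle in range(total_cycles):
    print(f"i+{cycle}", stage_contents(cycle))
```

At cycle i+3 all four stages are busy with four different instructions, which is the column in the diagram where the pipeline is full.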

