Computer Science 152
Report – Final
15 May 2003
Jack Kang, Benjamin Lee, David Lee, Lyle Takacs

Contents

Abstract
Division of Labor
Strategy
         Superscalar Processor
                  Superscalar Datapath [Design | Test]
                  Instruction Cache Block [Design | Test]
                  Issue Unit [Design | Test]
                  Forwarding [Design | Test]
         Branch Prediction
                  Branch Target Buffer [Design | Test]
                  Branch History Table [Design | Test]
                  Static Branch Prediction [Design]
         General Testing
Results
Conclusion

Note: Appendices included in supplemental files. All hyperlinks reference these files.
Appendix I – Notebooks
        Jack Kang
        Benjamin Lee
        David Lee
        Lyle Takacs
Appendix II – Schematics
        Top Pipeline Schematic
        Bottom Pipeline Schematic
        Cache Block Schematic
        Decode Block Schematic
        Execute Block Schematic
        Instruction Cache Block Schematic
        Memory Block Schematic
        Superscalar Datapath (Lower Right) Schematic
        Superscalar Datapath (Upper Left) Schematic
        Superscalar Datapath (Lower Left) Schematic
        Superscalar Datapath (Upper Right) Schematic
        Superscalar Datapath (Overall) Schematic
Appendix III – Verilog
Appendix IV – Testing
        Issue Unit
        Forwarding
        Branch Prediction
        General

Abstract

The purpose of this lab is to enhance the 5-stage pipelined processor from previous lab work with a major sub-project and
one or more minor sub-projects as stated in the lab specifications. This team’s selection of projects included a 2-way
superscalar processor with two pipelines, one instruction stream, and one cache. The team also implemented dynamic
branch prediction with a branch target buffer and a branch history table.
The performance gains achieved by a superscalar processor result from the ability to exploit instruction level parallelism and
achieve a greater instruction throughput than would be possible in our previous implementations of our pipelined processor.
The ability to execute instructions in parallel is limited by the ability to issue instructions in parallel. A significant design
issue and functional component of our superscalar processor is an issue unit that checks dependencies between instructions
and determines the number and sequence of instructions to be executed on the two pipelines. In addition to the
implementation of a dual issue fetch/decode unit, the two pipelines must also allow for forwarding within each pipeline as
well as forwarding between pipelines.

The number of delay slots increases with a superscalar implementation, compounding the effects of data and control hazards.
This provides the motivation for our minor sub-project. Branch prediction is intended to reduce the performance impact of
the extra delay slots. The branch target buffer stores the target addresses of past branch instructions and makes the
target address available to the PC in the fetch stage. The branch history table contains the taken history of branches
and makes the taken signal available to the PC source mux in the fetch stage. By adding these buffers, both the number of
delay slots and the performance penalty due to control hazards are reduced.

Static branch prediction is also enabled as a result of a stream buffer supplying instructions to the issue unit. In the event that
a branch is issued, the instructions in the buffer will continue to be supplied to the issue unit. In the event that a branch is
taken, the stream buffer will be flushed. The entire scheme represents an implementation of static branch prediction such that
a branch is always predicted not taken and the stream buffer is flushed when a branch is taken.

The final result of this lab is a five-stage two-way superscalar processor implemented as a Harvard architecture. The design
of the pipeline stages was derived from our past implementations of these stages. These stages have been grouped into
schematic blocks so that the final superscalar processor could be assembled in schematic. The new forwarding, control
signals, and logic for dual issue were implemented in Verilog. The design process included dependency checking and
forwarding as well as dynamic branch prediction.

Division of Labor

Jack Kang
             Design and implementation of the dual issue unit, including the FIFO, instruction cache, and issuer.
             General testing and debugging in simulation and board.

Benjamin Lee
        Design and implementation of branch history table and branch target buffer
        Datapath assembly
        General testing and debugging in simulation

David Lee
             Design and implementation of the dual issue unit, including the FIFO, instruction cache, and issuer
             General testing and debugging in simulation and board

Lyle Takacs
             Design and implementation of forwarding unit.
             Modifications to the monitor module.
             Datapath assembly.
             General testing and debugging in simulation.


Strategy

Part I – Superscalar Processor

         A. Superscalar Datapath
Superscalar Datapath (Lower Right) Schematic
Superscalar Datapath (Upper Left) Schematic
Superscalar Datapath (Lower Left) Schematic
Superscalar Datapath (Upper Right) Schematic
Superscalar Datapath (Overall) Schematic

Design – Superscalar Datapath
The superscalar datapath requires the duplication of the execution stage in the pipeline. Specifically, the
processor has one instruction fetch stage, one instruction decode stage, two execute stages, one memory stage
and one write back stage. Although instructions can execute in parallel, only one memory stage exists which
means that only one of the two instructions issued in parallel may access memory. The corresponding stage in
the other pipeline has no functionality but must still propagate its data and control signals through a “memory”
stage in order to keep the same number of stages in each pipeline. Since only one pipeline has a
memory stage, there is still only one DRAM and two caches (instruction and data), and no modifications to the
memory interface were necessary.

The final high level schematic includes all five stages of the pipeline modularized into several larger inclusive
blocks:

    Decode Stage Verilog

    Decode Stage
    The decode stage module is the symbol corresponding to the Verilog instantiations of a FIFO queue and the
    issue unit. The issue unit checks the dependencies between instructions prior to issuing them to the
    pipelines (a detailed discussion of dual issue follows). The FIFO queue buffers fetched instructions
    so that the issue unit always has instructions available to issue. The queue hides from the pipelines the
    fact that instructions may be stalled and/or swapped by the issue unit, so that the pipelines see a
    continuous stream of decoded instructions. The decode stage also interfaces with the
    RegFile in order to decode the instruction and fetch the corresponding operands, and with the instruction cache
    block to receive the instructions from memory. The instruction cache block has been modified to
    fetch enough instructions to keep the issue unit supplied (a detailed discussion of the modified instruction
    cache block follows).

    Decode Block
    Execute Block

    Decode Block
    In prior implementations of our five-stage pipelined processor, a portion of the execution of R-type
    instructions occurred in the decode stage. Specifically, a branch is resolved and its address calculated in
    this stage. A shift instruction is also executed in this stage. The superscalar implementation of our
    processor required a separation of the decoding portion of this stage (in the Decode Stage block) and the
    executing portion of this stage (in the Decode Block block). The Decode Block contains an extender and a
    shifter that was originally used for branch calculations but is now used to support the shift operation in the
    next pipeline stage. It also contains forwarding muxes for the decode stage such that values can be
    forwarded and used for branch target calculations. The actual branch calculations now occur in the issue
    unit and these forwarded values are passed into the issue unit accordingly.

    Execute Block
    The execute block encapsulates the ALU, the forwarding muxes for ALU operands, the SLT unit, and the
    Shifter unit. This block is included in both the top and bottom pipeline.

    Memory Block Schematic

    Memory Block
    The memory block encapsulates memory-mapped I/O, the data cache, and the associated read and write
    logic.
Top Pipeline
Bottom Pipeline

Pipeline Top
The top pipeline contains the three remaining pipeline registers (ID/EX, EX/MEM, MEM/WB). Between
the ID/EX and EX/MEM registers is the execute block. Between the EX/MEM and MEM/WB registers is
a single mux that combines the results from the execute block into a single value to forward into the decode
and execute stage.

Pipeline Bottom
The bottom pipeline contains the three remaining pipeline registers (ID/EX, EX/MEM, MEM/WB).
Between the ID/EX and EX/MEM registers is the execute block and two muxes taking in the original
register values and the next PC as seen by a jump register instruction. These muxes provide support for a
jump register instruction followed by an instruction that uses $R31 (e.g. jr loop, addiu $31, $31, 4).

The memory block is located between the EX/MEM and MEM/WB registers. It is also supported by a new
multiplexor that takes the executed values from the top pipeline and uses them as input to memory. This
allows for an instruction to execute and write a value into the register file in the top pipeline and a store of
the same value to occur in the bottom pipeline (e.g. addu $1, $2, $3 followed by sw $1, 0($0)).

Forward Verilog
Forward Schematic

Forward
The forwarding unit is positioned between the two pipelines, taking inputs from various pipeline stages to
check for dependencies (a detailed discussion of the forwarding unit follows). It also outputs the select
signals to both pipelines to control the forwarding in various stages.

Branch Translation Buffer Verilog
Branch History Table Verilog

Branch Target Buffer
The branch target buffer connects to the decode stage block to provide the PC with a branch address in the
fetch stage before the actual address is calculated in the decode stage (a detailed discussion of the branch
target buffer follows). The target buffer holds the last target address of the same branch. The buffer is
direct mapped and will replace entries upon a conflict miss. Our team implemented 32-entry, 256-entry,
and 2048-entry buffers. We decided against the 2048-entry buffer, since such a large module
would take a significant amount of time to map to the board.
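
The direct-mapped lookup and replacement described above can be sketched as a small behavioral model in Python. This is only an illustration, not the team's Verilog: the class name, the use of the word-aligned PC for indexing, and the presence of a tag and valid bit per entry are all assumptions made for the sketch.

```python
class BranchTargetBuffer:
    """Behavioral sketch of a direct-mapped branch target buffer (assumed layout)."""

    def __init__(self, entries=256):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.entries = entries
        self.index_bits = entries.bit_length() - 1
        self.valid = [False] * entries
        self.tags = [0] * entries
        self.targets = [0] * entries

    def _split(self, pc):
        word = pc >> 2  # assumption: instructions are word aligned
        return word & (self.entries - 1), word >> self.index_bits

    def lookup(self, pc):
        """Return the last recorded target for pc, or None on a miss."""
        index, tag = self._split(pc)
        if self.valid[index] and self.tags[index] == tag:
            return self.targets[index]
        return None

    def update(self, pc, target):
        """Record the latest target; a conflicting entry is simply replaced."""
        index, tag = self._split(pc)
        self.valid[index] = True
        self.tags[index] = tag
        self.targets[index] = target
```

With a 32-entry buffer, two branches whose word addresses differ by a multiple of 32 map to the same entry, so updating one evicts the other.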

Branch History Table
The branch history table connects to the decode stage to provide the taken signal to the PC source
multiplexer in the fetch stage before the actual comparator generates the true taken signal in the decode
stage (a detailed discussion of the branch history table follows). The history table is a 2-bit predictor and
employs hysteresis to improve prediction accuracy. Our team implemented 32-entry, 256-entry, and
2048-entry tables. Again, we decided against the 2048-entry table, since such a large module would take
a significant amount of time to map to the board.
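
As an illustration of a 2-bit predictor with hysteresis, the sketch below uses the common saturating-counter scheme in Python. The report does not give the exact state encoding, the reset state, or the indexing, so the counter values (0 to 3, predict taken at 2 or above), the initial value, and the indexing by word address are assumptions.

```python
class BranchHistoryTable:
    """Behavioral sketch of a 2-bit saturating-counter predictor (assumed encoding)."""

    def __init__(self, entries=256, initial=1):
        self.entries = entries
        # 0 = strongly not taken, 1 = weakly not taken,
        # 2 = weakly taken,       3 = strongly taken
        self.counters = [initial] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumption: indexed by word address

    def predict(self, pc):
        """Return True if the branch at pc is predicted taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Move the counter one step toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

The hysteresis shows up at a loop exit: once a branch is strongly taken, a single not-taken outcome moves it only to weakly taken, so the next encounter is still predicted taken.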

Register File
The regfile was changed considerably. There had to be two sets of Ra and Rb outputs, one for the top
pipeline and one for the bottom pipeline. Correspondingly, there were two sets of inputs selecting which
registers to read. During writes, if both pipelines tried writing to the same register, the bottom pipeline
would take precedence, as the bottom pipeline contained the later instruction. Additionally, there was a
special write input port for register $31 and an output port for register $31. This was necessary so we could
run the jal instruction in the decode stage and immediately write into register 31. Otherwise, the
forwarding unit would have to go through the issuer in order to take care of instructions
immediately after the jal that modify register 31 (since the address is calculated in decode). This created a
WAW hazard: if another instruction that modifies register 31 was already in the pipeline, it
would overwrite this value. We fixed this by disabling pipeline writes to register 31 for 2 cycles after a jal.
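
The write rules above can be sketched as a small Python model. It is a simplification under stated assumptions: the method names are hypothetical, $0 is treated as hardwired to zero as in MIPS, and the 2-cycle lockout is assumed to cover the cycle in which jal writes and the one after it.

```python
class DualWriteRegFile:
    """Behavioral sketch of the two-write-port register file (assumed timing)."""

    def __init__(self):
        self.regs = [0] * 32
        self.ra_lockout = 0  # cycles left in which pipeline writes to $31 are ignored

    def jal_write(self, return_address):
        """Dedicated $31 port: jal writes its return address from decode."""
        self.regs[31] = return_address
        self.ra_lockout = 2

    def clock(self, top=None, bottom=None):
        """Apply one cycle of write-back; each argument is (reg, value) or None."""
        for write in (top, bottom):  # bottom is applied last, so it wins on a tie
            if write is None:
                continue
            reg, value = write
            if reg == 0:
                continue  # assumption: $0 is hardwired to zero
            if reg == 31 and self.ra_lockout > 0:
                continue  # an older write must not clobber jal's $31
            self.regs[reg] = value
        if self.ra_lockout > 0:
            self.ra_lockout -= 1
```

For example, if both pipelines write register 5 in the same cycle, the bottom pipeline's value is the one kept, and a pipeline write to $31 in either of the two cycles after `jal_write` is dropped.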


    Testing – Datapath
    Testing of the superscalar datapath was only possible after the issue unit was attached to the two pipelines.
    These two components were thoroughly tested with various small assembly files to test specific cases in
    conjunction with more comprehensive tests. A detailed discussion of datapath testing may be found in the part
    on general testing (III).


B. Instruction Cache Block

    Instruction Cache Block Schematic
    Instruction Cache Controller Verilog
    Two Way Cache Verilog

    Design – Instruction Cache Block
    Because the superscalar processor must issue 2 instructions per cycle, the instruction cache was changed to have
    a 128-bit output, or 4 instructions per cycle. The cacheline was striped across four 32-bit SRAM blocks so that
    upon reading, each instruction could be accessed simultaneously. The instruction cache fetches 4 instructions
    each time because this keeps the processor busy while the cache is fetching new instructions. It also allows
    the issue unit to look ahead in the instruction stream: if it sees that a branch or jump is coming and its
    delay slot is not in the current block, it can begin fetching the block that contains the delay slot.

    The instruction cache is now direct-mapped as opposed to the two-way set associative cache in Lab6 because in
    general the instruction stream is sequential so having a two-way set associative cache would not give us much
    benefit. Rather, it would complicate the cache because we would need to have muxes that take in 256 bits and
    output 128 bits and slow down the cache.
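
    To make the striping concrete, the following Python sketch splits a byte address into a tag, a cache line, and the SRAM bank that holds the word, and checks whether a branch's delay slot falls in the next block. The number of cache lines is not stated in this report, so `num_lines=64` is a hypothetical value, as are the function names.

```python
WORDS_PER_BLOCK = 4  # 128-bit line = four 32-bit instructions, one per SRAM bank


def split_address(pc, num_lines=64):
    """Return (tag, line, bank) for a byte address in a direct-mapped cache.

    num_lines is a hypothetical size chosen for illustration.
    """
    word = pc >> 2                    # word-aligned instruction index
    bank = word % WORDS_PER_BLOCK     # which of the four SRAM blocks holds it
    block = word // WORDS_PER_BLOCK   # 128-bit block number
    return block // num_lines, block % num_lines, bank


def delay_slot_in_next_block(pc):
    """True if the instruction after pc lies in the following 128-bit block."""
    return (pc >> 2) % WORDS_PER_BLOCK == WORDS_PER_BLOCK - 1
```

    A branch at byte address 0x1C is the last word of its block, so its delay slot at 0x20 has to come from the next block.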

    Testing – Instruction Cache Block
    To test the instruction cache, we loaded up the instruction cache with blocks of instructions by telling the
    instruction cache to load each instruction individually. Then we tried to read from the cache by giving it an
    address and reading the block. To make sure that no two blocks overlapped we loaded several blocks next to
    each other and read out the final block values.


C. Issue Unit

    Issue Unit Verilog
    FIFO Verilog

    Design – Issue Unit
    The issuer is broken into 2 pieces: a FIFO buffer and an issue unit. The FIFO unit can store up to 8 instructions
    (2 blocks of instructions), which it grabs from the instruction cache. It stores 2 blocks of
    instructions because if a branch or jump is the last instruction of a block, then its delay slot must be in the next
    block. By having 2 blocks in the FIFO, we can be sure that when the branch or jump instruction reaches the top
    of the buffer, the delay slot will have been fetched. The top 2 instructions in the FIFO are visible to the
    issue unit at all times. The issue unit analyzes these 2 instructions and, upon deciding to issue zero, one, or two
    instructions to the rest of the datapath, removes the same number of instructions from the top of
    the queue.

    The FIFO is implemented as 8 registers with 2 pointers. One pointer points to the current top entry in the FIFO.
    This and the next entry are the visible entries to the issue unit. The other pointer keeps track of the next empty
    slot. This slot, plus the 3 slots immediately after it, are the 4 slots that are filled when 128 bits are read from the
    cache. Finally, there is a counter that keeps track of the number of valid entries in the FIFO. When this drops to
a certain level, then the FIFO unit will grab more instructions from the cache. The command to grab more
instructions comes from the issue unit, which is also responsible for sending the proper pc to the instruction
cache.
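
The pointer and counter bookkeeping described above can be sketched in Python as a circular buffer. This is a behavioral illustration only: the method names are hypothetical, and because the report does not state the refill threshold, the sketch assumes a refill is allowed whenever four slots are free.

```python
class InstructionFifo:
    """Behavioral sketch of the 8-entry instruction FIFO (assumed refill rule)."""

    DEPTH = 8   # two blocks of instructions
    BLOCK = 4   # instructions delivered per 128-bit cache read

    def __init__(self):
        self.slots = [None] * self.DEPTH
        self.head = 0    # current top entry, visible to the issue unit
        self.tail = 0    # next empty slot
        self.count = 0   # number of valid entries

    def can_fill(self):
        """Assumption: refill whenever a whole block fits."""
        return self.count <= self.DEPTH - self.BLOCK

    def fill(self, block):
        """Write four instructions into the next four empty slots."""
        assert len(block) == self.BLOCK and self.can_fill()
        for instr in block:
            self.slots[self.tail] = instr
            self.tail = (self.tail + 1) % self.DEPTH
        self.count += self.BLOCK

    def peek(self):
        """Return the two entries visible to the issue unit (None if absent)."""
        first = self.slots[self.head] if self.count >= 1 else None
        second = self.slots[(self.head + 1) % self.DEPTH] if self.count >= 2 else None
        return first, second

    def pop(self, n):
        """Remove zero, one, or two instructions from the top of the queue."""
        assert 0 <= n <= 2 and n <= self.count
        self.head = (self.head + n) % self.DEPTH
        self.count -= n
```

Both pointers wrap modulo 8, so a block written at the end of the array continues at slot 0 without any copying.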

The issue unit is basically a controller that grabs two instructions from the top of the FIFO and analyzes them.
There are a few cases worth noting:

         • If there are no structural hazards or data forwarding hazards between the two instructions, then we can
           issue 2 instructions, with the first instruction in the top pipeline and the second in the bottom pipeline.
         • Because we only have one memory stage, memory instructions (lw and sw) must be issued in the
           bottom pipeline, which has the memory stage. If we get two memory instructions in a row, we can only
           issue one at a time. If, however, the memory instruction comes with a 2nd instruction that has no
           dependency on the memory instruction, we can swap them so that the later instruction executes in the
           top pipeline and the memory instruction goes into the bottom pipeline. There was a small optimization
           done here in the case of two load instructions loading to the same register. If we saw that case, we
           would only issue the 2nd instruction and discard the first.
         • A branch or jump must be issued in the top pipeline with its delay slot instruction in the bottom
           pipeline. This is so that we can preserve the number of delay slots after a branch or jump instruction.
           No matter where the branch or jump is located in a block of instructions, the number of delay slots
           after it is a constant of one slot.
         • If there are any dependencies between the 2 instructions, only one instruction is issued; otherwise 2
           instructions are issued. A signal is sent back to the FIFO to remove zero, one, or two instructions and
           output the next pair of instructions.
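
The cases above can be summarized as a decision function. The Python sketch below is a simplified model, not the team's issue unit: the `Instr` record and its `kind` field are hypothetical, every register dependency (read-after-write or write-after-write) is conservatively treated as blocking, and the duplicate-load optimization and the arithmetic-to-store pairing handled by forwarding are not modeled.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical decoded-instruction record used only for this sketch."""
    name: str
    kind: str                       # "alu", "mem" (lw/sw), or "ctrl" (branch/jump)
    dest: Optional[int] = None      # register written, if any
    srcs: Tuple[int, ...] = ()      # registers read


def depends(first, second):
    """True if second reads or rewrites the register that first writes."""
    if first.dest is None or first.dest == 0:
        return False
    return first.dest in second.srcs or first.dest == second.dest


def issue(first, second):
    """Return (top, bottom, removed) for the two entries visible in the FIFO."""
    if first is None:
        return None, None, 0
    if first.kind == "ctrl":
        if second is None:
            return None, None, 0    # wait until the delay slot has been fetched
        return first, second, 2     # branch/jump on top, delay slot on bottom
    # Issuing first alone: memory instructions must use the bottom pipeline.
    alone = (None, first, 1) if first.kind == "mem" else (first, None, 1)
    if second is None or second.kind == "ctrl":
        return alone                # a branch/jump waits to pair with its delay slot
    if depends(first, second):
        return alone
    if first.kind == "mem" and second.kind == "mem":
        return alone                # only one memory stage
    if first.kind == "mem":
        return second, first, 2     # swap so the memory instruction goes to bottom
    return first, second, 2
```

For instance, a lw followed by an independent subu is issued as a swapped pair, while a lw followed by a sw issues only the lw.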

Testing – Issue Unit
In order to test the Issuer (issue unit and FIFO), a dummy instruction cache was created that simulated
the instruction cache previously described. We could not use the real instruction cache because that would require
the whole processor to be put together first, which would make testing the issuer in isolation impossible. The fake
cache allowed us to load our test code and check the output from the issuer without having to connect up the rest of
the processor. The two most important things in testing the issuer are 1) testing that the issue unit and FIFO properly
communicate with each other and 2) testing that all instruction hazards are handled before instructions are issued
(i.e. the issuer issues only zero, one, or two instructions as appropriate). The testing is done in the following order:

        Tests and Expected Results

        1) Test: Instructions that have no dependencies or structural hazards between them.
           Expected: The issue unit should always issue 2 instructions/cycle.

        2) Test: Instructions that only have data forwarding dependencies (i.e. the second instruction
           depends on the result of the first).
           Expected: The issue unit should only issue 1 instruction/cycle.

        3) Test: Memory instructions.
           Expected: Memory instructions are always issued in the bottom pipeline. If possible, the issue
           unit swaps the memory instruction with the instruction after it so that it can still issue 2
           instructions. Otherwise, if there are 2 memory instructions, only the first one is issued. If a
           memory instruction is paired with a following instruction that writes to the same register, the
           memory instruction is not issued. Similarly, if 2 load instructions write the same register, the
           later load is allowed to run and the previous load is ignored.

        4) Test: Jumps (j, jal, and jr). Jumps were placed in different parts of the block so that we could
           test that, no matter where the jump was, it would still be properly issued. All sorts of
           instructions were placed in the delay slot.
           Expected: A jump must always be issued with its delay slot. If the delay slot is not present, then
           no instruction can be issued. If the jump is in the second instruction slot, then only the first
           instruction is issued, and on the next cycle the jump is issued with its delay slot. We also made
           sure that the issue unit outputs nops if we jump to a block that is not in the FIFO and waits for
           the new block to be written in.

        5) Test: Branches (beq, bne, bltz, bgtz). Branches were placed in different parts of the block so
           that we could test that, no matter where the branch was, it would still be properly issued. All
           sorts of instructions were placed in the delay slot.
           Expected: Similar to a jump. A branch must always be issued with its delay slot. We also made sure
           that the issue unit outputs nops if we mispredict a branch and waits for the new block to arrive.
           Otherwise, if the prediction was correct, it just continues to issue.

   Index of Issue Unit Tests

   Tests used:
   Part 1: noDependecy.s
   Part 2: forwardDependency.s
   Part 3: swlwNoDepend.s, swlwDepend1.s, swlwDepend2.s, swlwDepend3.s
   Part 4: jump.s, simpleJump.s, harderJump.s
   Part 5: simpleBranch.s, branch.s, beq.s, bgez.s, bltz.s, bne.s


D. Forwarding

   Forward Verilog

   Design – Forwarding
   There are two types of hazards in our superscalar datapath: control hazards and data hazards. We avoided
   structural hazards by adding the necessary hardware. As in previous labs, this was a reasonable decision,
   since we have plenty of space available on the board. The issue unit handles the remaining hazards by issuing
   together only instructions that have no control hazards between them and whose data hazards can be resolved by
   forwarding.

   The forwarding unit checks for data hazards between stages within a pipeline and for data hazards between the
   two pipelines. The forwarding within “pipeline_top” and “pipeline_bottom” is essentially the same as the
   forwarding in the original five-stage pipeline. The jump register instruction needs a forwarded value
   if the return address register ($ra) was modified but not yet written into the register file. Thus, the PC to be
   written into $ra is propagated through the pipeline, and a multiplexor was added in the next PC logic to select
   the forwarded PC when a data hazard is detected. The comparator logic in each pipeline also needs forwarded
   values in the event that a control instruction uses a register that has been modified but not yet written to the
   register file. Thus, a multiplexor chooses between the normal value, the value forwarded from the memory stage, and
   the value forwarded from the write-back stage. The ALU and shifter in each pipeline likewise
   need forwarded values in the event that these units use a register value that has been modified but not yet
   written to the register file. Thus, a multiplexor chooses between the normal value, the value forwarded from the
   memory stage, and the value forwarded from the write-back stage. In summary, each pipeline must support
   forwarding the PC to the instruction fetch stage, forwarding memory or write-back data to the instruction
   decode stage, and forwarding memory or write-back data to the execute stage.

   In addition to the forwarding for each pipeline, the superscalar processor must also account for data hazards
   between pipelines. For this reason, the comparator logic in one pipeline will also need the forwarded values
   from the other pipeline. Thus, the multiplexors in the decode stage have been expanded to include data
   forwarded from the memory or write-back stage of the other pipeline. The ALU and Shifter units in one
pipeline will also need the forwarded values from the other pipeline. Thus, the multiplexors in the execute stage
have been expanded to include data forwarded from the memory or write-back stage of the other pipeline.
Finally, it is possible for the top pipeline to compute a value that must be stored into memory by a store
instruction in the bottom pipeline (e.g. addu $1, $2, $3 followed by sw $1, 0($0)). Thus, the forwarding unit passes the
results of the ALU, SLT, and Shifter units from the top pipeline into a multiplexor which selects from these
inputs or the regular value. The output of this multiplexor will then be the data stored into memory for that
store word instruction.

The forwarding unit also handles forwarding priorities. In the event of multiple data dependencies, the
instruction making the most recent modification to the register value should be forwarded. This essentially
means the cases in the forwarding unit should check for a hazard in the memory stage first and the write-back
stage second. Furthermore, when checking a particular stage, the forwarding unit should check for a hazard in
the stage of the other pipeline first and the stage in the same pipeline second. The priority should be as follows:

             1.   Memory stage in the other pipeline
             2.   Memory stage in the same pipeline
             3.   Write-back stage in the other pipeline
             4.   Write-back stage in the same pipeline
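
The priority order above amounts to checking the four candidate sources in a fixed sequence and taking the first match. A minimal Python sketch follows; the function name, the dictionary layout for the pipeline stages, and the rule that register $0 is never forwarded are assumptions made for illustration.

```python
def select_forward(src_reg, same, other):
    """Pick the forwarding source for src_reg, following the priority list.

    same and other describe the two pipelines as dicts with "mem" and "wb"
    keys, each holding (dest_reg, value) or None (hypothetical layout).
    Returns (source_name, value); ("regfile", None) means no forwarding.
    """
    if src_reg == 0:
        return "regfile", None  # assumption: $0 is never forwarded
    for stage in ("mem", "wb"):                       # memory stage before write-back
        for name, pipe in (("other", other), ("same", same)):  # other pipeline first
            entry = pipe.get(stage)
            if entry is not None and entry[0] == src_reg:
                return f"{name}.{stage}", entry[1]
    return "regfile", None
```

If the same register is pending in the memory stage of both pipelines, the value from the other pipeline is selected, matching item 1 of the list.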

There are several cases where a sequence of instructions will require a stall followed by the forwarding of data.
The issue unit and the forwarding unit will be able to handle these cases independently since both control units
monitor the same state of the processor.

Index of Forward Tests

Testing – Forwarding
The testing of the forwarding control units included sequences of instructions that tested the permutations of
various types of instructions, including arithmetic/logical/compare/shift operations, control operations, and data
operations. These were the same tests run on the original five-stage pipeline to test its forwarding. In
addition to single pipeline forwarding tests, we added new tests to check for forwarding between pipelines.
This amounted to generating instructions in sets of four and five that would be in the same buffer and would be
issued in pairs. An example of such a test would be <add $1, $2, $3>, <sll $0, $0, 0>, <sll $0, $0, 0>,
<add $4, $1, $10>. Since the two add instructions have a data dependency and execute on consecutive cycles
but on different pipelines (top and bottom), forwarding between the pipelines is needed. The cases
below are the forwarding cases for a single pipelined datapath. Each of the forwarding cases (with the
exception of Jr Program Counter and Arithmetic – Store) is duplicated to forward values to the other pipeline
in the case of a superscalar processor.


             Jr Program Counter
                     • Pass register $31 through the pipeline such that the jump register instruction
                       propagates through all five stages in the pipeline
                     • Pass register $31 from the decode stage into the execute stage such that operations
                       can use $31 as an operand.

             Arithmetic – Store
                     • (Arithmetic -> Store) Forward Execute Result from Ex/Mem (top) to Memory
                       (bottom)

             Arithmetic – Arithmetic
                     • (Arithmetic -> Arithmetic) Forward VAL from Ex/Mem to Ex stage
                     • (Arithmetic -> Nopx1 -> Arithmetic) Forward BusW from Mem/Wb to Ex stage
                     • (Arithmetic -> Nopx2 -> Arithmetic) Forward BusW from Mem/Wb to Id stage

             Arithmetic – Logical
                     • See Arithmetic – Arithmetic
           (Arithmetic -> Logical) Forward VAL from Ex/Mem to Ex stage
           (Arithmetic -> Nopx1 -> Logical) Forward BusW from Mem/Wb to Ex stage
           (Arithmetic -> Nopx2 -> Logical) Forward BusW from Mem/Wb to Id stage

           (Logical -> Arithmetic) Forward VAL from Ex/Mem to Ex stage
           (Logical -> Nopx1 -> Arithmetic) Forward BusW from Mem/Wb to Ex stage
           (Logical -> Nopx2 -> Arithmetic) Forward BusW from Mem/Wb to Id stage

Arithmetic – Shift
        See Arithmetic – Arithmetic

           (Arithmetic -> Shift) Forward VAL from Ex/Mem to Ex stage
           (Arithmetic -> Nopx1 -> Shift) Forward BusW from Mem/Wb to Ex stage
           (Arithmetic -> Nopx2 -> Shift) Forward BusW from Mem/Wb to Id stage
        
           (Shift -> Arithmetic) Forward VAL from Ex/Mem to Ex stage
           (Shift -> Nopx1 -> Arithmetic) Forward BusW from Mem/Wb to Ex stage
           (Shift -> Nopx2 -> Arithmetic) Forward BusW from Mem/Wb to Id stage

Arithmetic – Control
        (Arithmetic -> Branch) Stall one cycle
        (Arithmetic -> Nopx1 -> Branch) Forward VAL from Ex/Mem to Id stage
        (Arithmetic -> Nopx2 -> Branch) Forward BusW from Mem/Wb to Id stage
        (Branch -> Arithmetic) Delay slot executed

Arithmetic – Compare
        See Arithmetic – Arithmetic

           (Arithmetic -> Compare) Forward VAL from Ex/Mem to Ex stage
           (Arithmetic -> Nopx1 -> Compare) Forward BusW from Mem/Wb to Ex stage
           (Arithmetic -> Nopx2 -> Compare) Forward BusW from Mem/Wb to Id stage

           (Compare -> Arithmetic) Forward VAL from Ex/Mem to Ex stage
           (Compare -> Nopx1 -> Arithmetic) Forward BusW from Mem/Wb to Ex stage
           (Compare -> Nopx2 -> Arithmetic) Forward BusW from Mem/Wb to Id stage


Arithmetic – Data Transfer
        (Arithmetic -> SW) Forward VAL from Ex/Mem to Ex stage
        (Arithmetic -> Nopx1 -> SW) Forward BusW from Mem/Wb to Ex stage
        (Arithmetic -> Nopx2 -> SW) Forward BusW from Mem/Wb to Id stage

           (LW -> Arithmetic) Stall one cycle
           (LW -> Nopx1 -> Arithmetic) Forward BusW from Mem/Wb to Ex stage
           (LW -> Nopx2 -> Arithmetic) Forward BusW from Mem/Wb to Id stage

Logical – Logical
         See Arithmetic – Logical

           (Logical -> Logical) Forward VAL from Ex/Mem to Ex stage
           (Logical -> Nopx1 -> Logical) Forward BusW from Mem/Wb to Ex stage
           (Logical -> Nopx2 -> Logical) Forward BusW from Mem/Wb to Id stage

Logical – Shift
         See Arithmetic – Shift
           (Logical -> Shift) Forward VAL from Ex/Mem to Ex stage
           (Logical -> Nopx1 -> Shift) Forward BusW from Mem/Wb to Ex stage
           (Logical -> Nopx2 -> Shift) Forward BusW from Mem/Wb to Id stage

           (Shift -> Logical) Forward VAL from Ex/Mem to Ex stage
           (Shift -> Nopx1 -> Logical) Forward BusW from Mem/Wb to Ex stage
           (Shift -> Nopx2 -> Logical) Forward BusW from Mem/Wb to Id stage

Logical – Control
         See Arithmetic – Control

           (Logical -> Branch) Stall one cycle
           (Logical -> Nopx1 -> Branch) Forward VAL from Ex/Mem to Id stage
           (Logical -> Nopx2 -> Branch) Forward BusW from Mem/Wb to Id stage
           (Branch -> Logical) Delay slot executed

Logical – Compare
         See Arithmetic – Compare

           (Logical -> Compare) Forward VAL from Ex/Mem to Ex stage
           (Logical -> Nopx1 -> Compare) Forward BusW from Mem/Wb to Ex stage
           (Logical -> Nopx2 -> Compare) Forward BusW from Mem/Wb to Id stage

           (Compare -> Logical) Forward VAL from Ex/Mem to Ex stage
           (Compare -> Nopx1 -> Logical) Forward BusW from Mem/Wb to Ex stage
           (Compare -> Nopx2 -> Logical) Forward BusW from Mem/Wb to Id stage

Logical – Data Transfer
         See Arithmetic – Data Transfer

           (Logical -> SW) Forward VAL from Ex/Mem to Ex stage
           (Logical -> Nopx1 -> SW) Forward BusW from Mem/Wb to Ex stage
           (Logical -> Nopx2 -> SW) Forward BusW from Mem/Wb to Id stage

           (LW -> Logical) Stall one cycle
           (LW -> Nopx1 -> Logical) Forward BusW from Mem/Wb to Ex stage
           (LW -> Nopx2 -> Logical) Forward BusW from Mem/Wb to Id stage

Shift – Shift
          See Arithmetic – Shift

           (Shift -> Shift) Forward VAL from Ex/Mem to Ex stage
           (Shift -> Nopx1 -> Shift) Forward BusW from Mem/Wb to Ex stage
           (Shift -> Nopx2 -> Shift) Forward BusW from Mem/Wb to Id stage

Shift – Compare
          See Arithmetic – Compare

           (Shift -> Compare) Forward VAL from Ex/Mem to Ex stage
           (Shift -> Nopx1 -> Compare) Forward BusW from Mem/Wb to Ex stage
           (Shift -> Nopx2 -> Compare) Forward BusW from Mem/Wb to Id stage

           (Compare -> Shift) Forward VAL from Ex/Mem to Ex stage
           (Compare -> Nopx1 -> Shift) Forward BusW from Mem/Wb to Ex stage
           (Compare -> Nopx2 -> Shift) Forward BusW from Mem/Wb to Id stage

Shift – Control
          See Arithmetic – Control

           (Shift -> Branch) Stall one cycle
           (Shift -> Nopx1 -> Branch) Forward VAL from Ex/Mem to Id stage
           (Shift -> Nopx2 -> Branch) Forward BusW from Mem/Wb to Id stage
            (Branch -> Shift) Delay slot executed

Shift – Data Transfer
          See Arithmetic – Data Transfer

           (Shift -> SW) Forward VAL from Ex/Mem to Ex stage
           (Shift -> Nopx1 -> SW) Forward BusW from Mem/Wb to Ex stage
           (Shift -> Nopx2 -> SW) Forward BusW from Mem/Wb to Id stage

           (LW -> Shift) Stall one cycle
           (LW -> Nopx1 -> Shift) Forward BusW from Mem/Wb to Ex stage
           (LW -> Nopx2 -> Shift) Forward BusW from Mem/Wb to Id stage

Compare – Compare
       See Arithmetic – Arithmetic

           (Compare -> Compare) Forward VAL from Ex/Mem to Ex stage
           (Compare -> Nopx1 -> Compare) Forward BusW from Mem/Wb to Ex stage
           (Compare -> Nopx2 -> Compare) Forward BusW from Mem/Wb to Id stage

Compare – Control
       See Arithmetic – Control

           (Compare -> Branch) Stall one cycle
           (Compare -> Nopx1 -> Branch) Forward VAL from Ex/Mem to Id stage
           (Compare -> Nopx2 -> Branch) Forward BusW from Mem/Wb to Id stage
           (Branch -> Compare) Delay slot executed

Compare – Data Transfer
       See Arithmetic – Data Transfer

           (Compare -> SW) Forward VAL from Ex/Mem to Ex stage
           (Compare -> Nopx1 -> SW) Forward BusW from Mem/Wb to Ex stage
           (Compare -> Nopx2 -> SW) Forward BusW from Mem/Wb to Id stage

           (LW -> Compare) Stall one cycle
           (LW -> Nopx1 -> Compare) Forward BusW from Mem/Wb to Ex stage
           (LW -> Nopx2 -> Compare) Forward BusW from Mem/Wb to Id stage

Control – Control
         No dependencies

Control – Data Transfer
         No dependencies

Data Transfer – Data Transfer
        (LW -> SW) Stall one cycle
        (LW -> Nopx1 -> SW) Forward BusW from Mem/Wb to Ex stage
        (LW -> Nopx2 -> SW) Forward BusW from Mem/Wb to Id stage
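
The cases enumerated above follow a regular pattern that depends on the producer class, the consumer class, and the number of intervening nops. The following is a minimal Python sketch of that pattern, not the team's Verilog forwarding unit. The function name `forwarding_action` and the grouping `ALU_LIKE` are hypothetical names introduced for illustration, the returned strings use the wording of the Arithmetic entries, and any pair not listed above raises an error rather than guessing.

```python
# Instruction classes that the list treats identically as producers and
# consumers. ALU_LIKE is a hypothetical grouping used only in this sketch.
ALU_LIKE = {"Arithmetic", "Logical", "Shift", "Compare"}


def forwarding_action(producer: str, consumer: str, nops: int) -> str:
    """Return the listed action for `producer -> (nops x Nop) -> consumer`."""
    if nops not in (0, 1, 2):
        raise ValueError("the list only covers 0, 1, or 2 intervening nops")

    if producer in ALU_LIKE and consumer in ALU_LIKE | {"SW"}:
        # The result is ready at the end of Ex, so it can always be forwarded.
        return (
            "Forward VAL from Ex/Mem to Ex stage",
            "Forward BusW from Mem/Wb to Ex stage",
            "Forward BusW from Mem/Wb to Id stage",
        )[nops]

    if producer in ALU_LIKE and consumer == "Branch":
        # The branch needs its operands in the Id stage, one stage earlier,
        # so a back-to-back dependency costs one stall cycle.
        return (
            "Stall one cycle",
            "Forward VAL from Ex/Mem to Id stage",
            "Forward BusW from Mem/Wb to Id stage",
        )[nops]

    if producer == "LW" and consumer in ALU_LIKE | {"SW"}:
        # Loaded data is not available until the Mem stage completes.
        return (
            "Stall one cycle",
            "Forward BusW from Mem/Wb to Ex stage",
            "Forward BusW from Mem/Wb to Id stage",
        )[nops]

    if producer == "Branch" and consumer in ALU_LIKE and nops == 0:
        return "Delay slot executed"

    raise ValueError(f"{producer} -> {consumer} is not enumerated in the list")


if __name__ == "__main__":
    print(forwarding_action("Shift", "Compare", 0))
    print(forwarding_action("Logical", "Branch", 0))
    print(forwarding_action("LW", "SW", 2))
```

Writing the rules as three tuples indexed by the nop count makes it easy to see that only the distance-zero case differs between the three producer/consumer groups.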
Part II – Branch Prediction

    A. Branch Translation Buffer (BTB)

         Branch Translation Buffer Verilog

         Design – Branch Translation Buffer
         The branch translation buffer is a 256-entry direct-mapped cache for branch target addresses. The 256-entry buffer is
         constructed from eight 32-entry translation buffers. The buffer takes in the PC of the first of four fetched
         instructions and the four 32-bit instructions, and outputs the predicted target PC, asserting a predictedHit signal if an
         entry’s tag matches the incoming PC. It also asserts a special flag if the fourth instruction of the set is a branch; the
         decode stage handles this case specially.

         Given a set of four instructions, there can be at most two branch instructions. Furthermore, in the case where there
         are two branch instructions, one branch must be in the first two instructions and one branch must be in the last two
         instructions. Each 32-entry translation buffer is constructed from 32 btb-entry modules. Each btb-entry contains a
         32-bit tag (used to match with the PC corresponding to the first of the four instructions) and two target addresses
         corresponding to the two possible branch instructions.

         The 32-entry buffer contains 32 instances of the btb-entry module. Each module has a hit signal which is asserted
         when the incoming PC matches the tag of the corresponding entry and at least one of the four incoming instructions
         is a branch. The write enable for any particular entry is asserted only if the index (as determined by the lower 5 bits
         of the PC) corresponds to the entry number, the entry’s bank is enabled, and one of the four instructions is a branch.
         The output of the 32-entry buffer is a result of the outputs of all 32 entries going into a series of multiplexors. All 32
         outputs are put into four 8-input multiplexors selected by PC [2:0]. The outputs of these multiplexors are put into
         one 4-input multiplexor selected by PC [4:3]. It is the output from this last multiplexor that is also the output of the
         32-entry buffer.
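
The hit, write-enable, and output-selection rules of the 32-entry buffer can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the team's Verilog: the names `BtbEntry`, `entry_hit`, `entry_write_enable`, and `read_two_level` are hypothetical, no valid bit is modeled because the text does not describe one, and the kth 8-input multiplexor is assumed to be wired to entries 8k through 8k+7.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class BtbEntry:
    tag: int = 0                        # 32-bit tag compared against the incoming PC
    targets: Tuple[int, int] = (0, 0)   # one target per possible branch in the group


def entry_hit(entry: BtbEntry, pc: int, br: Sequence[bool]) -> bool:
    """Hit: the tag matches the PC and at least one fetched instruction is a branch."""
    return entry.tag == pc and any(br)


def entry_write_enable(entry_number: int, pc: int, bank_enabled: bool,
                       br: Sequence[bool]) -> bool:
    """Write enable: index (lower 5 PC bits) selects this entry, the bank is
    enabled, and at least one fetched instruction is a branch."""
    return (pc & 0x1F) == entry_number and bank_enabled and any(br)


def read_two_level(outputs: Sequence, pc: int):
    """Select one of 32 entry outputs with four 8-input muxes and one 4-input mux."""
    assert len(outputs) == 32
    low = pc & 0x7           # PC[2:0] drives all four 8-input muxes
    high = (pc >> 3) & 0x3   # PC[4:3] drives the final 4-input mux
    first_stage = [outputs[8 * k + low] for k in range(4)]
    return first_stage[high]


if __name__ == "__main__":
    outputs = [f"entry{n}" for n in range(32)]
    # The two-level tree selects the same entry as indexing directly with PC[4:0].
    print(all(read_two_level(outputs, pc) == outputs[pc & 0x1F] for pc in range(256)))
```

Under the wiring assumption above, the two-level multiplexor tree is equivalent to a single 32-input selection on PC[4:0]; splitting it keeps each multiplexor small.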

         The 256-entry buffer is easily constructed from eight 32-entry buffers and multiplexors to select the outputs of these
         eight buffers. The wrapper around the 256-entry buffer takes each of the four instructions fetched and decodes each
         instruction to determine whether it is a branch. The wrapper also takes the immediate field of each instruction and
         computes the branch target address. The ith “br” signal is asserted if the ith instruction is a branch. These four
         “br” signals are sent into the sub-buffers, which in turn pass them to the btb-entries.
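
The wrapper's decode step can be illustrated with a short Python sketch. It assumes standard MIPS encodings and a byte-addressed PC: only beq (opcode 0x04) and bne (opcode 0x05) are treated as branches, and the target is computed with the usual MIPS formula (address of the branch plus 4, plus the sign-extended immediate shifted left by two). Which opcodes the team's wrapper actually decodes is not stated in this report, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# Assumption: only beq (0x04) and bne (0x05) count as branches in this sketch.
BRANCH_OPCODES = {0x04, 0x05}


def is_branch(instr: int) -> bool:
    """Decode the 6-bit opcode field, bits [31:26]."""
    return ((instr >> 26) & 0x3F) in BRANCH_OPCODES


def sign_extend16(imm: int) -> int:
    """Interpret a 16-bit field as a two's-complement value."""
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(instr_pc: int, instr: int) -> int:
    """Standard MIPS branch target for a branch located at instr_pc."""
    offset = sign_extend16(instr & 0xFFFF) << 2
    return (instr_pc + 4 + offset) & 0xFFFFFFFF


def decode_fetch_group(pc: int, instrs: Sequence[int]) -> Tuple[List[bool], List[Optional[int]]]:
    """Return the four 'br' flags and a target for each branch slot.

    `pc` is the address of the first instruction; slot i sits at pc + 4*i.
    """
    br = [is_branch(instr) for instr in instrs]
    targets = [
        branch_target(pc + 4 * slot, instr) if flag else None
        for slot, (instr, flag) in enumerate(zip(instrs, br))
    ]
    return br, targets


if __name__ == "__main__":
    NOP = 0x00000380         # sll $0, $0, 14
    BEQ = 0x10000003         # beq $0, $0, +3 instructions
    flags, targets = decode_fetch_group(0x100, [NOP, BEQ, NOP, NOP])
    print(flags, [hex(t) if t is not None else None for t in targets])
```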

         The 256-entry buffer was tested independently with test vectors. The inputs to the buffer were designed to simulate
         all possible branch locations in a set of four instructions. Our test vectors included tests for one branch in each of
         the four instruction slots and tests for two branches in various instruction slots. The tests for functionality verified that
         cache hits occurred correctly, cache misses occurred correctly, and the target addresses contained in the entries were
         replaced correctly.

         Branch Translation Buffer TestBench

         Testing – Branch Translation Buffer
         The incremental testing for the branch translation buffer involved creating test vectors for sets of four instructions
         with a single beq instruction in each position in the set (i.e. a beq instruction in the first, second, third, or fourth
         instruction slot). The other instructions in the set are nops (sll $0, $0, imm) with varying immediates in order to
         distinguish the nops from one another. These tests were used to ensure the basic functionality of the branch translation buffer.
         The following test cases are checked by the test vectors in the testbench (and are also noted in the testbench as
         comments).

             1.   Beq instruction in instruction slot 1
             2.   Beq instruction in instruction slot 2
             3.   Beq instruction in instruction slot 3
             4.   Beq instruction in instruction slot 4
             5.   Beq instruction in instruction slots 2 and 4, use top branch
             6.   Beq instruction in instruction slots 1 and 3, use bottom branch

   Under the interface specified during the design phase, the issue unit would provide a signal to specify whether the
   top or bottom branch target address was to be predicted in the event that there were two branches in a set of four
   instructions.


B. Branch History Table

   Branch History Table Verilog

   Design – Branch History Table
   The branch history table is a 256-entry direct-mapped cache for branch taken signals. The 256-entry history
   table is constructed from eight 32-entry history tables. The table takes in the PC of the first of four fetched
   instructions and a mispredict signal, which is asserted if the predicted taken signal does not match the comparator’s taken
   signal in the decode stage, and outputs the predicted taken signal.

   The 256-entry history table is constructed from eight 32-entry tables and multiplexors to select the outputs of these
   eight tables. Furthermore, the 32-entry tables are constructed from 32 bht-entries and multiplexors to select the
   outputs of these entries. Both sets of multiplexors are selected by bits in the PC.

   The 32-entry table contains 32 instances of a bht-entry. The 32 entries are indexed by bits in the PC. Each entry
   implements a 2-bit dynamic branch prediction scheme in which the prediction is changed only when a misprediction
   occurs twice. This team’s implementation involved creating a four-state FSM. The four states are described below:

            Taken 1
            The FSM resets to Taken 1. If the prediction was incorrect, the FSM transitions to Not Taken 1. If the
            prediction was correct, the FSM transitions to Taken 2. In this state, the prediction is that the branch will
            be taken.

            Taken 2
            The FSM transitions from Taken 1 to Taken 2 if a prediction was correct. If the prediction was incorrect,
            the FSM transitions to Taken 1. If the prediction was correct, the FSM stays in the same state. In this state,
            the prediction is that the branch will be taken.

            Not Taken 1
            The FSM transitions from Taken 1 to Not Taken 1 if a prediction was incorrect. If the prediction was
            incorrect, the FSM transitions back to Taken 1. If the prediction was correct, the FSM transitions to Not Taken 2.
            In this state, the prediction is that the branch will not be taken.

            Not Taken 2
            The FSM transitions from Not Taken 1 to Not Taken 2 if a prediction was correct. If the prediction was
            incorrect, the FSM transitions to Not Taken 1. If the prediction was correct, the FSM stays in the same
            state. In this state, the prediction is that the branch will not be taken.
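
The four-state FSM can be modeled in a few lines of Python. This is a sketch rather than the team's bht-entry module: the class and method names are hypothetical, and the two Not Taken states are assumed to mirror the Taken states (a correct prediction moves toward the stronger state, a misprediction moves toward the opposite prediction). The `is_branch` gate follows the testbench description, in which a deasserted branch signal leaves the FSM unchanged.

```python
from enum import Enum


class State(Enum):
    TAKEN_1 = "Taken 1"          # reset state, predicts taken
    TAKEN_2 = "Taken 2"          # predicts taken
    NOT_TAKEN_1 = "Not Taken 1"  # predicts not taken
    NOT_TAKEN_2 = "Not Taken 2"  # predicts not taken


# (current state, mispredict) -> next state
NEXT_STATE = {
    (State.TAKEN_1, False): State.TAKEN_2,
    (State.TAKEN_1, True): State.NOT_TAKEN_1,
    (State.TAKEN_2, False): State.TAKEN_2,
    (State.TAKEN_2, True): State.TAKEN_1,
    (State.NOT_TAKEN_1, False): State.NOT_TAKEN_2,
    (State.NOT_TAKEN_1, True): State.TAKEN_1,
    (State.NOT_TAKEN_2, False): State.NOT_TAKEN_2,
    (State.NOT_TAKEN_2, True): State.NOT_TAKEN_1,
}


class BhtEntry:
    """One 2-bit predictor entry driven by the decode stage's mispredict signal."""

    def __init__(self) -> None:
        self.state = State.TAKEN_1

    def predict_taken(self) -> bool:
        return self.state in (State.TAKEN_1, State.TAKEN_2)

    def update(self, is_branch: bool, mispredict: bool) -> None:
        # Non-branch instructions never change the FSM state.
        if is_branch:
            self.state = NEXT_STATE[(self.state, mispredict)]


if __name__ == "__main__":
    entry = BhtEntry()
    for mispredict in (False, True, True, False):
        entry.update(is_branch=True, mispredict=mispredict)
        print(entry.state.value, "-> predict taken:", entry.predict_taken())
```

Starting from Taken 2 or Not Taken 2, two consecutive mispredictions are needed before the predicted direction flips, which is the hysteresis the 2-bit scheme is meant to provide.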

   Branch History Table TestBench

   Testing – Branch History Table
   The 256-entry table was tested independently with test vectors. The inputs to the buffer were designed to simulate a
   sequence of predictions followed by signals from the decode stage signaling correct and incorrect branches. These
   tests were used to verify that the entries are updated correctly on a misprediction and that the state transitions are
   correct depending on correct/incorrect predictions.

   The test-bench for the branch history table checks the basic functionality and semantics of the four-state finite state
   machine. There is a series of three program counters, each with different “mispredict” and “branch” signals. A
   simulated “mispredict” signal from the decode stage would inform the FSM whether or not the prediction was
   correct and cause a state change in the FSM. A simulated “branch” signal would indicate whether or not the
   instruction at the given program counter was a branch. Two of the program counters were provided different
   mispredict signals over time. The third program counter was provided a deasserted branch signal, testing that the
   FSM would not change regardless of this program counter’s mispredict signal. All three instructions are interleaved
   to simulate repeating loops.

          PC                 Test Case
          32’h00000002       A series of branches that mispredicts on a “random” basis
          32’h00000004       A series of branches that mispredicts on an alternating basis
          32’h00000008       A series of non-branch instructions


    C. Static Branch Prediction

         Stream Buffer Verilog

         Static branch prediction is enabled by the stream buffer. The issue unit essentially takes the fetched instructions
         from the stream buffer and performs the data dependency analysis. If the issue unit issues a branch, the instructions
         following the branch remain in the buffer and will be issued unless the branch is taken. Thus, the stream buffer
         allows static branch prediction, always predicting the branch is not taken. If the branch is taken, the stream buffer
         will be flushed and the instructions immediately following the branch instruction would not be executed.

         The alternative to this form of static branch prediction would be the addition of extra delay slots which is not an
         option for this project. The project specifications require a single delay slot after each branch instruction.

Part III - General Testing

Index of General Tests

The team followed a policy of incremental testing with the implementation of new modules before integrating these modules
to assemble the final datapath. The branch prediction components were tested individually to ensure basic functionality of
the direct mapped tables and, in the case of the branch history table, the correct functioning of the finite state machine. The
decode stage was tested independently. The FIFO queue buffering the fetched instructions was integrated with the issue unit and
connected to a fake instruction cache. The fake instruction cache contained particular sequences of instructions that the issue
unit would analyze before issuing to the pipeline. The tests involved observing the issued instructions to ensure that no pairs
of instructions issued had data dependencies that could not be handled by the two pipelines.

After integrating the decode stage with the two pipelines, the first major test of the processor was the boot loader. The boot
loader actually contains many corner cases requiring the pipelines to forward between themselves. After the processor
executed the boot loader successfully, testing proceeded with the test files used for lab 5 to test forwarding and hazards. This
same set of instructions would excite similar data hazards and test the forwarding logic between the pipelines. Additional
tests were added to the original set of tests to ensure that certain cases that may not have occurred in a single pipeline would
be tested for the superscalar pipelines. Lastly, the corner tests from lab 5 were run to ensure that those cases would still pass.

Although these general tests were comprehensive with regard to forwarding between the stages of the pipeline and
forwarding between the two pipelines themselves, these tests failed to explicitly check the forwarding priorities. Ultimately,
the various cases in the forwarding logic needed to be reordered such that each pipeline checks for a data hazard occurring in
the other pipeline before checking for a data hazard occurring within the same pipeline. This is necessary since the
instructions are essentially interleaved between the two pipelines. In the event of a data hazard, the most recent hazard would
occur in the other pipeline instead of occurring in the same pipeline from the previous cycle. Refer to the design of the
forwarding unit for details.
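
The priority ordering can be sketched as a small selection function. This is an illustration under assumptions, not the team's forwarding unit: the name `select_forward` and the data layout are hypothetical, and the nesting of the two orderings (Ex/Mem before Mem/Wb as the outer order, other pipeline before same pipeline as the inner order) is an assumption, since the report states both orderings but not how they combine.

```python
from typing import Dict, Optional, Tuple

# (stage, pipeline) -> (destination register, value), or None when that
# pipeline register holds no register-writing instruction.
Candidates = Dict[Tuple[str, str], Optional[Tuple[int, int]]]


def select_forward(src_reg: int, consumer_pipe: str,
                   candidates: Candidates) -> Optional[int]:
    """Return the forwarded value for src_reg, or None to use the register file."""
    if src_reg == 0:
        return None  # MIPS register $0 is hard-wired to zero
    other_pipe = "bottom" if consumer_pipe == "top" else "top"
    for stage in ("Ex/Mem", "Mem/Wb"):           # nearer stage first
        for pipe in (other_pipe, consumer_pipe):  # other pipeline first
            entry = candidates.get((stage, pipe))
            if entry is not None and entry[0] == src_reg:
                return entry[1]
    return None


if __name__ == "__main__":
    in_flight: Candidates = {
        ("Ex/Mem", "top"): (4, 10),
        ("Ex/Mem", "bottom"): (4, 20),
        ("Mem/Wb", "top"): (4, 30),
        ("Mem/Wb", "bottom"): None,
    }
    # A consumer in the top pipeline reading $4 takes the bottom pipeline's value.
    print(select_forward(4, "top", in_flight))
```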

On a general note regarding the test trace files, the nops used by our tests were <sll $0, $0, 14>. This distinguished the nops
that we had entered into the code from the nops generated within the pipelines, which were <sll $0, $0, 0>.
Results

Our team was able to pass all of the dumb (corner2), dumber (prime), and quicksort (quicksort1) tests. In addition, our superscalar
processor was able to pass all the previous tests, as well as our own tests, including but not limited to base, verify, and corner.
All of these passed in simulation. Our clock cycle counts for the more interesting tests are:

Dumb: 960 clock cycles
Dumber:
        5th prime number: 1,124 clock cycles.
        80th prime number: 1,138,615 clock cycles.

Quicksort1: 9,154 clock cycles. Quicksort 1 had 5,307 instructions, which means we had a CPI of 1.72 for a very sw/lw
intensive program.

Our clock frequency is 14.358 MHz.

While the team did not have time to map the design to the board, one synthesis run indicated that a final push to the board would have
yielded numbers very similar to these:

Block Rams: 67 out of 160 (42%)
Slices: 6669 out of 19200 (35%)
Critical Path: 69.646 ns
Max Clock Frequency: 14.358 MHz
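
The derived figures above can be reproduced from the raw numbers with a few lines of Python. The input values are taken from this section; the estimated Quicksort1 run time is an added illustration that assumes the processor runs at the synthesized maximum clock frequency.

```python
# Raw figures reported in this section.
cycles = 9_154             # Quicksort1 clock cycles
instructions = 5_307       # Quicksort1 instruction count
critical_path_ns = 69.646  # synthesized critical path

cpi = cycles / instructions
f_max_mhz = 1_000.0 / critical_path_ns  # 1 / (ns) expressed in MHz
# Assumption: the design is clocked at f_max for the whole run.
runtime_us = cycles * critical_path_ns / 1_000.0

block_ram_pct = 100 * 67 / 160
slice_pct = 100 * 6669 / 19200

print(f"CPI                 = {cpi:.2f}")
print(f"Max clock frequency = {f_max_mhz:.3f} MHz")
print(f"Quicksort1 run time = {runtime_us:.2f} us (at f_max)")
print(f"Block RAM usage     = {block_ram_pct:.0f}%")
print(f"Slice usage         = {slice_pct:.0f}%")
```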

Conclusion

The strength of this team’s implementation of the superscalar processor is the modularity of the design. Compared to previous
lab assignments that included all datapath components in a single schematic file, the size of the superscalar processor, with its
two pipelines, required encapsulating the various components of our design into single schematic symbols when assembling
the datapath. This design facilitates modular and incremental testing of the datapath components and also allows for a more
compact design.

In general, designing the dual issue unit required the most time and effort, while implementing the forwarding for the
superscalar pipelines was a relatively straight-forward extension of the forwarding from lab 5. The dual issue unit required
the consideration of many possible pairs of instructions that could be fetched and considered for issue. The forwarding unit
required modifications to include forwarding paths between pipelines. The forwarding priorities also caused some problems
that were ultimately resolved by forwarding from the other pipeline first and within the same pipeline second (in addition to
forwarding from the memory stage first and the write back stage second). This subtlety was the cause of the correctness
issues in running the TA’s tests, which were ultimately resolved by correcting the forwarding priorities.

The greatest difficulty encountered in this team’s work on the final project was the time constraints. Each member had put
in a great deal of time on previous labs, thereby falling behind in their other classes. Work on the final project extended too
far into the finals schedules of the group members, who had finals to prepare for. As a result, the design,
implementation, verification and presentation of the project were completed in only one week. Given more time, we
would have been able to put the design on the board, but given the time constraints, mapping the processor to hardware
was not feasible. As it turned out, our team worked until Thursday, spent an additional 3-5 hours over the extra three days
given, and resumed work on Monday night to resolve a correctness issue excited by Corner 2. The team was able to resolve
all known correctness issues in simulation Tuesday afternoon after team members had completed their finals.

Given more time, the team would have tried to integrate the branch history table, the branch translation buffer, and the
superscalar processor more effectively in order to implement the more optimal 2-bit dynamic branch prediction scheme over
the static branch prediction scheme currently in operation.

The final result of this project is a two-way superscalar processor with five-stage pipelines and branch prediction. The team
has verified full functionality in simulation for a variety of tests and programs with the exception of quicksort2. The team
has not been able to map the processor to hardware due to time constraints.
Appendix I – Notebooks

Jack Kang


       Total Hours: 98 hours over 9 days.

       ---------
       Thursday 5/8 12:00am

       Goal: Think about how to implement superscalar

       Have a decent idea, still some bugs to work out. 4 pages of work ready to be shown to the guys
       tomorrow.

       Thursday 2:30 am

       --------------------

       Thursday 5/8 11:00am

       Goal: Discuss design with partners

       Ran into some problems with branch, pc, arbiter. Talked to kubi, we need to have fetch grab
       more than just two instructions per cycle. This is complicated and needs more working out...

       Divided up the work, plan is to finish most things by Saturday for final testing.

       Design is on paper.

       Thursday 5/8 1pm

       -----------------------

       Thursday 5/8 1pm

       Goal: Dual port the reg file!

       Done and tested. Files in U:\newfilesforlab7

       Thursday 5/8 2:20pm

       -----------------------

       Thursday 5/8 9:10pm

       Goal: Design Issue Arbiter!

       lots of thinking, lots of paper cases. We think we know how to do the buffer side of things.
       Will come in tomorrow to code, and to also do the issuing part of the state machine.

       Friday 5/9 1:20am

       --------------------

       Friday 5/9 3:15pm
Goal: Finish coding special FIFO buffer part of issue arbiter

finished coding. Need to test. Breaking for dinner.

Friday 5/9 7:09pm
-------------------------

Friday 5/9 9:00pm

Goal: Finish testing FIFO. Begin issue unit.

-Finished issue unit with David. We have linked them together and have tested the easiest case
of instructions with no dependencies or control or memory usage and it all works.
-We will be back tomorrow to test further, including memory and dependency. We have tests written up
but it is hard to test due to the lack of a cache at the moment. We will be creating a new verilog module
tomorrow to fake this.

Meanwhile, forwarding assumptions made:

addu $4, $5, $6
sw $4, 0($0)

should work because we will forward the $4 value from the top pipeline down to the bottom pipeline. (Ex->Mem)

Sat 5/10 2:30AM
-------------------

Sat 5/10 4pm

Goal: Finish Issue Unit

Jumps are giving us a very hard time. Offsets (if we don't jump to a mod 4) are a big issue. Also, the fifo has
to know when to write in, and that part is difficult. Our prior implementation needs some work...

Integrated issue unit with fifo buffer. Fixed some bugs. Testing various cases to see if it issues correctly.
We are doing this in steps, first doing just r-type, then r-type with dependencies, then sw/lw, and then sw/lw
with dependencies, then finally control (branches and jump).

David and I are now splitting up the work, he is continuing to test branches and jumps while I create the PC for
actually jumping.

We have almost completely tested the issue for non control instructions and believe that it all works. I am working
on the jump/branch pc stuff, and am running into some problems. I am going home to sleep and coming back
tomorrow to do that part.

Sun 5/11 4:00am
--------------

Sun 5/11 4:00pm

Goal: Fix up issue unit to work with jumps.

I think i figured out what to do last night at 5am. I wrote it down and will implement it. It should hopefully work.
Finished jumps! Had to make issue unit smarter, as it now tells the fifo when to load or not. Also, fifo had to be
changed to be smarter when an invalidate signal came along.

Mon 5/12 1:45am

---------------

Mon 5/12 1:45am

Goal: Figure out how to do branches.

we've written in the branches code. We have begun testing, using waves, put together with a reg file. There are some
problems that we need to fix tomorrow.

Mon 5/12 5:00am

----------

Mon 5/12 6:00pm

Goal: Finish doing branches

We have finished branches, also added in jal and jr. We are now putting everything together to begin testing.

We had to increase the regfile to always output the data that the jr is always pointing to. This will save us a
cycle when we issue jr.

Also, we still have to look at break and stalls later on.

Mon 5/12 10:45pm
-------

Mon 5/12 10:45pm

Goal: Put everything together.

Lots of pins not connected in the schematic. This was a huge waste of time and the stuff should have been tested
before it was given to us, especially since they told us it was tested.

the branch in the boot loader seems to be broken, it seems to not take it even though it should...

Coming back tomorrow at 11 to do this.

Tue 5/13 4:50am
-------------

Tue 5/13 11:00am

Goal: Debug stuff we put together

Problems with some timing, we can't get past boot loader. Fixed some bugs with the pointers in the fifo.

Stuff works except for break, and stalls.

Tue 5/13 4:00pm
-------------
Tue 5/13 9:30pm

Goal: Fix breaks and stalls

LOTS OF TIMING ISSUES! hard...we may have to re think our design. We need to deal with instruction stalls and
data stalls separately.

Wed 5/14 3:00am
--------------

Wed 5/14 11:45am

Goal: Fix data stalls vs. instruction stalls

write some stuff....


can we do a beq to a beq? w/ weird delay slots? + inst stall?!
beq
break

beq

Okay, verify and base work now. We still have to get corner to work.

There are problems with some branches going to the wrong place. This seems very peculiar as it has been
working.....
Specifically, the taken signal is not going high. This makes no sense...

MODELSIM is broken!...Tried to put to board during down time and it didn't work. Going home now.

The taken signal may have been an artifact of the broken modelsim.

Thu 5/15 4:00am
----------------

Thu 5/15 10:45am

Goal: Fix the processor to pass corner...

Time ran out, we do not do jal->jr correctly if $31 is modified too quickly within the instructions. Time to go
do the presentation

Thu 5/15 1:30pm
--------------

Thu 5/15 4:20pm

Goal: Fix the jal-> jr case

Breaking for dinner.

Thu 5/15 5:00pm

-------------
Thu 5/15 8:00pm

Goal: FINISH FOR GOOD!

We can do base and corner.

final half ass has quicksort semi working, and one minor bug it seems in base. It is the closest we have to all tests
working.

final_allbutQS has base working, and quicksort just not sorting in the correct order.

jr with a delay slot to something in its own block? Will that work??

We have problems in that we write in too many things when there is an offset. This is causing quicksort to not work.

final_imtired is the last state of what we have. Basically, we will write in an offset of 3 into block 1-4. Then
we will have the next free entry be 3, rather than 4. This causes major problems!
final_imtired zip has a hacked version of fifo that was supposed to fix this by writing smarter but it didn't work
on the first try.

Fri 5/16 7:10am
-------------------

Fri 5/16 9:30pm

Goal: fix the damn bugs

-changes what happens when we come back from an instruction stall to make it more restrictive
-changes how we hold the invalidate signal inside a stall
currently: corner almost works...
-added currentTopEntry hack to force it to 0 in certain cases.
these above this line are good for sure
-hold offset during numEntry case = 0, fixed branches where we would branch but hit a data stall during the branch
that should be good, but not fixed yet completely...
-hold invalidate during numEntry case =0

during numEntry case, why are we checking for branchstall2 and jumpstall2?????
I got rid of those, and now it works.

OH HEEEELLLLL YEA! quicksort works! 1:10am.

Even the worm test, provided to us by Ray, works.

Fri 5/16 11pm
--------------
Mon 5/19 8:15pm

Goal: watch the race

There’s a bug…can’t seem to fix it. I don’t have any extra time to finish it right now, maybe tomorrow after my
final I can fix it for whatever credit.

Mon 5/19 10:15pm
------------
Tue 5/20 10:30am
       Goal: Fix the bug that was found yesterday

       This is a forwarding bug that happens during stalls. Unfortunate, as I am not familiar with the forwarding unit.
       Found it, priority between pipelines was wrong. Additionally, we also have to forceload again during a stall if our
       first forceload gets cut off due to the stalling.

       Tue 5/20 4:30pm
Benjamin Lee

       Total Time: ~56.5 hours

       Thu May 8 12:02:13 PDT 2003 (~1 hour)
       Goals: Design review for the final project

               Decided to implement superscalar with branch prediction
               Two distinct components with [fetch, decode] and [execute, mem, write back].
               Distributed work (Lyle and I are working on the [execute, mem, write back] section)

       Thu May 8 13:05:21 PDT 2003


       Thu May 8 15:23:10 PDT 2003 (~ 5 hours)
       Goals: Complete the schematic implementation of the superscalar pipeline

               Began building blocks for each pipeline stage
               Completed the block implementation of the memory stage
               Completed the block implementation of the top and bottom pipelines
               Halfway through wiring up the new block components

       Thu May 8 20:10:42 PDT 2003


       Fri May 9 09:05:10 PDT 2003 (~2 hours)
       Goals: Complete the wiring for the schematic implementation of the pipeline

               Wiring completed.
               Began to write basic test vectors to test the later stages of the pipeline independently of the fetch and
                decode stage. We’ll attach the top and bottom pipelines to the earlier stages when Jack and David work out
                the dual-issue in the fetch stage.

       Fri May 9 10:57:22 PDT 2003


       Fri May 9 13:30:01 PDT 2003 (~ 12 hours)
       Goals: Complete the testing for the dual pipelines.

               Test vectors are complete with several basic tests. We may need to look at the forwarding again to account
                for the corner cases. There are significantly more corner cases with two instructions issued at the same
                time.
               Began the implementation of the branch target buffer (BTB). Implemented a 2048 entry BTB. There are
                concerns about the size of the BTB when mapping to board. We may opt for a smaller BTB for our final
                implementation.

       Sat May 10 01:30:22 PDT 2003
Sat May 10 09:31:21 PDT 2003 (~ 10 hours)
Goals: Complete testing for the BTB

       The BTB had some syntax errors that Lyle and I fixed.
       The hit signal has been modified to include a condition for the branch. In order to get a hit, we must have a
        matching PC and tag, in addition to having a branch instruction in one of the four instructions in the buffer.

       Began implementation of branch history table (BHT)
       The BHT is implemented with hysteresis (a four state FSM). The transitions are made on the clock cycle.
       The BHT takes the taken signal from the decode to make the appropriate transition in the FSM.
       NOTE: The predicted PC from the BHT will be available 2 mux delays (2 ns) after the PC changes. The
        PC is used as the selectors for the internal muxes. Therefore, the predicted PC cannot be available at the
        beginning of the clock cycle, but will be ready after 2ns.


Sat May 10 19:30:15 PDT 2003


Sun May 11 10:47:23 PDT 2003 (~2 hours)
Goals: Complete the debugging of the BHT.

       The predicted PC from the BHT will be available 2 mux delays (2 ns) after the PC changes. The PC is used
        as the selectors for the internal muxes. Therefore, the predicted PC cannot be available at the beginning of
        the clock cycle, but will be ready after 2ns.

Sun May 11 12:43:21 PDT 2003


Sun May 11 13:42:17 PDT 2003 (~ 3 hours)
Goals: Check forwarding for the two pipelines in the superscalar implementation.

       Added hardware and logic to handle forwarding of register $31. A code sequence such as <jal loop; add
        $31, $31, 2> would need $31 forwarded to the execute stage.

       Need to add logic for an <add $1, $2, $3; sw $1, 0($0)> pair issued together. The ALU, SLT, and Shift outputs
        from the EX/MEM pipeline register in the top pipeline need to be muxed into the lower pipeline (memory stage).
        The mux in the bottom memory stage takes: (1) MEM_RegB_bottom, (2) MEM_ALU_top, (3) MEM_SLT_top, (4)
        MEM_Shift_top. This will require added logic in the forwarding unit and the pipeline_top schematic.
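The select logic for that four-input mux might look like the sketch below. It is a behavioral guess, not the group's forward.v: the argument names, the `top_unit` tag naming which functional unit produced the top result, and the select encoding are all assumptions.

```python
# Assumed select encoding for the bottom memory-stage mux.
SEL_REGB_BOTTOM, SEL_ALU_TOP, SEL_SLT_TOP, SEL_SHIFT_TOP = range(4)
UNIT_TO_SELECT = {"alu": SEL_ALU_TOP, "slt": SEL_SLT_TOP, "shift": SEL_SHIFT_TOP}


def store_data_select(top_writes_reg, top_rd, top_unit, bottom_is_store, bottom_rt):
    """Pick the store-data source for the bottom pipeline's memory stage."""
    forward = (bottom_is_store and top_writes_reg
               and top_rd != 0 and top_rd == bottom_rt)  # $0 is never forwarded
    return UNIT_TO_SELECT[top_unit] if forward else SEL_REGB_BOTTOM


def store_data_mux(select, reg_b_bottom, alu_top, slt_top, shift_top):
    """Four-input mux over the sources listed in the notebook."""
    return (reg_b_bottom, alu_top, slt_top, shift_top)[select]
```

For the pair <add $1, $2, $3; sw $1, 0($0)>, the top pipeline writes $1 from the ALU and the store reads $1, so the select becomes `SEL_ALU_TOP`.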

        New Files for Integration: (U:\final.zip)
                Bht.v
                Btb.v
                Decode_block.sch
                Execute_block.sch
                M1x2.v
                M32x2.v
                M32x5.v
                M32x8.v
                Mem_block.sch
                Pipeline_bottom.sch
                Pipeline_top.sch
                Superscalar.sch
       Sun May 11 16:49:20 PDT 2003


       Monday May 12 16:04:22 PDT 2003 (~2 hours)
       Goals: Complete modifications to the forwarding unit in pipeline.

               Finished modifications to the forwarding, top and bottom pipelines. These modifications include a
                multiplexor in the bottom pipeline to allow for forwarding of executed values from the top pipeline into the
                memory stage of the bottom pipeline to allow a store to execute.

       Monday May 12 18:11:11 PDT 2003


       Tuesday May 13 09:59:35 PDT 2003 (~3.5 hours)
       Goals: Complete the integration of Jack/David’s issue unit with the superscalar pipelines.

               Modified the forwarding unit to display the case and values being forwarded.

       Tuesday May 13 13:32:22 PDT 2003


       Tuesday May 13 15:10:21 PDT 2003 (~ 8 hours)
       Goals: Begin drafting the report

               Note: There is an error in the input to the branch history table. The BHT decTaken signal should
                actually be a mispredict signal, asserted only when the predTaken signal doesn’t match the taken
                signal from the comparator in the decode stage.
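One way a mispredict input could drive a four-state counter is sketched below. It assumes a 2-bit saturating counter (states 0 to 3, with 2 and 3 predicting taken), which the notebook does not confirm; the function names are hypothetical.

```python
def mispredict(pred_taken, taken):
    """Asserted only when the prediction disagrees with the decode comparator."""
    return pred_taken != taken


def update_on_mispredict(state, mispredicted):
    """Advance an assumed 2-bit counter using only the mispredict signal."""
    pred_taken = state >= 2
    # XOR of prediction and mispredict recovers the actual branch outcome.
    actual_taken = pred_taken != mispredicted
    if actual_taken:
        return min(state + 1, 3)
    return max(state - 1, 0)
```

Because the stored state already implies the prediction, the mispredict bit carries the same information as the raw taken signal, so no extra state is needed.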

       Tuesday May 13 23:01:11 PDT 2003


       Wednesday May 14 10:02:52 PDT 2003 (~3 hours)
       Goals: Complete power point slides for presentation

               There are a few more slides than the eight Prof. Kubi requested, but we should be well under the time limit.
               Completed the slides and have e-mailed the group. Heading out to Berkeley to discuss and rehearse.
               Completed the final version of the slides, pending performance data to be collected tomorrow.

       Wednesday May 14 20:01:19 PDT 2003


       Saturday May 17 11:14:23 PDT 2003 (~4 hours)
       Goals: Integrate the branch prediction with the superscalar processor

               Seems like there are fundamental issues in the interface between the processor and the branch prediction
                hardware.
               Leaving to study for finals.

       Saturday May 17 15:12:01 PDT 2003


David Lee

       * Index ==============================================================
       Estimated time spent in lab 115.5 hours
+ ====================================================================
Fri May 09 13:15:28 PST 2003

Goal: Finish making new instruction cache

new files
instboardRAM.v
instboardRAM_tf.tf
instdirectMappedCache.v
instdirectMappedCache_tf.tf
instcache_block.sch
instcache_controller.v
instcache_controller.sym
instdirectMappedCache.sym
IssueUnit.v
fifo.v
fifotest.v
issuetest.v

does addu forward to sw if they are issued at same time?

hopefully it should

Sat May 10 02:35:45 PST 2003
- ====================================================================
+ ====================================================================
Sat May 10 12:06:33 PST 2003

Goal: Finish testing and debugging issue unit and the fifo buffer

new files
fake_icache.v

fake_icache simulates the instruction cache and outputs blocks of instructions.
Must put in your instructions manually into the if statements in the
fake_icache.v.

In the fifo, the write signals were off by one cycle because it was using
the curState to determine the values. I now set it to nextState so that
it would output as soon as we want to write.

This caused some glitching in the we (write enable) signal, so I changed it back to curState to rethink the problem.

rt register of branches assigned to rb in issue unit because they act more like the
Rtype operand registers than a destination register. might have to do it for sw but not sure.
the rw is set to 0.

Sun May 11 03:45:16 PST 2003
- ====================================================================
+ ====================================================================
Sun May 11 15:21:57 PST 2003

Goal: Finish testing issue unit with branching and implement PC stuff

for bgez we will stall one cycle if the previous instruction writes to
register 1, because the rt field (1 for bgez, 0 for bltz) is what determines
whether it is bgez or bltz, and the issue unit treats that field as register rb.
Bltz uses 0 so it does not stall. This is no big deal. It just means we
lose one cycle for bgez, but it is functionally correct. Otherwise we need to
change the issue unit so that for bgez it does not check
rb == busy_rw or busy_lw_rw.
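The false stall can be reproduced with a small model of the hazard check. The REGIMM opcode and the rt codes for bltz and bgez are standard MIPS encodings; everything else (the function names, the single `busy_rw` register, and treating every instruction as reading both rs and rt) is a simplification assumed for the example, not the group's IssueUnit.v.

```python
REGIMM = 0b000001  # MIPS opcode shared by bltz (rt field 0) and bgez (rt field 1)


def encode_regimm(rs, rt_code, offset=0):
    """Build a 32-bit REGIMM branch word; rt_code selects bltz (0) or bgez (1)."""
    return (REGIMM << 26) | (rs << 21) | (rt_code << 16) | (offset & 0xFFFF)


def source_regs(instr, regimm_aware):
    """Registers the issue check compares against the busy destination."""
    opcode = (instr >> 26) & 0x3F
    rs = (instr >> 21) & 0x1F
    rt = (instr >> 16) & 0x1F
    if regimm_aware and opcode == REGIMM:
        return {rs}      # rt is an opcode extension here, not a register
    return {rs, rt}      # simplified: treat rt as a source (rb) otherwise


def must_stall(instr, busy_rw, regimm_aware=False):
    """Stall when a source matches the busy destination; $0 never counts."""
    return busy_rw != 0 and busy_rw in source_regs(instr, regimm_aware)
```

With `regimm_aware=False`, a bgez following a write to $1 stalls even though it only reads rs; with `regimm_aware=True` the extra cycle goes away, while a real dependency on rs still stalls.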

Issue unit works for branches too.

Need to figure out how to handle break.

for the jal/add case where the add depends on the jal, we need to forward
the jal calculated address to the add in the execute stage in the second
pipeline.

figure out how to do branches
zzzZZZzzzZZZ
Mon May 12 04:57:22 PST 2003
- ====================================================================
+ ====================================================================
Mon May 12 18:31:18 PST 2003

Goal: get jal, branch, jr, break working and hopefully integrate tonight.

fixed the jal branch and jr maybe. only tested the branches. Not sure
what to do with break. When do we stall?

okay we are going to put stuff together.

In the Decode Stage, the CLK and RESET signals were flipped in the symbol

MAKE SURE ALL PINS ARE CONNECTED IN SCHEMATIC AND SYMBOL!!! VERY IMPORTANT
WASTE OF 1 HR

Branch taken signal changes too quickly don't know why yet

Tue May 13 04:44:11 PST 2003
- ====================================================================
+ ====================================================================
Tue May 13 11:03:59 PST 2003

***ignore***
In the fifo unit, in the logic for next free entry, the curstate was
changing too quickly to idle before you add. If you move the delay into
the if else statement it works, but it seems strange because it might be
a violation of the hold time for the nextfreeentry register.

Ignore previous statement. It failed.
***ignore***

we added stalling to the nextstate and write enable logic so that we don't
write or change stage if we are stalling. Need to verify with Jack.

We integrated all the pieces. It seems to be working for the most part, but
there are lots of bugs, especially with the issue unit and the fifo talking
with the rest of the pipeline. Main problems are branches and jumps.

Tomorrow we will fix all these bugs and hopefully integrate the branch prediction unit
Wed May 14 04:12:44 PST 2003
- ====================================================================
+ ====================================================================
Wed May 14 10:04:11 PST 2003

goal: must get everything working

added new states to stall when there is a branch. The problem is that because
the issue unit and the decode are done in the same stage, we have a very long
critical path.

Problem 2: because we must wait to issue before we can read from the registers, we
have trouble doing branching really quickly. Instead we must issue the branch,
set a flag, and then on the next time around, we check the flag and know that
the registers are ready for comparison. This is very inefficient but we don't
have time to redesign.

Problem 3: we have a forwarding problem with jal. Because of the way the datapath was originally
designed, the jal cannot forward. It passes through different registers so either
we change the forwarding unit to take this into account, change the datapath (no time for this),
or we figure out some way to automatically write into the register file.

We decided to go for the register file change because it is simple and quick.

Seems to work fine.

Thu May 15 05:55:31 PST 2003
- ====================================================================
+ ====================================================================
Thu May 15 12:15:02 PST 2003

I need to sleep. I need lots of sleep. Cannot keep functioning like
this.

In our implementation, jr acts more like a branch than a jump. It gets the
register value after it is issued so we must add new flags like branch
and make jr act like a branch.

We stall unnecessarily on instruction stalls. This sucks because it ruins our
speedup. We don't have much time to figure out how to fix it, but if we could,
we would never stall unless the fifo buffer is empty. It would take quite a bit
of redesigning or a terrible hack to get it to work.

all of our tests pass. Corner, verify, and base also work. Quicksort does not.

A change in quicksort causes quicksort to work but base to fail. We cannot keep
fooling around with stalling. The biggest and only problem in our implementation
is dealing with the stalls because it ruins our issue timing.

This really bites we can't finish right now. Jack and I are dead tired and cannot
function much more.

We are going to sleep and maybe try only a few more hours tomorrow to get it working
we are so close.

Fri May 16 07:33:51 PST 2003
       - ====================================================================
       + ====================================================================
       Sat May 17 13:27:20 PST 2003
       Yay Jack got everything working. Now i must put in the branch prediction unit.

       crap, the branch prediction unit does not interface properly with our
       issue unit. It depends too much on control from issue unit. Also, we need
       to figure out how to bypass the fetch unit because we don't have enough control
       signal to switch between the blockpc and the predpc.

       Okay it doesn't look like the branch prediction unit is going to work yet.
       We had to add in new ports. Basically the branch prediction needs to
       look at the block and by itself check for branches and predict if it is going to branch.

       We still have some sort of branch prediction because we always predict fall through
       and if it is right we don't stall.

       Sat May 17 15:15:15 PST 2003
       - ====================================================================
       + ====================================================================
       Sun May 18 21:52:22 PST 2003
       Stupid jal WAW hazard.

       the globalpc is missing an 'f' in the beginning. This did not affect anything
       because we recovered to the correct pc after we jump.

       Can't fix jal WAW hazard


       Mon May 19 02:04:56 PST 2003
       - ====================================================================
       + ====================================================================
       Mon May 19 11:24:31 PST 2003
       Stupid jal WAW hazard again.

       changed it so that if stalling, jaldont_1 still gets updated to jaldont
       in regfile

       changed the stall from inststall to data_stall because only then
       do we actually stall the entire pipeline. Otherwise instructions
       after the jal will not get to commit their values.

       All tests pass except jack jal test for the jal WAW. It mostly works
       but does not get to the last line. Don't know why. But I think it could
       be just the test code.

       Mon May 19 13:37:33 PST 2003
       - ====================================================================

Lyle Takacs

       +==================================
       Thu May 8 11:21:12 PDT 2003
       work on creating blocks:
       decode, execute, memory, pipeline
Top pipeline does not need any memory forwarding, but needs to pass data to
bottom pipeline for memory stores
+==================================


+==================================
Fri May 9 10:48:23 PDT 2003
Build forwarding symbol block, 'duplicate' code for cross pipeline
forwarding

Need forwarding to decode blocks
+==================================


+==================================
Sat May 10 11:01:47 PDT 2003
Put all blocks together into datapath, test

Still need PC, issue, inst cache, regfile


Done initial tests, now going to work on BHT and BTB with Ben

We went with 256 entries to refrain from thrashing too much and to minimize
map time to the board.

Done with BTB, but BHT needs a little more work and to be tested.
+==================================



+==================================
Mon May 12 10:25:54 PDT 2003
fix up monitor
clock cycle, more data info, better formatting

integrate issue unit - add regfile, cache, decode stage

test!
oops forgot to put in synthesis drams...
+==================================


+==================================
Tue May 13 11:12:43 PDT 2003
more monitor fixes
added more testing stuff, better stall handling

lots of testing! mostly all fixes with issue unit. Most detail in Jack
and/or David's logs
+==================================

+==================================
Thu May 14 12:20:34
jal data should go through normal data path, not handled in wb stage
makes forwarding much more difficult
i don't think the mem input mux should be on right side of register
-it's ok, we only write into buffer so data doesn't need to be ready at
posedge clk

i'm done... need to study for final tomorrow, but first need to finish take
home final due tomorrow
rest of group cannot work past tonight so I hope they have good luck!
+==================================
Appendix II – Schematics

Top Pipeline Schematic
Bottom Pipeline Schematic
Cache Block Schematic
Decode Block Schematic
Execute Block Schematic
Instruction Cache Block Schematic
Memory Block Schematic
Superscalar Datapath Lower Right Schematic
Superscalar Datapath Upper Left Schematic
Superscalar Datapath Lower Left Schematic
Superscalar Datapath Upper Right Schematic
Superscalar Datapath Overall Schematic
Appendix III – Verilog


alu.v                     lvlZeroBoot.v
arbiter.v                 m1x2.v
bht.v                     m32x2.v
branchCalc.v              m32x3.v
btb.v                     m32x5.v
bts32.v                   m32x6.v
cache_controller.v        m32x8.v
ClockDivider.v            m5x3.v
comp.v                    mem_write.v
constants.v               memory_control.v
controller.v              memoryio.v
counter.v                 monitor.v
CPU.v                     pipeline_registers.v
debouncer.v               rdBuffer.v
decode_stage.v            regfile.v
directMappedCache.v       releaseUnit.v
extend.v                  shifter.v
fifo.v                    sll2.v
forward.v                 SLT.v
inBoot.v                  ss_test.v
instcache_controller.v    TwoWayCache.v
instdirectMappedCache.v   upedge_detector
IssueUnit.v               wbBuffer.v
jconcat.v
Appendix IV – Testing

Issue Unit Test Files

beq.s                                                       fake_icache_beq.v
bgez.s                                                      fake_icache_bgez.v
bltz.s                                                      fake_icache_bltz.v
bne.s                                                       fake_icache_bne.v
branch.s                                                    fake_icache_dave_forwardDependency.v
forwardDependecy.s                                          fake_icache_harderJump.v
forwardDependecy_dave.s                                     fake_icache_jump.v
harderJump.s                                                fake_icache_nodependency.v
jalWAW.s                                                    fake_icache_simpleBranch.v
jump.s                                                      fake_icache_simpleJump.v
noDependecy.s                                               fake_icache_swlsDepend1.v
simpleBranch.s                                              fake_icache_swlsDepend2.v
simpleJump.s                                                fake_icache_swlwNodepend.v
swlwDepend.s
swlwDepend1.s
swlwDepend2.s
swlwDepend3.s
swlwNoDepend.s

Note: Trace files are very large. Open them with WordPad.

Forwarding Test Files
Forwarding Hazard Test Code
Forwarding Hazard Test Trace


Branch Prediction
Branch History Table Test-bench
Branch Translation Buffer Test-bench


General
Base Trace
Base I/O Output

Corner 1 Trace
Corner 1 I/O Output

Quicksort 1 Trace
Quicksort 1 I/O Output

Corner 2 Trace (aka Dumber)
Corner 2 I/O Output

Prime Trace (aka Dumb)
Prime I/O Output
