Final Project
Super-Scalar
Stream Buffer
Victim Cache
CAM-based Cache



Group
Steve Fang
Kent Lin
Jeff Tsai
Qian Yu
Abstract



The final design is a super-scalar MIPS microprocessor, which handles two instructions in parallel when there are no dependencies between them. Because more instructions are executed per cycle than in the single-issue processor, instruction cache misses become a significant bottleneck, so a stream buffer is included to relieve this problem. A victim cache is also added to reduce the miss rate of the data cache. Both features are solutions to the problem of a very costly DRAM access. Finally, a CAM-style cache organization has been added to the first-level cache.



Division of Labor


Implementation of the super-scalar design was split into three parts: schematic drawing and wiring, control logic and hazard handling, and memory. Memory consists of the memory controller for the new DRAM component, the stream buffer, and the victim cache. Jeff handled the memory controller, Qian worked on the datapath drawing and wiring, Steve worked on the stream buffer, and Kent worked on the victim cache. Each member was responsible for testing and verifying his assigned component. In addition, everyone contributed to the control logic and hazard handling because of the complexity of that subject.
Detailed Strategy



Datapath

The structure of the datapath itself is fairly simple. Much like the single-issue, 5-stage pipeline covered in class, the super-scalar version features the same stages, but many components are duplicated to allow for complete parallelism. The datapath is also aligned such that only even instructions can go through the top pipeline, while only odd instructions can enter the bottom one. A high-level diagram is given below.


[High-level diagram: a shared Fetch stage feeds the even and odd pipelines, each with its own Decode and Execution stages, converging at the shared Memory and Write-back stages.]



In general, many components of the datapath were duplicated for the parallel execution of the even and the odd instructions, while some components were given more ports. As a result, the instruction bandwidth of the datapath increases, and the CPI of the processor decreases.


The following table gives a big picture of how the classic 5-stage datapath was changed at each pipeline stage.

                     Table 1. Datapath Modifications

STAGE  COMPONENTS DUPLICATED                COMPONENTS WITH WIDER BANDWIDTH
IF     Pipeline registers                   Instruction cache outputs
ID     Instruction decode controller,       Register file outputs;
       extender, branch comparator,         hazard controller inputs/outputs;
       forwarding muxes, branch PC bus,     stall unit inputs;
       jump PC bus, pipeline registers      forwarding mux inputs
EX     ALU, SLT unit, shifter,              None
       pipeline registers
ME     Pipeline registers,                  None
       memory source muxes
WB     Register file source select mux      Register file inputs; monitors


In the instruction fetch stage, the output bandwidth of the cache is doubled so that the datapath can fetch the even and odd instructions in parallel. Since the even and odd instructions sit in the same 2-word cache line, this change was made in the cache datapath. The data cache of the MEM stage, on the other hand, does not need to be modified, because our memory system only allows one memory access at a time and the stall unit of the decode stage separates two parallel memory-access instructions.


The major modifications to the datapath are in the instruction decode and execution stages. Components that operate on an individual instruction, such as the branch comparator, extender, and ALU, are duplicated. The hazard and stall units, however, must take instructions from the ID, EX, and MEM stages of both the even and odd pipelines in order to determine dependencies. One special case is the register file: to work with the super-scalar scheme, it must be able to read and write two different registers, or the same register, in parallel. If both the even and the odd instructions write back to the same register file location, the odd instruction must win in order to prevent a WAW hazard.
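The write-port priority rule can be sketched as follows. This is our own illustration in Python (the design itself is schematic/VHDL), with invented names; it shows that applying the odd instruction's write last makes it win a same-register conflict.

```python
# Sketch (not from the report) of the write-back priority rule: when both
# pipelines write the same register in one cycle, the odd (younger)
# instruction's value wins, preserving program order.
def writeback(regfile, even_wr, odd_wr):
    """Each write is (dest_reg, value) or None; register 0 stays 0."""
    for wr in (even_wr, odd_wr):          # odd is applied last, so it wins
        if wr is not None and wr[0] != 0:
            regfile[wr[0]] = wr[1]
    return regfile
```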


Beyond these general modifications, the 2-way super-scalar pipeline adds complexity to calculating new PCs. The following table shows the PC calculation for the different instruction types.
                         Table 2. PC Calculation

INSTRUCTION TYPE              NEW PC CALCULATION
Arithmetic, Logic, Memory     PC = PC + 8
Even Branch                   PC = PC + 4 + Branch Value
Odd Branch                    PC = PC + 8 + Branch Value
Jump, Jr                      PC = Jump Value
Even JAL                      PC = Jump Value; $R31 = PC + 4
Odd JAL                       PC = Jump Value; $R31 = PC + 8
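Table 2 can be summarized as a small function. This Python sketch is ours (the function name and the `kind`/`slot` encoding are not from the report); it only mirrors the table's arithmetic.

```python
# Sketch of Table 2's next-PC rules for the 2-way pipeline.
# pc is the address of the even slot of the current instruction pair.
def next_pc(pc, kind, slot=None, branch_offset=0, jump_target=0):
    if kind == "seq":                     # arithmetic / logic / memory pair
        return pc + 8
    if kind == "branch":                  # base depends on which slot branches
        base = pc + 4 if slot == "even" else pc + 8
        return base + branch_offset
    if kind in ("jump", "jal"):           # JAL also links $31 (not shown here)
        return jump_target
    raise ValueError(kind)
```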


To give a more intuitive view of the implementation, schematic captures of the branch and jump PC calculations are presented below.

Picture 1 shows the implementation of the branch PC calculation. We cannot use PC[31:0] directly to compute the even branch address, since PC = PC + 8 in the super-scalar pipeline. Therefore PCD[31:3] (the PC of the decode stage, the "current" PC) is used as the base PC, and the result signal of the even branch comparator is used as the third bit, which in effect adds 4 to the base PC.
                           Picture 1. Branch PC calculation




Picture 2 shows the implementation of the jump PC calculation. The first-level mux selects between Jump and Jr instructions, whereas the second-level mux selects between the even and odd instructions.
                          Picture 2. Jump PC calculation
Super-scalar Structure
As mentioned before, the microprocessor features two pipelines for parallel operation, and complicated hazard and stall cases arise from this.



Stalls
There are six different stall cases that must be handled. The first two occur when either instruction in the execution stage is a branch or jump. In that case the delay slot must be handled. More precisely, if the branch occupies the even slot, then its delay slot has already been executed in parallel with it, so the instructions in the decode stage (both even and odd) must be replaced with no-ops (see Figure 1).
[Figure 1: a branch in the even pipeline with its delay slot alongside it in the odd pipeline.]

       Figure 1 – Both instructions in the decode stage need to be ignored.




However, if the branch is in the odd pipeline, then the even instruction at the decode stage must be executed while the odd one is swapped with a no-op (see Figure 2).
[Figure 2: a branch in the odd pipeline; its delay slot is the even instruction of the following pair, and the instruction after it must be ignored.]

 Figure 2 – Since the branch now occurs in the odd pipeline, the delay slot comes after, but the following instruction must be ignored.




The third case occurs when an instruction in the execution stage is a JAL or a LW and the even instruction in the decode stage depends on it. When that is the case, forwarding cannot solve the dependency and both instructions in the decode stage must be stalled.

[Figure 3: LW $1, 0($10) followed by ADDIU $1, $3, 2, which uses the loaded register.]

Figure 3 – A load word followed by an instruction that uses the loaded data forces a stall in both pipelines.




The fourth stall case is when the same thing happens for the odd instruction at the decode stage. Here the even instruction is allowed to execute while the odd instruction is stalled for one cycle. Afterwards, the even instruction is replaced with a no-op, since it has already executed once, and the odd instruction is permitted to enter the execution stage. The fifth case is when both instructions in the decode stage access memory. Since there is only one data cache, a parallel access is impossible, so the even instruction goes first while the odd is stalled until the following cycle. Finally, the last case occurs when the odd instruction in the decode stage depends on the even instruction in the same stage. Because both instructions are still in the decode stage, nothing has been calculated yet, so forwarding is impossible. This is handled identically to the previous case: the odd instruction is stalled for one cycle (Figure 4).

[Figure 4: SW $1, 0($10) in the even slot paired with SW $3, 4($10) in the odd slot.]

Figure 4 – These two instructions cannot be executed in parallel, so the even instruction is executed first while the odd one is stalled until the following cycle.
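The six stall cases can be condensed into a decision sketch. The Python below is our paraphrase, not the actual hazard/stall VHDL; the instruction fields (`branch`, `load`, `jal`, `mem`, `dest`, `src`) are made-up encodings, and the fourth case's second-cycle no-op replacement is not modeled.

```python
# Condensed sketch of the six stall cases for the decode-stage pair.
# Each instruction is a dict of flags; returns an action per decode slot.
def stall_decision(ex_even, ex_odd, id_even, id_odd):
    # Cases 1-2: branch/jump in EX -> handle the delay slot in ID.
    if ex_even.get("branch"):
        return ("noop", "noop")           # delay slot already ran in parallel
    if ex_odd.get("branch"):
        return ("issue", "noop")          # even ID slot is the delay slot
    # Cases 3-4: LW/JAL in EX feeding an ID instruction; no forwarding.
    producers = {i.get("dest") for i in (ex_even, ex_odd)
                 if i.get("load") or i.get("jal")}
    if producers & set(id_even.get("src", ())):
        return ("stall", "stall")         # even depends: both wait
    if producers & set(id_odd.get("src", ())):
        return ("issue", "stall")         # even goes, odd waits a cycle
    # Case 5: two memory accesses can't share the single data cache.
    if id_even.get("mem") and id_odd.get("mem"):
        return ("issue", "stall")
    # Case 6: odd depends on even in the same stage; nothing computed yet.
    if id_even.get("dest") in id_odd.get("src", ()):
        return ("issue", "stall")
    return ("issue", "issue")
```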




Because of the alignment of the pipelines, a major problem occurs when there is a branch or jump to an odd instruction. Instructions come in pairs on this processor, so when the target of the branch is the odd instruction, it is crucial that the even instruction paired with it (the instruction just before the target) is not executed. A branch handler therefore watches for branches and jumps to odd addresses. Once the branch is taken, the stall handler discussed above handles the delay slot so that only one instruction is executed there. Then, after the new PC is calculated and its corresponding instructions are fetched, the branch handler checks the target PC. If the target PC is even, the pipeline acts as normal; if the target is odd, the handler ignores the even instruction (see Figure 5).


[Figure 5: a branch in the even pipeline, its delay slot in the odd pipeline, and the branch target landing in an odd slot.]

     Figure 5 – If a branch/jump targets an odd instruction, the corresponding even instruction must be ignored. This is handled by a special branch/jump handler that is separate from the stall handler.



Stream Buffer

The stream buffer sits between the first-level cache and the DRAM. With a slow memory access time, an instruction fetch on every cycle would be very inefficient. The stream buffer improves efficiency by doing a burst to get four instructions on every stream-buffer miss. One normal read from the DRAM takes 9 cycles, but one burst (two consecutive reads) takes 10 cycles. There are three cases in which a burst occurs.

Case 1: At the very beginning of the program, the stream buffer is empty and needs to fetch the first four instructions from the DRAM.

Case 2: A branch/jump targets an instruction that is in neither the stream buffer nor the first-level cache.

Case 3: Two hits in the stream buffer are followed by a miss, meaning that the next instruction has not been prefetched.



Stream Buffer Controller

The stream buffer controller works on the positive edge of a phase-shifted clock; the delay of the phase shift is 10 ns. The reason for this phase-shifted clock is that the registers in the first-level cache work on the negative edges of the clock, and the request signal from the first-level cache rises 5 ns after a cache miss. By using the phase-shifted clock, we can send a fake wait_sig to the cache, telling it to do an instruction stall (before the negative edge of the normal clock) whenever there is a miss in the stream buffer. The trick is to operate between the phase-shifted clock for the stream buffer controller and the negative edge of the normal clock for the cache and pipelined datapath.



Without the stream buffer, the first-level cache operates mainly by watching the wait_sig from the DRAM. With the stream buffer, the stream buffer controller takes in the wait_sig from the DRAM and sends a fake wait_sig to the first-level cache. This is necessary because during a burst the DRAM controller drops wait_sig low for one cycle, and that raw wait_sig cannot be sent directly to the first-level cache; the fake wait_sig stays continuously high, as in the original specification. The dip is important to the stream buffer controller, however: the first dip enables the first half of the stream buffer to store the first two instructions, and when wait_sig goes low for the second time, the controller enables the second half of the stream buffer to store the next two instructions. An example of the wait_sig timing from the DRAM during a burst is shown below.




[Figure 6: timing diagram of wait_sig from the DRAM, normally 1, dipping to 0 twice during a burst.]

             Figure 6 – Example of wait-signal behavior during a burst read.

The following is the pseudo-code:

If (no match in buffer) then
    If (request_from_cache = '0') then
        Wait_sig_to_cache := '0';
        Request_to_DRAM := '0';
    Elsif ((request_from_cache = '1') and (wait_sig_from_DRAM = '0')) then
        Request_to_DRAM := '1';
        Wait_sig_to_cache := '1';
        If (wait_sig_from_DRAM = '0' for the first time) then
            Enable_for_first_half_register := '1';
            Enable_for_second_half_register := '0';
        Elsif (wait_sig_from_DRAM = '0' for the second time) then
            Enable_for_first_half_register := '0';
            Enable_for_second_half_register := '1';
        Else
            Enable_for_first_half_register := '0';
            Enable_for_second_half_register := '0';
        End if;
    End if;
Else
    Wait_sig_to_cache := '0';
    Request_to_DRAM := '0';
End if;
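As an executable paraphrase of the pseudo-code (ours, not the actual controller), the Python below makes the dip counting explicit as a `dips_seen` value carried between steps; signal and function names are invented.

```python
# One evaluation of the stream-buffer controller: on a buffer miss with a
# pending cache request, each low pulse ("dip") of wait_sig_from_DRAM
# latches one half of the buffer during the 4-instruction burst.
def controller_step(match, request_from_cache, wait_from_dram, dips_seen):
    """Returns (wait_to_cache, request_to_dram, en_first, en_second, dips_seen)."""
    if match or not request_from_cache:
        return (0, 0, 0, 0, dips_seen)
    if wait_from_dram == 0:               # DRAM just delivered two words
        dips_seen += 1
        en1 = 1 if dips_seen == 1 else 0  # first dip: first half register
        en2 = 1 if dips_seen == 2 else 0  # second dip: second half register
        return (1, 1, en1, en2, dips_seen)
    return (1, 1, 0, 0, dips_seen)        # still waiting on the DRAM
```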



Once the four instructions are in the stream buffer, every first-level cache miss that hits in the buffer costs a penalty of only one cycle. Every time the cache controller sends a request and the PC to the stream buffer, the stream buffer compares the PC with the address tag associated with each half of the buffer and, based on the comparator results, selects the matching half to output to the cache.
[High-level schematic of the stream buffer: the 10-bit address tag from the first-level cache is compared against both internal address tags to detect a hit. Even and odd data from the DRAM (a 64-bit bus) feed four 32-bit registers, one pair per address tag; the comparator results select a mux driving two 32-bit buses to the first-level cache. The stream buffer controller takes the hit signal, the wait_sig from the DRAM, and the request and address from the first-level cache, and produces the wait_sig to the first-level cache, the request and address to the DRAM, the register enable signals, and the address to the internal tag registers.]
Victim Cache

The implementation of the victim cache is similar to the first-level cache in terms of control logic and schematic design. The main differences are in the handling of data input and output. The victim cache holds four cache lines, is fully associative, and uses a FIFO (first in, first out) replacement policy. The reason for using FIFO instead of random replacement, as in the first-level cache, is that random replacement is more effective for larger caches. The small size of the victim cache makes it probable that certain blocks would be replaced more often than others; because the victim cache writes out to the DRAM on a miss, replacing blocks more frequently would raise the AMAT (hit time + miss rate * miss penalty). A FIFO policy ensures each block has the same chance of being replaced. LRU is theoretically the best replacement policy, but it is too difficult to implement with more than two sets.
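The AMAT argument can be checked numerically. This Python sketch is ours, with made-up example numbers rather than measurements from this processor; it shows how a victim cache lowers the effective miss penalty when some L1 misses hit in the victim cache instead of going to DRAM.

```python
# AMAT = hit_time + miss_rate * miss_penalty (all times in cycles).
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# With a victim cache, an L1 miss that hits in the victim cache pays a
# small penalty; only the remaining misses pay the full DRAM penalty.
def amat_with_victim(hit_time, l1_miss_rate, vc_hit_rate,
                     vc_hit_penalty, dram_penalty):
    penalty = vc_hit_rate * vc_hit_penalty + (1 - vc_hit_rate) * dram_penalty
    return hit_time + l1_miss_rate * penalty
```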
[Top-level schematic of the victim cache in the memory hierarchy: the first-level cache (valid / dirty / Tag[9:0], Word 0, Word 1) connects to the datapath above and to the victim cache below. The victim cache's four cache lines sit between holding registers reg1 and reg2, which connect onward to the arbiter and DRAM.]

Because the victim cache is fully associative, each cache-block component contains a comparator to determine a hit from the address and tag. However, the input to each cache block comes only from the first-level cache, so the victim-cache blocks are simplified versions of the first-level cache blocks. Schematically, the victim cache sits between the first-level cache and the DRAM. The goal is to make the victim cache transparent to the system, so that the first-level cache believes it is requesting directly from the DRAM. The victim cache therefore intercepts all intermediate signals and must output data as well as control lines. Muxes select which cache block to output and where to output it (first-level cache or memory). To avoid losing cache data while swapping the two cache lines, two additional registers hold the outputs from the first-level cache and the victim cache. In addition to sending data back to the first-level cache, the victim cache must also send the dirty bit; the first-level cache needs to know whether the incoming data is dirty so that the correct replacement behavior is used on subsequent memory accesses.



Changes to the datapath are local to the memory system. The first-level cache needs extra output signals to send the tag, valid, and dirty bits to the victim cache, and an extra input to receive the dirty bit from the victim cache. The DRAM needs a mux to choose the correct burst signal from the victim cache or the stream buffer.



Victim Cache Control

The first-level cache control works on the rising edge of the clock while the memory controller works on the falling edge. To act transparently, the victim cache must work within this window and output its control signals at the correct times for both the first-level cache and the memory controller. A delayed clock is used to set up this timing. The following RTL describes the behavior of the victim-cache control.



if (first_level_cache_request = 1) then
    reg1 <= 1st level cache line
    reg2 <= victim cache line (hit or replace)
    if (replaced_block = dirty) then
        DRAM <= reg2
    end if
    if (victim_cache_hit = 1) then
        1st level cache <= reg2
    else
        1st level cache <= DRAM
    end if
    victim cache <= reg1
end if
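The swap behavior in this RTL can be modeled in a few lines of Python. This is our behavioral sketch (the class and method names are invented), showing reg1/reg2-style swapping with FIFO replacement; writing a dirty evicted line back to DRAM is noted only in a comment.

```python
from collections import deque

# Behavioral model of the victim-cache swap: on an L1 miss, the evicted L1
# line and the victim-cache line pass through holding registers so the
# swap loses no data; FIFO picks the replacement slot on a victim miss.
class VictimCache:
    def __init__(self, nlines=4):
        self.n = nlines
        self.lines = {}                   # tag -> line data
        self.fifo = deque()               # insertion order for replacement

    def access(self, evicted_tag, evicted_line, want_tag):
        """Swap the evicted L1 line in; return (hit, line_for_L1_or_None)."""
        hit = want_tag in self.lines
        out = self.lines.pop(want_tag, None)      # reg2 <= victim line
        if hit:
            self.fifo.remove(want_tag)
        elif len(self.lines) >= self.n:           # miss: FIFO replacement
            victim = self.fifo.popleft()
            self.lines.pop(victim)                # a dirty line would go to DRAM
        self.lines[evicted_tag] = evicted_line    # victim cache <= reg1
        self.fifo.append(evicted_tag)
        return hit, out
```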



Control Signals

-   To the first-level cache

Because the first-level cache interacts with the DRAM solely through the wait signal, the victim cache must simulate that wait signal to get the appropriate response from the first-level cache. The wait signal must be set high when the victim cache is writing a cache line to the DRAM or when the first-level cache is trying to read from the DRAM.

-   To the DRAM

Because the victim cache performs only one memory access at a time, the burst signal is held low. The request signal comes only from the victim cache; any request from the first-level cache is first interpreted by the victim cache to determine which memory requests, if any, to perform.
CAM-based cache

The major difference between the CAM-based cache and a standard cache is how the address tags are stored and handled. In the CAM-based cache, the addresses are stored in a completely separate register from the data words. Each tag register is accompanied by a comparator, whose function is to determine whether the register matches the input address. The match signals from each register are used to output a hit data word quickly, instead of waiting on the control logic.



Top Level Schematic of CAM-Based Cache

[Schematic: a 10-bit address and 32- or 64-bit data enter the control logic and a CAM array of 8 cache lines x 10 bits; the array's match outputs feed a priority encoder.]
The general CAM-based design follows the scheme above. At minimum, the CAM-based cache must output a hit signal and a line-select signal. All comparators work in parallel, each producing a single-bit match signal, and the eight match signals feed an 8-to-3 priority encoder. The hit signal (the OR of the match signals) tells the cache control whether the data resides in the cache or a DRAM request is necessary. The line-select signal drives an 8-to-1 mux that chooses which cache line's data to output. To make the cache produce a hit value as quickly as possible, both the encoder and the 8-to-1 mux are built out of gates.
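The match/encode path can be mimicked in software. This Python sketch is ours (the real design is built from gates); it shows the parallel comparators, the OR-reduction that forms the hit signal, and a priority encoder picking the lowest matching line.

```python
# Software model of the CAM lookup: 8 parallel tag comparators, an OR
# of the match bits for the hit signal, and an 8-to-3 priority encoder
# producing the line-select for the 8-to-1 data mux.
def cam_lookup(tags, addr):
    """tags: the 8 stored address tags; returns (hit, line_select)."""
    matches = [int(t == addr) for t in tags]      # 8 parallel comparators
    hit = int(any(matches))                       # OR of the match signals
    line = matches.index(1) if hit else 0         # priority encoder (lowest line)
    return hit, line
```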



The CAM-based cache was already mostly built during lab 6, so most functional testing was done then. Tests for the final-project implementation were done at the cache level to measure delay time and the improvement over the previous VHDL-based components. Improvement varied with the input patterns, but the output delay improved by a few nanoseconds on average.



Results


Overall, the super-scalar processor works on simple code with no dependencies. It supports all the instructions required in the final processor, but runs reliably only on simple arithmetic programs. When complicated dependencies overlap, the controller sometimes fails to insert no-ops correctly and certain tests fail; as a result, it is unable to exit the partial-sums program.



Currently our processor, minus the super-scalar design, has a minimum cycle time of 76 ns. The critical path is in the memory stage, beginning at the control logic of the first-level cache, running to the victim cache, passing data through the arbiter, and ending at the input to the DRAM controller. Because our design has the first-level cache working on the rising edge of the clock, we effectively have only one-half clock period to get the data stable at the memory-controller side. The exact timing of the critical path is determined by the phase shifter, the delay of the victim-cache control, and the load delay of several muxes and gates in the arbiter. The total reaches 38 ns for the half clock cycle, which yields the 76 ns clock cycle. Compared with lab 6, which had a 42 ns clock cycle, it should be noted that lab 6 did not have a victim cache operating on the delayed clock; the victim cache is most likely the main reason for the increase in cycle time.



As for the performance improvements from the super-scalar design and the added memory components, the results are mixed. The longer cycle time already puts the new processor at a disadvantage. In addition, the provided test program, partial_sum, runs tight loops with many memory loads but few memory stores, whereas our processor is best suited to long sequential code with a high volume of memory stores rather than heavy branching. This is because the stream buffer prefetches sequential data and the victim cache increases the effective cache-line capacity. The following chart compares our processor at various stages with the lab 6 processor.




[Bar chart comparing cycle counts (0 to 3000) for the original processor (lab 6), processor + stream buffer, processor + victim cache, and processor + SB + VC.]
Conclusion


In conclusion, this final project took us around 200 hours as a group to complete. The strength of the project is that we kept everything simple: when there was a bug, we knew how to attack the problem. However, when a problem arose with VHDL or Viewlogic, we were often stuck for a while before we could think of a way around it. Two weeks for the final project is very limited; with more time we would definitely improve our hardware components. One improvement to the stream buffer would be to start prefetching the next two instructions as soon as the first two have been fetched by the first-level cache. Another improvement we did not have time for is making the first-level instruction cache a FIFO.

The victim cache currently uses four states during a write-back to memory and a read. The fourth state is a reset state and could probably be pushed to the falling edge of the third state, saving one cycle. Finally, branching and jumping performance could be increased by using a more aggressive stall scheme. Many people surely encountered problems with VHDL and Viewlogic: sometimes the clock would not run after cycling for a certain time, and sometimes the clock would be undefined at the very beginning for no reason. The biggest challenge still lay in getting the VHDL components to work properly. In the stall controller, there was one case the VHDL code never picked up; this was eventually resolved by removing the specific cases and using a more general scheme to handle that one case (the LW dependency when the current instruction was even was combined with the even case).
Appendix

See the attached zip file

				