					                                 Final-Year PROJECT REPORT
                                 CIIT-WAH-CE-DP-54F (revision 1.1)




      Enhancements In MIPS32 Architecture
                 Using Multiple Execution Units




                                                 July 2009




Computer Engineering Department




COMSATS Institute of Information Technology, Wah Campus
                                Acknowledgements
                                Acknowledgements



In the name of God, the most kind and most merciful


We would like to thank our parents and teachers, who kept backing us up all the time, both
financially and morally.


We would also like to thank Sir Rameez Naqvi for his guidance and for encouraging us to work
hard and smart. We found him very helpful when discussing the optimization issues in this
dissertation. His critical comments on our work have certainly made us think of new ideas
and techniques in the field of optimization.


We are grateful to God Almighty, who provides us with resources of every kind so that we may
put them to proper use for the benefit of mankind. May He keep providing us with these
resources, and the guidance to keep helping humanity.
                                      Abstract
The modern world grows ever more dependent on technology. As technology advances,
designs become more complex and timing performance matters greatly: systems must operate
in near real time to provide the best service. Over the last decades the performance of
microprocessors has increased tremendously. The current market focuses more on embedded
microprocessors than on common desktop computers, and various products in the market
depend on these embedded processors. Many features of a modern device can be controlled
by one or more embedded processors, providing reliability, accessibility and efficiency to the
user. There is considerable scope for research in microprocessors, as they provide greater
functionality at lower cost.


The main goal of this project is to enhance a simple MIPS32-based pipelined processor
with two execution units and to reduce the CPI (clocks per instruction) to less than one,
using the same components and memories available in the processor.




                                                       Table of contents
1       INTRODUCTION ............................................................................................................................. 1

2       LITERATURE REVIEW ................................................................................................................. 3

    2.1 DEDICATED PROCESSORS .................................................................................................................... 3
    2.2 RISC ................................................................................................................................................... 3
    2.3 PIPELINING .......................................................................................................................................... 3
    2.4 PIPELINED STAGES .............................................................................................................................. 4
    2.5 PIPELINED REGISTERS ......................................................................................................................... 5
    2.6 PIPELINED CONTROL SIGNALS ............................................................................................................. 6
    2.7 HAZARDS ........................................................................................................................................... 6
    2.8 RESOLVING HAZARDS ......................................................................................................................... 8
    2.9 SUPERSCALAR PROCESSOR ............................................................................................................... 12
    2.10 PIPELINED PERFORMANCE .............................................................................................................. 14
    2.11 SPECULATIVE EXECUTION .............................................................................................................. 14

3       PROJECT DESIGN ........................................................................................................................ 16

    3.1 METHODOLOGY ................................................................................................................................ 16
    3.2 ARCHITECTURE OVERVIEW .............................................................................................................. 20
    3.3 DESIGN DESCRIPTION ....................................................................................................................... 21
        3.3.1      Module 1 ................................................................................................................................ 21
        3.3.2      Module 2 ................................................................................................................................ 21
        3.3.3      Module 3 ................................................................................................................................ 21
        3.3.4      Module 4 ................................................................................................................................ 21
        3.3.5 Module 5 ................................................................................................................................. 21
        3.3.6 Module 6 ................................................................................................................................. 22

4       IMPLEMENTATION ..................................................................................................................... 25

    4.1 DEVELOPMENT STAGES .................................................................................................................... 25
        4.1.1       Development of components ................................................................................................. 25
        4.1.2       Development of a MIPS32 Pipelined processor ................................................................... 25
        4.1.3       Enhancements in MIPS32 Architecture ................................................................................ 25

5       EVALUATION & PERFORMANCE ........................................................................................... 26

    5.1 UNIT TESTING ................................................................................................................................... 26
    5.2 RESULTS ............................................................................................................................................ 32
        5.2.1       Comparison of simple and enhanced version ....................................................................... 32
    5.3 SYNTHESIS REPORT .......................................................................................................................... 34




REFERENCES ......................................................................................................................................... 36




                                                  Table of Figures
FIGURE 1-1 DESIGN PROCESS CYCLE ............................................................................................................. 1
FIGURE 2-1 PIPELINED STAGES AT EACH CLOCK CYCLE ................................................................................ 5
FIGURE 2-2 INSTRUCTION EXECUTION WITH PIPELINING .............................................................................. 5
FIGURE 2-3 TRANSFER OF DATA BETWEEN PIPELINED REGISTER................................................................... 6
FIGURE 2-4 CONTROL SIGNALS PASSING TO THE NEXT STAGE ....................................................................... 6
FIGURE 2-5 DATA DEPENDENCIES AMONG THE INSTRUCTION ....................................................................... 8
FIGURE 2-6 FORWARDING REQUIRED DATA FOR SOLVING DEPENDENCIES .................................................. 10
FIGURE 2-7 ADDING STALL TO THE INSTRUCTION FLOW ............................................................................... 12
FIGURE 2-8 INSTRUCTION EXECUTION WITH SUPERSCALAR PIPELINING ..................................................... 12
FIGURE 2-9 INSTRUCTION EXECUTION WITH SUPERSCALAR PIPELINING ..................................................... 12
FIGURE 3-1 ARCHITECTURE OVERVIEW DIAGRAM ..................................................................................... 20
FIGURE 3-2 PROGRAM COUNTER VIEW ....................................................................................................... 22
FIGURE 3-3 INPUTS AT THE MULTIPLEXERS AT ALU .................................................................................. 23
FIGURE 3-4 DATA FORWARDING AT MEM/WB .......................................................................................... 24
FIGURE 5-1 DATA MEMORY VIEW ON MODELSIM SIMULATION ................................................................... 26
FIGURE 5-2 REGISTER MEMORY VIEW ....................................................................................................... 28
FIGURE 5-3 PROGRAM COUNTER VIEW AFTER BRANCH ............................................................................. 30
FIGURE 5-4 REGISTER FILE VIEW AFTER SIMULATION OF SORTING PROGRAM............................................. 32




1 Introduction
Over the last decades the performance of microprocessors has increased tremendously. The
current market focuses more on embedded microprocessors than on common desktop
computers, and various products in the market depend on these embedded processors. Many
features of a modern device can be controlled by one or more embedded processors, providing
reliability, accessibility and efficiency to the user. There is considerable scope for research in
microprocessors, as they provide greater functionality at lower cost.
The mobile device sector shows a strong trend towards embedded systems: these devices
provide everything in one place, such as video/audio playback and recording, video games,
Internet access, communication facilities and photography. To do all this, the device must have
a processing element that can handle everything in real time while keeping cost in view. Many
of the techniques used in general-purpose processors (i.e. desktop computers) are not suitable
for dedicated processors due to higher power and space requirements, so alternative techniques
are required.
Hardware cost and performance are directly proportional to each other: as performance
increases, the cost of the hardware also increases, and when the hardware must shrink, the
performance decreases. The main goal of this project is to increase the hardware only to some
extent and yet almost double the throughput. CPI (clocks per instruction) has long been a scale
for comparing different machines and designs.
Our design belongs to the category of dedicated processors. The processor is capable of
performing arithmetic and logical instructions with the assistance of jumps and branches
(loops); these instructions allow the processor to do almost everything, so that any piece of
code can be run directly or indirectly (possibly requiring some logical/syntax changes). The
design and implementation of a superscalar processor is complex and time consuming. The
engineering design was an iterative process of specification, analysis, implementation and
synthesis.




                                    Figure 1-1 Design process cycle
In this project a typical MIPS32 pipelined processor is enhanced by adding a new execution
unit (ALU); all the pipeline registers have been doubled, but not the memories. By adding
much less hardware than a whole new processor, the throughput of the machine can be
increased. Our goal was to design a new processor with multiple execution units and to
simulate the hardware using VHDL.
Chapter 2 covers the background material and literature reviewed to understand the intricacies
of this project.
Chapter 3 describes the design formulated for the successful execution of the suggested
techniques. It explains the architecture of our superscalar processor. In the end, this chapter
gives detailed information on each module, explaining the critical methods and properties
required for successful execution.
Chapter 4 explains the approach taken and the issues confronted while implementing the
intended goals. It describes the temporal stages experienced while implementing the design,
as well as the key functions that need special consideration from the viewpoint of
implementation. In the end, it explains the user interface of the implemented program and how
each command is handled and used.
The testing and evaluation of the developed hardware and software is discussed in Chapter 5.
It explains the testing procedure followed, and then the various tests executed on the
application to confirm its proper functioning and that it meets the acceptance criteria. The
results of these tests are summarized at the end, with the conclusions drawn from them.
Finally, we briefly present the conclusions from this project, together with possible future
improvements and additions for better design, implementation and investigation of
enhancements in the MIPS32 architecture using multiple execution units.




2 Literature Review
2.1 Dedicated Processors
Dedicated processors are processors deliberately built and designed to perform specific tasks;
in other words, they serve a specific, dedicated purpose. These processors can still perform all
the basic operations, such as loops and arithmetic and logical operations. A processor that
supports logical operations such as AND, OR and NOT, and that can loop, can handle a wide
range of software problems.
2.2 RISC
RISC stands for Reduced Instruction Set Computer. The idea, championed by a group of
computer scientists, is that a computer can solve big problems through very small and simple
instructions: many small instructions together perform a big operation.
On the other hand there is the opposing concept of CISC (Complex Instruction Set
Computer), which holds that a single complex instruction should perform a big, complex
operation. In this concept the hardware may grow considerably, while the number of
instructions needed for a program decreases.
2.3 Pipelining
Pipelining is a common technique used nowadays to improve the throughput of a processor.
In a single-cycle processor, the clock cycle time depends on the sum of all the component
delays and other latencies:
         Clock Cycle Time (CCT) = ∑ (all component delays) + other latencies
But a single-cycle implementation is uncommon because its performance is not good enough.
Another approach is pipelining. In this technique the clock cycle time (CCT) is determined by
the slowest component in the processor. Pipelining is a technique used to make processors
faster.
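The two clock-cycle formulas can be compared with a toy calculation. The sketch below is illustrative only: the stage delays and overheads are made-up numbers, not measurements from this design.

```python
# Illustrative sketch: single-cycle vs pipelined clock cycle time (CCT),
# following the formula above. All delay values are invented for demonstration.

stage_delays_ns = {"IF": 2.0, "ID": 1.5, "EX": 2.5, "DM": 2.0, "WB": 1.0}
other_latency_ns = 0.5        # wiring/setup latencies in the single-cycle design
register_overhead_ns = 0.2    # setup/propagation cost of each pipeline register

# Single-cycle: the clock must cover every component delay in sequence.
cct_single = sum(stage_delays_ns.values()) + other_latency_ns

# Pipelined: the clock is set by the slowest stage plus register overhead.
cct_pipelined = max(stage_delays_ns.values()) + register_overhead_ns

print(cct_single)     # 9.5 (ns)
print(cct_pipelined)  # 2.7 (ns)
```

With these invented numbers the pipelined clock runs roughly 3.5 times faster, which is exactly the "slowest component" effect the text describes.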
Pipelining is a natural concept in everyday life, e.g. on an assembly line. Consider the assembly
of a car: assume that certain steps in the assembly line are to install the engine, install the hood,
and install the wheels (in that order, with arbitrary interstitial steps). A car on the assembly line
can have only one of the three steps done at once. After the car has its engine installed, it moves
on to having its hood installed, leaving the engine installation facilities available for the next car.
The first car then moves on to wheel installation, the second car to hood installation, and a third
car begins to have its engine installed. If engine installation takes 20 minutes, hood installation
takes 5 minutes, and wheel installation takes 10 minutes, then finishing all three cars when only
one car can be operated at once would take 105 minutes. On the other hand, using the assembly
line, the total time to complete all three is 75 minutes. At this point, additional cars will come



                                                  3
off the assembly line at 20 minute increments. As the assembly line example shows, pipelining
doesn't decrease the time for a single datum to be processed; it only increases the throughput of
the system when processing a stream of data. [1]
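The arithmetic of the car example can be checked with a short script (illustrative only; it encodes the 20-, 5- and 10-minute steps quoted above):

```python
# Illustrative sketch: sequential vs assembly-line completion time for the
# three-step car example (engine, hood, wheels).

steps = [20, 5, 10]   # minutes per step
cars = 3

# One car at a time: every car pays the full sum of all steps.
sequential = cars * sum(steps)                     # 3 * 35

# Assembly line: the first car takes the full sum; after that a car
# finishes every 'slowest step' interval (engine installation, 20 min).
pipelined = sum(steps) + (cars - 1) * max(steps)   # 35 + 2 * 20

print(sequential, pipelined)   # 105 75
```

The 105-minute and 75-minute figures match the text, and the `max(steps)` term is the per-car increment of 20 minutes mentioned above.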
Deep pipelining increases latency – the time required for a signal to propagate through a full
pipe. [1]
A pipelined system typically requires more resources (circuit elements, processing units,
computer memory, etc.) than one that executes one batch at a time, because its stages cannot
reuse the resources of a previous stage. Moreover, pipelining may increase the time it takes for
an instruction to finish. [1]
The pipelined concept is similar to a shift register made up of flip-flops: in a shift register,
bits pass from one flip-flop to the next on each clock cycle. In the same way, in a pipelined
processor instructions pass from one stage to another, and as they pass, specific operations are
performed on them as required.


2.4 Pipelined Stages
Our processor is a five-stage pipelined processor. The number of stages can be increased as
desired; performance may improve to some extent, but the design becomes more complex and
it becomes harder to track down problems in it. The processor is divided into five stages
according to functionality.


    1. Instruction Fetch (IF)
    2. Instruction Decode (ID)
    3. Instruction Execute (EXE)
    4. Memory (DM)
    5. Write Back (WB)


Instruction Fetch (IF)
Instruction Fetch is the first stage of our pipelined processor. In this stage an instruction is
fetched from the instruction memory; the address of the instruction is specified by the program
counter. An n-way pipelined processor must be capable of fetching n instructions at each clock
cycle.
Instruction Decode (ID)
Instruction Decode is the second stage; it determines the requirements of the instruction,
i.e. register access, memory access, and arithmetic and logic unit usage.
Instruction Execute (EXE)



In the third stage, Instruction Execute, the arithmetic and logical operations required by the
instruction are performed. An n-way pipelined processor will have n functional units.


Memory (DM)
In the fourth stage, memory read or write operations are performed as required; at this stage
either a memory read or a memory write can take place.
Write Back (WB)
This is the last stage of our pipelined processor; at this stage the register write takes place,
i.e. the data is written back to the desired register in the register file.


     IF           ID          EX          DM            WB
                  IF          ID          EX            DM        WB
                              IF          ID            EX        DM          WB
                                          IF            ID        EX          DM          WB

                               Figure 2-1 Pipelined stages at each clock cycle




                            Figure 2-2 Instruction Execution with Pipelining

2.5 Pipelined Registers
The pipeline registers hold the data of each stage. Four pipeline registers are used, one between
each pair of stages; each register holds its data for one clock cycle and provides it to the
following components until the next clock cycle. Data moves between these registers much as
bits move through a shift register. Each instruction carries all the data required for its
completion with it through the stages, including the control signals. The four pipeline registers
are:
    1. IF/ID
    2. ID/EX
    3. EX/MEM
    4. MEM/WB
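The shift-register behaviour of these four registers can be sketched in a few lines of Python. This is an illustration only — the actual design is in VHDL, and the instruction names below are placeholders:

```python
# Illustrative sketch: the four pipeline registers advance their contents
# one stage on every clock edge, like a shift register.

from collections import deque

regs = {"IF/ID": None, "ID/EX": None, "EX/MEM": None, "MEM/WB": None}
order = ["IF/ID", "ID/EX", "EX/MEM", "MEM/WB"]

def clock_edge(regs, fetched):
    """Shift every register's contents one stage forward, then latch the
    newly fetched instruction into IF/ID."""
    for name in reversed(order[1:]):           # MEM/WB <- EX/MEM <- ID/EX
        prev = order[order.index(name) - 1]
        regs[name] = regs[prev]
    regs["IF/ID"] = fetched

program = deque(["sub", "and", "or", "add"])   # placeholder mnemonics
for cycle in range(4):
    clock_edge(regs, program.popleft() if program else None)

print(regs)
# {'IF/ID': 'add', 'ID/EX': 'or', 'EX/MEM': 'and', 'MEM/WB': 'sub'}
```

After four clock edges the oldest instruction has reached MEM/WB while the newest sits in IF/ID — exactly the staggered occupancy shown in the pipeline diagrams.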


                  IF/ID                ID/EX           EX/MEM         MEM/WB




                           Figure 2-3 Transfer of data between pipelined registers


2.6 Pipelined control Signals

Control signals are essential for carrying out an instruction correctly. They are generated by
the control unit based on the opcode (the opcode differentiates between instructions). The
control signals are passed on to the next stage on each clock cycle, just as the data is.




                            Figure 2-4 Control Signals Passing to the Next Stage


2.7 Hazards
Pipelining may create new issues, generally called hazards. These hazards are generally
divided into
    1. Data hazards
    2.   Control hazards

Data Hazards
Data hazards are dependencies of one piece of data on another. They arise when reads and
writes of data occur in the pipeline in a different order than specified in the program code.
A Read After Write (RAW) hazard occurs when, as written in the code, one instruction reads
a location after an earlier instruction writes new data to it, but in the pipeline the write
completes after the read.
A Write After Read (WAR) hazard is the opposite of a RAW hazard: in the code a write
occurs after a read, but the pipeline causes the write to happen first.
A Write After Write (WAW) hazard is a situation in which two writes occur out of order. We
normally only consider it a WAW hazard when there is no read in between; if there is, then we
have a RAW and/or WAR hazard to resolve, and by the time we've gotten that straightened out
the WAW has likely taken care of itself. [1]
In the case of the 5-stage RISC pipeline, if an instruction depends upon a register value
generated by its immediate predecessor, then when it comes to read the register value during the
‘register access’ pipeline stage it will incorrectly read the previous value of the register, as the
predecessor will still be in the ‘execute’ stage and will not have written its result back to the
register file. Data hazards are generally resolved using bypass (or forwarding) data paths. For
the 5-stage RISC pipeline, the results of the ‘execute’ and ‘memory access’ pipeline stages are
fed directly into the execute stage on the next cycle, bypassing the register file entirely. The
decode logic then uses a technique known as scoreboarding to keep track of which registers’
values should be read from the register file and which are currently in the pipeline and should
use the bypass paths.
However, not all data hazards may be resolved like this – if a load instruction is immediately
followed by an instruction using the result of the load, the second instruction will enter the
‘execute’ stage at the same time as the load instruction is in the ‘memory access’ stage,
performing the load. With this pipeline arrangement it is impossible to execute these
instructions simultaneously, as the second instruction requires the result of the load at the time
the load commences. As such, it is impossible to resolve this hazard without delaying the
second instruction. Often it is possible for the compiler to re-order instructions to remove this
hazard, but when this is not possible there are two basic strategies for dealing with this situation
in the processor:[2]
Software interlocking
The processor does nothing to resolve this conflict – it is entirely up to the compiler to
ensure this situation does not occur, and to insert no-ops into the instruction stream where
necessary. [2]


Hardware interlocking
The processor uses scoreboarding to determine when these hazards will arise, and inserts
NOPs as necessary into the pipeline itself. [2]


Consider the following piece of code
    (0) sub $2,$1,$3
    (1) and $12,$2,$5
    (2) or $13,$6,$2
    (3) add $14,$2,$2


In the above code, instructions (1), (2) and (3) depend on the result of (0), i.e. $2. The
writing of the actual result will still be in progress while the later instructions access the
same location, so they would have fetched a wrong value from the register file. The fetched
data would be invalid and would most probably yield wrong results for the subsequent
instructions.
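The dependency pattern described above can be detected mechanically. The sketch below (illustrative Python — the real design is VHDL) scans an instruction list like the one above for RAW dependencies on recently written destinations:

```python
# Illustrative sketch: find RAW dependencies in a short code fragment.
# Each instruction is (mnemonic, destination, [source registers]).

code = [
    ("sub", "$2",  ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or",  "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
]

def raw_hazards(code, window=3):
    """Return (producer, consumer) index pairs where a source register is
    written by an instruction still in flight (within `window` slots)."""
    hazards = []
    for i, (_, dest, _) in enumerate(code):
        for j in range(i + 1, min(i + 1 + window, len(code))):
            if dest in code[j][2]:
                hazards.append((i, j))
    return hazards

print(raw_hazards(code))   # [(0, 1), (0, 2), (0, 3)]
```

With a window of three in-flight instructions, the scan reports exactly the dependencies the text identifies: instructions 1, 2 and 3 all read $2 before instruction 0 has written it back.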




                          Figure 2-5 Data Dependencies among the instructions



2.8 Resolving Hazards

There are several possible techniques for resolving a hazard. In order of preference, they are:


    1. Forward. If the data is available somewhere, but just not where we want it, we can
        create extra data paths to "forward" the data to where it is needed. This is the best
        solution, since it doesn't slow the machine down and doesn't change the semantics of the
        instruction set. [1] All of the hazards occurring in the above code can be handled by
        forwarding.


    2. Add hardware. This is most appropriate to structural hazards; if a piece of hardware has
        to be used twice in an instruction, see if there is a way to duplicate the hardware. This is
        exactly what the example MIPS pipeline does with the two memory units [1].
    3. Stall. We can simply make the later instruction wait until the hazard resolves itself. This
        is undesirable because it slows down the machine, but may be necessary. Handling a
        hazard on waiting for data from memory by stalling would be an example here. Notice
        that the hazard is guaranteed to resolve itself eventually, since it wouldn't have existed
        if the machine hadn't been pipelined. By the time the entire downstream pipe is empty
        the effect is the same as if the machine hadn't been pipelined, so the hazard has to be
        gone by then [1].


Forwarding

Forwarding is a technique to resolve data hazards by using ALU results from different stages
before they are written back. All the results are made available to a multiplexer placed at the
ALU input, and a forwarding unit decides the multiplexer's selection based on the opcode and
the register addresses mentioned in the instruction.

    Consider the following piece of code

    (0) sub $2,$1,$3
    (1) and $12,$2,$5
    (2) or $13,$6,$2
    (3) add $14,$2,$2
    (4) sw $15,100($2)

In the above piece of code, (1), (2) and (3) depend on (0), i.e. $2. The required data can be
forwarded to the appropriate location, providing valid data to the execution unit; by doing so,
the instruction at the execution stage will yield correct and valid results.
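The decision logic of such a forwarding unit can be sketched as follows (illustrative Python, in the spirit of the classic MIPS forwarding unit; the field names are placeholders, not taken from this project's VHDL):

```python
# Illustrative sketch: choose the ALU-input multiplexer source for one
# operand by comparing it with the destinations in EX/MEM and MEM/WB.

def forward_select(src_reg, ex_mem, mem_wb):
    """Return 'EX/MEM', 'MEM/WB', or 'REGFILE' for one ALU source."""
    if ex_mem["reg_write"] and ex_mem["dest"] == src_reg and src_reg != "$0":
        return "EX/MEM"       # newest value, from one instruction ahead
    if mem_wb["reg_write"] and mem_wb["dest"] == src_reg and src_reg != "$0":
        return "MEM/WB"       # value from two instructions ahead
    return "REGFILE"          # no hazard: normal register-file read

ex_mem = {"reg_write": True, "dest": "$2"}    # sub $2,$1,$3 now in EX/MEM
mem_wb = {"reg_write": False, "dest": None}

# and $12,$2,$5 in EX: its first source $2 must come from EX/MEM
print(forward_select("$2", ex_mem, mem_wb))   # EX/MEM
print(forward_select("$5", ex_mem, mem_wb))   # REGFILE
```

Note the $0 guard: register $0 is hardwired to zero in MIPS, so it is never forwarded.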




                 Figure 2-6 Forwarding required data for solving dependencies


Stalling the Pipeline

This unit stalls the pipeline if there is no other way to remove a dependency. It simply
freezes the earlier stages, i.e. adds a bubble to the pipeline; in the meantime the required
result is written back. In effect an empty slot, also known as a bubble, is created in between.
Generally this task is performed by the HDU (Hazard Detection Unit). There is also a
software solution for creating stalls: the 'no operation' (NOP) instruction, which works like
a stall generated by the hardware. Adding NOPs can also resolve dependencies between
instructions, at the price of wasted clock cycles that make no change to the memories.
Creating stalls is an expensive solution, as it reduces the performance of the processor;
that is why stalls are avoided as much as possible.
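The classic load-use check an HDU performs can be sketched like this (illustrative Python; the field names are placeholders, not from this report's design): if the instruction in ID/EX is a load whose destination is a source of the instruction now being decoded, a bubble must be inserted.

```python
# Illustrative sketch: load-use stall detection in a hazard detection unit.

def must_stall(id_ex, if_id):
    """True when a one-cycle bubble must be inserted: the instruction in
    EX is a load, and the instruction in ID reads the loaded register."""
    return bool(id_ex["mem_read"]) and id_ex["dest"] in if_id["sources"]

id_ex = {"mem_read": True, "dest": "$2"}    # e.g. lw $2, 0($4) in EX
if_id = {"sources": ["$2", "$5"]}           # e.g. and $12,$2,$5 in ID

print(must_stall(id_ex, if_id))                               # True
print(must_stall({"mem_read": False, "dest": "$2"}, if_id))   # False
```

The second call shows why only loads trigger the stall: an ALU result in EX can still be forwarded in time, so no bubble is needed.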




                             Figure 2-7 Adding stall to the instruction flow


Control hazards
When an unconditional branch is decoded, the instruction following it in memory will already
be in the process of being fetched from memory. The processor can either choose to execute this
instruction (the ISA would then specify that the instruction immediately following a jump is
always executed) which is then termed as being in a branch delay slot, or it can nullify it and
allow a no-op to proceed down the pipeline. The advantage of using branch delay slots is that it
simplifies issue logic and can allow slightly better performance. However, branch delay slots
can cause issues with code compatibility, as the number of delay slots following a branch
depends on the pipeline organization – a new version of an architecture may require a different
number of delay slots than a previous architecture, yet still have to run code compiled for the
old architecture. The new processor would then have to emulate a different number of delay
slots, negating the original issue logic complexity advantages. A further problem arises when a
conditional branch is executed – if the branch condition is resolved in the ‘execute’ pipeline
stage then the new PC value will not be available until several cycles after the branch was
fetched, which can be a large penalty if the architecture uses a deep pipeline. This problem is
generally tackled using branch prediction. In the simplest case, all conditional branches can be
assumed to be not taken, meaning that the instruction fetch stage continues fetching instructions
as it would if the branch had not come up – this means there is a performance penalty for a
taken branch, but none for one that is not taken.
More sophisticated schemes include static branch prediction, where the branch opcode specifies
whether it should be predicted as taken or not taken, and dynamic branch prediction, where the
processor maintains some state on past behaviour of branches in order to guess whether a
branch will be taken or not. Modern dynamic branch predictors can achieve accuracy in excess
of 99%. [2]
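A minimal sketch of the 2-bit saturating-counter scheme commonly used for dynamic prediction (illustrative Python; not part of this report's design) shows how past behaviour steers future guesses:

```python
# Illustrative sketch: a 2-bit saturating-counter branch predictor.
# States 0-1 predict 'not taken'; states 2-3 predict 'taken'.

class TwoBitPredictor:
    def __init__(self):
        self.state = 1            # start weakly not-taken

    def predict(self):
        return self.state >= 2    # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True, True]   # a loop branch, mostly taken
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)

print(hits, "of", len(outcomes))   # 3 of 5
```

After warming up, the counter absorbs a single not-taken outcome without flipping its prediction — the property that makes 2-bit counters better than 1-bit ones for loop branches.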


2.9 Superscalar Processor
In a superscalar pipeline, multiple execution units are placed in the processor; with the
addition of some more hardware, the processor becomes capable of carrying two instructions
at once, almost doubling its throughput. It is an efficient way to enhance processor
performance while adding very little to the processor.
The CPI (clocks per instruction) of a superscalar processor is typically lower than that of a
simple pipelined processor, as it carries out more than one instruction per clock cycle.
Each functional unit is not a separate CPU core but an execution resource within a single CPU
such as an arithmetic logic unit, a bit shifter, or a multiplier. While a superscalar CPU is
typically also pipelined, they are two different performance enhancement techniques. It is
theoretically possible to have a non-pipelined superscalar CPU or a pipelined non-superscalar
CPU. The superscalar technique is traditionally associated with several identifying
characteristics. Note these are applied within a given CPU core. [1]


          IF           ID           EX           DM            WB
          IF           ID           EX           DM            WB
                       IF           ID           EX            DM           WB
                       IF           ID           EX            DM           WB
                                    IF           ID            EX           DM      WB
                                    IF           ID            EX           DM      WB


                            2.8 Instruction Execution with Superscalar Pipelining




The simplest processors are scalar processors. Each instruction executed by a scalar processor typically manipulates one or two data items at a time. By contrast, each instruction executed by a vector processor operates simultaneously on many data items. An analogy is the difference between scalar and vector arithmetic. A superscalar processor is a mixture of the two: each instruction processes one data item, but there are multiple redundant functional units within each CPU, so multiple instructions can process separate data items concurrently.
A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle. But merely processing multiple instructions concurrently does not make an architecture superscalar, since pipelined, multiprocessor, and multi-core architectures also achieve that, with different methods. Even given infinitely fast dependency-checking logic on an otherwise conventional superscalar CPU, an instruction stream with many dependencies would still limit the possible speedup; the degree of intrinsic parallelism in the code stream thus forms a second limitation. Superscalar architectures (as opposed to standard, scalar architectures) increase performance by executing multiple instructions in parallel, thereby lowering CPI below 1. [3]


Superscalar processors are classified into two groups according to the order in which they issue instructions: in-order and out-of-order superscalar processors.


In order
In an in-order superscalar processor, the hardware executes as many of the next instructions (up to two) as it can, after making sure that they are independent. Consider the following code on a "modest superscalar processor" (in-order; can execute one R-type and one I-type together):
    1. lw $6, 36($2)
    2. add $5, $6, $4
    3. lw $7, 1000($5)
    4. sub $9, $12, $8
    5. sw $5, 200($6)
    6. add $3, $9, $9
    7. and $11, $5, $6
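As a sketch of this in-order pairing rule, the model below (our own simplification, not the project's exact issue logic; it checks only the R-type/I-type mix and a read-after-write hazard between the paired instructions) issues the seven instructions above in five cycles:

```python
# Illustrative in-order dual-issue model: each cycle, issue the next
# instruction, and pair it with the one after it only if the pair is one
# R-type plus one I-type and the second does not depend on the first.

# (type, destination register, source registers) for the code above
code = [
    ("I", 6, [2]),        # 1. lw  $6, 36($2)
    ("R", 5, [6, 4]),     # 2. add $5, $6, $4
    ("I", 7, [5]),        # 3. lw  $7, 1000($5)
    ("R", 9, [12, 8]),    # 4. sub $9, $12, $8
    ("I", None, [5, 6]),  # 5. sw  $5, 200($6)  (no destination register)
    ("R", 3, [9, 9]),     # 6. add $3, $9, $9
    ("R", 11, [5, 6]),    # 7. and $11, $5, $6
]

def can_pair(first, second):
    t1, dest1, _ = first
    t2, _, srcs2 = second
    types_ok = {t1, t2} == {"R", "I"}         # one R-type and one I-type
    no_raw = dest1 is None or dest1 not in srcs2  # second must not read first's result
    return types_ok and no_raw

def schedule_in_order(instrs):
    cycles, i = [], 0
    while i < len(instrs):
        if i + 1 < len(instrs) and can_pair(instrs[i], instrs[i + 1]):
            cycles.append([i + 1, i + 2])     # issue two instructions this cycle
            i += 2
        else:
            cycles.append([i + 1])            # issue one instruction this cycle
            i += 1
    return cycles
```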


Out of Order
In an out-of-order processor, the hardware finds instructions (not necessarily consecutive) that are independent, and starts executing an instruction as soon as all of its dependences are satisfied, even if prior instructions are stalled. In the following piece of code, a stall would otherwise occur between (2) and (3).



    1. lw $6, 36($2)
    2. add $5, $6, $4
    3. lw $7, 1000($5)
    4. sub $9, $12, $8
    5. sw $5, 200($6)
    6. add $3, $9, $9
    7. and $11, $5, $6
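A dataflow view makes the out-of-order idea concrete. Under the simplifying assumptions that every unit has one-cycle latency, issue width is unlimited, and only read-after-write dependences matter (our model, not this project's hardware), each instruction starts one cycle after its last producer, and the seven instructions above complete in three cycles:

```python
# Illustrative dataflow scheduling: an instruction starts as soon as all
# producers of its source registers have finished.

# (destination register, source registers) for the code above
code = [
    (6, [2]),        # 1. lw  $6, 36($2)
    (5, [6, 4]),     # 2. add $5, $6, $4
    (7, [5]),        # 3. lw  $7, 1000($5)
    (9, [12, 8]),    # 4. sub $9, $12, $8
    (None, [5, 6]),  # 5. sw  $5, 200($6)
    (3, [9, 9]),     # 6. add $3, $9, $9
    (11, [5, 6]),    # 7. and $11, $5, $6
]

def dataflow_cycles(instrs):
    """Earliest start cycle of each instruction (1-based, unit latency)."""
    done_cycle = {}   # register -> cycle in which its latest producer completes
    start = []
    for dest, srcs in instrs:
        s = 1 + max((done_cycle.get(r, 0) for r in srcs), default=0)
        start.append(s)
        if dest is not None:
            done_cycle[dest] = s
    return start

cycles = dataflow_cycles(code)   # the whole block finishes in max(cycles) cycles
```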




2.10 Pipelined Performance
Data hazards and branch hazards prevent CPI from reaching 1.0, but forwarding and branch prediction get it close. To improve performance further, we must either reduce the cycle time (superpipelining) or reduce CPI below 1.0. The execution time of a piece of code can be determined by the following relation:
                Execution time = Number of instructions * CPI * cycle time
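The relation can be checked directly; the values below (17 instructions, CPI of 1, 50 ns clock, and a hypothetical dual-issue CPI of 0.5) are chosen for illustration:

```python
def execution_time(num_instructions, cpi, cycle_time_ns):
    """Execution time = number of instructions * CPI * cycle time."""
    return num_instructions * cpi * cycle_time_ns

# 17 instructions at CPI = 1 with a 50 ns clock
t_simple = execution_time(17, 1.0, 50)    # 850 ns
# a dual-issue version sustaining CPI = 0.5 on the same code
t_dual = execution_time(17, 0.5, 50)      # 425 ns
```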

Amdahl’s Law
According to Amdahl’s law, the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used. Mathematically, the law can be expressed as:
     Execution time after improvement =
        (Execution time affected by improvement / amount of improvement)
        + Execution time unaffected
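The law's limiting effect is easy to see numerically; the figures below are an illustrative example, not measurements from this project:

```python
def time_after_improvement(t_affected, t_unaffected, speedup):
    """Amdahl's law: only the affected fraction of the run time is sped up."""
    return t_affected / speedup + t_unaffected

# If 80 ns of a 100 ns program benefits from a 2x improvement, the
# overall time drops to 60 ns (a 1.67x speedup), not to 50 ns (2x).
t = time_after_improvement(80, 20, 2)
```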



2.11 Speculative execution
In computer science, speculative execution is the execution of code whose result may not be needed. In the context of functional programming, the term "speculative evaluation" is used instead.
Generally, statements and definitions in a program can be divided into three types:
       Things which must be run and are mandatory
       Things which do not need to be run because they are irrelevant
       Statements which cannot be proven to be in either of the first two groups
The first group does not benefit from speculative execution because its members need to run anyway.

The second group can be quietly discarded much like lazy evaluation would discard its
members. [4]
The third group is the target of speculative evaluation, as its members can be run concurrently
with the mandatory computations until they are needed or shown to be of the second group; this
concurrency means that speculative execution can be parallelized.
Speculative execution is a performance optimization. It is only useful when early execution
consumes less time and space than later execution would, and the savings are enough to
compensate, in the long run, for the possible wasted effort of computing a value which is never
used. [4]
Modern pipelined microprocessors use speculative execution to reduce the cost of conditional branch instructions. When a conditional branch instruction is encountered, the processor guesses which way the branch is most likely to go (this is called branch prediction) and immediately starts executing instructions from that point. If the guess later proves to be incorrect, all computation past the branch point is discarded. The early execution is relatively cheap because the pipeline stages involved would otherwise lie dormant until the next instruction was known. However, wasted instructions consume CPU cycles that could otherwise have delivered performance, and on battery-powered devices those cycles consume power. There is always a penalty for a mispredicted branch. Many microprocessors (such as the Intel Pentium II and successors) have a conditional move instruction: an operation that moves data, usually between a register and memory, only if a condition is met. There is then no branch at all, and the misprediction penalty is avoided. [4]




3 Project Design

3.1 Methodology

Instruction Set
We have implemented the following mandatory instructions; their specifications and working are described below.
R-format
This instruction format can perform the ALU and register operations listed below.
        Add
        Subtract
        AND
        OR
        XOR
        SLT
        SGT
        SEQ
        SNE
        SGE
        SLE




31             26   25          21     20               16   15        11   10 6   5           0

     Opcode                RS                 RT                  RD                    Func


OPCODE is used to guide the control unit and distinguish instruction types.
RS is the Source register address (5-bits).
RT is the Target register address (5-bits).
RD is the Destination register address (5-bits).
Function field distinguishes between ALU operations.




EXAMPLE:


                                                   16
ADD $R1,$R2,$R3
SUB $R1,$R2,$R3
AND $R1,$R2,$R3
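The R-format encoding above can be checked with a short script. The field layout (opcode, RS, RT, RD, a zero field, function) is taken from the diagram; the helper name is ours, and it reproduces row 1 of the instruction summary table below.

```python
def encode_r(rs, rt, rd, func, opcode=0):
    """Pack an R-format instruction into its 32-bit binary string:
    opcode(6) | rs(5) | rt(5) | rd(5) | zeros(5) | func(6)."""
    word = (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | func
    return format(word, "032b")

# ADD $R7, $R1, $R2 with the ADD function code 100000
add_r7_r1_r2 = encode_r(rs=1, rt=2, rd=7, func=0b100000)
```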




LW (Load Word)
The LW instruction copies a word from the specified address of the data memory into the target register (RT). The address is the sum of the base address held in RS and the offset specified in the offset field of the instruction.


31            26     25            21     20            16   15                         0

     Opcode                  RS                   RT                   OFFSET


SW (Store Word)
The Store Word instruction copies the data from the register specified in the RT field into the data memory. Just as with load word, register RS holds the base address, and the address is calculated in the same way.


31            26     25            21     20            16   15                         0

     Opcode                  RS                   RT                   OFFSET


ALU Immediate
The ALU immediate instructions perform arithmetic and logical operations on immediate values. The contents of RS are combined with the immediate value, and the result is stored in the register specified by RT.


Example:


ADDI R0,R5,5;
SUBI R0,R2,54;




31            26     25            21     20            16   15                         0

     Opcode                  RS                   RT               Immediate Value



Jumps
Jump is a type of unconditional branch; the jump instruction loads the jump address into the program counter.


Example:


0 ADDI R0,R5,5;
1 ADDI R0,R5,52;
2 J 0;




31            26                                         15                             0

     Opcode                                                         Jump Address


Branch
A branch jumps to a specific line number when a specific condition is fulfilled. Branches have the following categories:


Branch If Equal (branch if contents of RS = RT)
Branch If Not Equal (branch if contents of RS != RT)




31            26    25           21       20        16   15                             0

     Opcode                RS                  RT                  Branch Address




Li (Load Immediate)
The load immediate instruction loads an immediate value into the register file; the destination address is specified by the RT field.


31            26    25           21       20        16   15                             0

     Opcode                RS                  RT                  Immediate Value



Summary of Instructions


 SNO                      Instruction                Binary Format example

    1.                    RFORMAT            000000 00001 00010 00111 00000 100000

    2.                        LI             110000 00000 00011 0000000000000011

    3.                         J             000010 00000 00000 0000000000000000

    4.                      BEQ              000100 00000 00000 0000000000000000

    5.                      BNE              000101 00000 00000 0000000000000000

    6.                      ADDI             001000 00001 00010 00111 00000 100000

    7.                      SUBI             001010 00001 00010 00111 00000 100000

   8.                       ANDI             001100 00001 00010 00111 00000 100000

   9.                        ORI             001101 00001 00010 00111 00000 100000

   10.                      XORI             001110 00001 00010 00111 00000 100000

   11.                       LW              100011 00000 00100 0000000000000000

   12.                       SW              101011 00000 00100 0000000000000000

   13.                      NOP              111111 00000 00000 0000000000000000

   14.                      MOV              110011 00001 00010 00111 00000 100000
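The I-format rows of the table can be decoded the same way; the following is a minimal sketch (the helper name is ours, and the field widths are assumed from the format diagrams above):

```python
def decode_i(bits):
    """Split a 32-bit I-format string into (opcode, rs, rt, immediate)."""
    b = bits.replace(" ", "")
    return (int(b[0:6], 2), int(b[6:11], 2), int(b[11:16], 2), int(b[16:32], 2))

# Row 2 of the table: LI with RT = 3 and immediate value 3
opcode, rs, rt, imm = decode_i("110000 00000 00011 0000000000000011")
```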




3.2 Architecture Overview




            Figure 3-1 Architecture Overview Diagram (Bus description)




3.3 Design Description
Following are the modules constituting the product to be developed. Please note that we are
documenting only the salient properties and methods of each module to keep the description
simple and more readable.


3.3.1 Register File
The register file consists of 32 registers, each 32 bits in length. RS, RT, and RD are 5-bit addresses; the data at addresses RS and RT is available on RS Out and RT Out (32 bits each). The data available on Data In is written to the address RD (5 bits).
ALU (Arithmetic and Logic Unit)
The ALU can perform all the arithmetic and logical operations (ADD, SUB, AND, OR, XOR, SLT, SGT, etc.).


3.3.2 Data Memory
The data memory holds the data to be processed (32 bits). A word can be loaded from or stored into this memory depending on the instruction.
Instruction Memory
The instruction memory holds the instructions to be processed (32 bits).


3.3.3 Forwarding Unit
The forwarding unit, essentially a data hazard resolution unit, takes its decisions from the opcodes and register addresses available in the ID/EX, EX/MEM, and MEM/WB pipeline registers.
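The selection such a unit makes for one ALU operand can be sketched in the classic textbook form, where the younger EX/MEM result takes priority over MEM/WB. This is an illustration, not this project's exact logic; note that this design treats R0 as a general register, so no zero-register exclusion is applied.

```python
def forward_select(rs, ex_mem_rd, ex_mem_regwrite, mem_wb_rd, mem_wb_regwrite):
    """Choose the source for an ALU operand reading register rs.
    The younger result in EX/MEM takes priority over MEM/WB."""
    if ex_mem_regwrite and ex_mem_rd == rs:
        return "EX/MEM"      # forward the just-computed ALU result
    if mem_wb_regwrite and mem_wb_rd == rs:
        return "MEM/WB"      # forward the value about to be written back
    return "REGFILE"         # no hazard: read the register file normally
```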


3.3.4 Equator
This component takes a decision based on the two words available at its inputs and the opcode. It is used for resolving branches; jumps are also handled through this component.


3.3.5 Control Unit
The control unit provides all the control signals required by an instruction; the control signals are generated from the opcode.


3.3.6 ALU Control Unit
The ALU control unit provides the function code to the ALU. If an instruction is R-format, the function field of the instruction is passed to the ALU; otherwise the corresponding function is generated by the ALU control unit.

Branch and Jump issues
The program counter holds the address of the current instruction. In this design one program counter handles odd addresses while the other handles even addresses, back to back. We have restricted branch and jump instructions to the first program counter only, as the second works on the reference of the first. A major problem occurs when a BR/J instruction appears in the second slot. As a solution, we write the J/BR instruction twice, so that at least one copy appears in the first slot; in the second slot the J/BR instruction is discarded.
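The duplication trick can be sketched as a small assembler pass (our illustrative model, not the actual toolchain): because the two copies occupy consecutive slots, one of them is always at an even address, i.e. the first program counter's slot.

```python
def duplicate_jumps(program, is_jump):
    """Emit every branch/jump instruction twice so that one copy is
    guaranteed to fall in the first (even-address) issue slot."""
    out = []
    for ins in program:
        out.append(ins)
        if is_jump(ins):
            out.append(ins)   # two consecutive slots always cover one even, one odd
    return out

prog = ["li", "li", "j", "add", "bne"]
expanded = duplicate_jumps(prog, lambda i: i in ("j", "bne"))
```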

                        Image Removed Intentionally

                                            3.2 Program Counter View

Data dependencies
In this implementation, data dependencies matter a lot: if NOPs and stalls are generated, the performance of this architecture can fall below that of a typical pipelined processor. In this implementation a single instruction might depend on any of the last six to seven instructions.
Example:


        (1) Li R0,56;
        (2) Li R1,100;
        (3) ADD R2,R1,R0
        (4) ADD R3,R2,R2
        (5) LW R2,0(R3)
        (6) ADD R4,R3,R3


In the above example:
(3) is dependent on (1) & (2)
(4) is dependent on (3)
(5) is dependent on (3) & (4)
(6) is dependent on (4)
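These dependences can be recovered mechanically. The sketch below is our own helper (not part of the design); it tracks both read-after-write and write-after-write dependences, which is why (5) also depends on (3), the previous writer of R2.

```python
# (destination register, source registers) for instructions (1)-(6) above
code = {
    1: ("R0", []),             # Li  R0,56
    2: ("R1", []),             # Li  R1,100
    3: ("R2", ["R1", "R0"]),   # ADD R2,R1,R0
    4: ("R3", ["R2", "R2"]),   # ADD R3,R2,R2
    5: ("R2", ["R3"]),         # LW  R2,0(R3)
    6: ("R4", ["R3", "R3"]),   # ADD R4,R3,R3
}

def find_dependences(instrs):
    last_writer, deps = {}, {}
    for i, (dest, srcs) in instrs.items():
        d = {last_writer[r] for r in srcs if r in last_writer}   # read-after-write
        if dest in last_writer:
            d.add(last_writer[dest])                             # write-after-write
        deps[i] = d
        last_writer[dest] = i
    return deps
```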

According to Wikipedia:
        The instructions a = b + c; d = e + f can be run in parallel because neither result depends on other calculations. However, the instructions a = b + c; d = a + f might not be runnable in parallel, depending on the order in which the instructions complete while they move through the units.


This case matches (3) & (4): in our ModelSim simulations (3) & (4) run in parallel, and the forwarding unit can handle almost all the dependencies occurring in this case.


The following modifications were made to the architecture to resolve the dependencies: the multiplexers are provided with all the possible values that have not yet been written back but are usable, and the forwarding unit generates the select signals for these multiplexers based on the opcode and the RD available at each stage.



                      Image Removed Intentionally




                              3.3 Inputs at the Multiplexers at ALU
To resolve the dependencies, all ALU results and offsets are fed back to multiplexers at the ALU(s); the forwarding unit then decides, based on the register address and opcode, which value to forward. The result of one ALU may be needed at the second ALU.


The problem of the base address dependency can be solved with this technique, but to resolve the dependency at the data input of the data memory we need to hold the MEM/WB data for one extra cycle. This does not affect the stages or regular data writing, but it is very useful in resolving data dependencies.
This helps in resolving the data dependencies between Li & R-format, R-format & R-format, Li & LW, and LW & R-format.




3.4 Data Forwarding at MEM/WB




4 Implementation
We have implemented the suggested design in VHDL (VHSIC Hardware Description Language) and simulated it on ModelSim Xilinx Edition 6.7.


4.1 Development Stages
The following are the discrete phases we went through incrementally to realize our product in the given time:


4.1.1 Development of components
We started the project by creating and testing all the components in ModelSim Xilinx Edition 6.7. Each component was created separately and then tested.



4.1.2 Development of a MIPS32 pipelined processor
After all the components were tested, a simple MIPS32 pipelined processor was developed. Solutions for all possible hazards, along with other techniques, were applied to the architecture. After running the entire program successfully on this architecture, we moved on to the next stage: enhancements to this processor.



4.1.3 Enhancements in MIPS32 Architecture
As already described, some of the modules depended critically on other modules and could not be unit-tested without communicating with them properly. After the successful completion of the simple MIPS32 pipelined processor, we duplicated all the components except the local memories. The size of all the pipeline registers was doubled, and the number of multiplexers and the size of the functional units were increased according to the requirements.




5 Evaluation & Performance
We have focused on thorough testing throughout the design and implementation phases. While testing the … … …



5.1 Unit Testing
Each module in the application was tested while being developed to confirm its adherence to the
related requirements. This testing was quite successful.


The following code was simulated on the processor using ModelSim (for the Fibonacci series); the resulting data memory contents are shown below in the Data Memory snapshot of the ModelSim VHDL simulation.




                       4.1 Data Memory View on Model Sim Simulation


Test code 2 (basic test for dependencies)
    (1) Li $0,1
    (2) Li $1,3
    (3) Add $2,$1,$0
    (4) Add $3,$2,$1
    (5) Add $4,$1,$3


Equivalent machine code
11000000000000000000000000000001
11000000000000010000000000000011
00000000001000000001000000100000
00000000010000010001100000100000
00000000001000110010000000100000


The above code was simulated and the results were found to be accurate: two instructions are carried out in parallel, regardless of the order of the instructions. A snapshot of the ModelSim VHDL simulation of the processor is shown below; it clearly indicates that the 5 instructions were executed in 3 clock cycles, excluding the initial fill cycles. In the above piece of code, (3) is dependent on (1) and (2), (4) is dependent on (3) and (2), and (5) is dependent on (2) and (4). All the data is made valid through the forwarding unit.




                      4.2 Register Memory View

Test code 3 (branches)


   (1) Li $0,1
   (2) Li $1,3
   (3) Li $2,9
   (4) Add $1,$0,$1
   (5) Bne $1,$2,4


Machine code

11000000000000000000000000000001
11000000000000010000000000000011
11000000000000100000000000001001
00000000001000000000100000100000
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000
00010100001000100000000000000011
00010100001000100000000000000011



In the above piece of code, instruction (4) is dependent on (1) and (2), and (5) is dependent on (4) and (3). The branch has to wait for these values to be written back; only then is the branch resolved correctly. To enforce this wait, NOPs are added to the actual machine code.
This branch case does have a solution similar to the data forwarding method at the ALU, but since branches are comparatively infrequent, in our opinion such a solution might be expensive in terms of hardware and might create extra delays, leading to no real improvement.




                            4.3 Program Counter View after Branch



Test Code 4 (Bubble Sorting)

Start:
          Li $0,6
          Li $1,9
          Li $2,19
          Li $4,1
Startx:
          Slt $5,$0,$1
          Beq $5,$4,swap1
Start2:


          Slt $6,$1,$2
          Beq $6,$4,swap2


Swap1:
          Mov $31,$1
          Mov $1,$0
          Mov $0,$31
          J Start2
Swap2:
          Mov $30,$2
          Mov $2,$1
          Mov $1,$30
          Li R5,0
          Li R6,0
          J Startx


Machine Code
11000000000000000000000000000110
11000000000000010000000000001001
11000000000000100000000000010011
11000000000001000000000000000001
11111100000000000000000000000000


00000000000000010010100000101010
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000
00010000101001000000000000010110
00010000101001000000000000010110
11111100000000000000000000000000
00000000001000100011000000101010
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000
00010000110001000000000000100001
00010000110001000000000000100001
11001100000000011111100000100000
11001100001000000000100000100000
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000
11001100001111110000000000100000
00001000000000000000000000001101
00001000000000000000000000001101
11001100000000101111000000100000
11001100001000010001000000100000
11000000000001010000000000000011
11000000000001100000000000000011
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000
11111100000000000000000000000000


11111100000000000000000000000000
11001100001111100000100000100000
00001000000000000000000000000110
00001000000000000000000000000110




                    4.4 Register file view after simulation of Sorting Program




5.2 Results
After thorough testing of the system, we proceeded to investigate our implemented techniques.


5.2.1 Comparison of the simple and enhanced versions
All the registers and multiplexers have been replicated (used twice), except for the memories, i.e. the instruction memory, data memory, and register file; in these, only the number of inputs and outputs has been changed.




              Simple Pipelined                           Pipelined With Multiple Execution
CPI ≈ 1                                           CPI < 1
Memory = 500 X 32 bits                            Memory = 500 X 32 bits
Inst Memory = 500 X 32 bits                       Inst Memory = 500 X 32 bits
Clock Cycle=50 ns                                 Clock Cycle=50 ns


For the following snippet of code for Fibonacci series




        Li R0,0
        Li R1,1
        Li R7,0
        Li R8,1
        Lw R0,0(R7)
Here:
        Add R7,R7,R8
        MOV R2,R1
        Lw R1,0(R7)
        Add R1,R0,R1
        Mov R0,R2
        J Here
Equivalent machine code, adjusted with NOPs. The actual code was 11 lines and the converted code is 17 lines (6 NOPs added). The MOV instruction has to wait (if the following instructions are dependent) for at least 3 clock cycles (6 NOPs). The following code has been rearranged and adjusted for better performance.




11000000000000000000000000000000 Li R0,0
11000000000000010000000000000001 Li R1,1
11000000000001110000000000000000 Li R7,0
11000000000010000000000000000001 Li R8,1
10001100111000000000000000000000 Lw R0,0(R7)
00000001000001110011100000100000 Add R7,R7,R8
11001100001000010001000000100000 MOV R2,R1
11111100000000000000000000000000 NOP
11111100000000000000000000000000 NOP
11111100000000000000000000000000 NOP
11111100000000000000000000000000 NOP
11111100000000000000000000000000 NOP
11111100000000000000000000000000 NOP
10001100111000010000000000000000 Lw R1,0(R7)
00000000000000010000100000100000 Add R1,R0,R1
11001100001000100000000000100000 Mov R0,R2
00001000000000000000000000000101 J Here
00001000000000000000000000000101 J Here



Simple Pipelined Processor
For First four Clock Cycles (4 X 50ns) =200 ns
For Rest Of Code          (17 X 50ns)=850 ns
Total Estimated Time         200 + 850 =1050 ns (for One Loop)


Pipelined Processor with Multiple Execution Unit
For First four Clock Cycles (4 X 50ns) =200 ns
For Rest of Code         (17 X 50ns)/2=425 ns
Total Estimated Time         200 + 425 =625 ns (for One Loop)




For this snippet of code, the pipelined processor with multiple execution units is about 1.7 times faster overall (and 2 times faster on the loop body) than the simple pipelined processor.
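The estimate above can be reproduced directly; the numbers come from this section, and the calculation shows why the pipeline-fill cycles dilute the loop body's 2x speedup slightly:

```python
cycle_ns = 50
fill = 4 * cycle_ns                 # first four clock cycles (pipeline fill)
body = 17 * cycle_ns                # 17 instructions, including NOPs

t_simple = fill + body              # 1050 ns per loop, simple pipeline
t_dual = fill + body // 2           # 625 ns per loop, two instructions per cycle
speedup = t_simple / t_dual         # ~1.68x overall, 2x on the body alone
```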


Synthesis Report
======================================================================
*                                      Final Report
*
======================================================================


Device utilization summary:
---------------------------


Selected Device : 3s5000fg1156-4


    Number of Slices:                               31    out of   33280        0%
    Number of Slice Flip Flops:                     29    out of   66560        0%
    Number of 4 input LUTs:                         53    out of   66560        0%
    Number of bonded IOBs:                            1   out of      784       0%
    Number of GCLKs:                                  1   out of        8      12%




======================================================================


Timing Summary:
---------------
Speed Grade: -4


     Minimum period: 7.789ns (Maximum Frequency: 128.386MHz)



  Minimum input arrival time before clock: No path found
  Maximum output required time after clock: 6.271ns
  Maximum combinational path delay: No path found


=====================================================================




References
[1] Mike Johnson, Superscalar Microprocessor Design, Prentice-Hall.
[2] Ian Caulfield, University of Cambridge Computer Laboratory, 15 JJ Thomson Avenue, Cambridge CB3 0FD, United Kingdom. http://www.cl.cam.ac.uk/
[3] Steven McGeady et al., "Performance Enhancements in the Superscalar i960MM Embedded Microprocessor," ACM Proceedings of the 1991 Conference on Computer Architecture (Compcon), 1991, pp. 4-7.
[4] Sorin Cotofana, Stamatis Vassiliadis, "On the Design Complexity of the Issue Logic of Superscalar Machines," EUROMICRO 1998.




Appendix A: Project Timeline


                                                              DATE                   June-2009




                                                                              TOTAL NUMBER
 PROJECT ID        B09-DP-12                                                      OF WEEKS IN
                                                                                         PLAN



 TITLE       Enhancements in MIPS32 Architecture using multiple Execution Units



          STARTING
 No.                                   DESCRIPTION OF MILESTONE                            DURATION
             WEEK
  1

  2

  3

  4

  5

  6

  7

  8

  9

 10




* You can provide Gantt chart instead of filling this form, if you like





				