

Improving Performance

The basic processor design (MIPS) that we have considered in this course is certainly functional, but is by no means optimal

There are a variety of techniques we can use to improve the performance of our machine

We have already seen one method: caches
– These can help to speed up memory accesses

Other techniques can be used to
– Improve the processor's throughput of instructions
– Execute more than one instruction at a time
– Predict what the machine will do next

We will outline how these techniques work, and what benefits they bring to the machine and its performance

Dr Iain Styles, School of Computer Science December 2006



A non-pipelined approach to washing-up

The basic idea behind pipelining is that we don't always have to wait for something (an instruction) to finish completely before starting the next thing (another instruction)

Consider a domestic analogy: washing up
– In this analogy, plates/cups/knives/forks take the role of instructions

In the washing-up process, we can identify four specific operations that must be done to each item:
1) Take the plate from the pile of dirty plates
2) Wash the plate and put it on the drainer
3) Take the plate from the drainer and dry it
4) Take the dried plate and place it in the appropriate cupboard

In a non-pipelined approach, we would tackle the washing up in the following way:
– First we take a plate from the pile of dirty plates
– We wash it and place it on the drainer
– We then take the plate from the drainer and dry it
– Finally, we put the plate in the cupboard
– Then, and only then, do we go back to the next plate

This is a rather slow approach to doing the washing-up, yet this is exactly what we do in our non-pipelined processor:
– Fetch instruction
– Decode instruction
– Execute instruction
– Fetch next instruction

It is not difficult to conceive of a better scheme. Pipelining does introduce extra complexity (in the case of the washing up, some extra people), but it can lead to tremendous performance benefits



Pipelining the washing-up

In a pipelined approach to the washing-up, we would work in the following way:
– Person A takes plate 1 from the dirty pile, washes it, and puts it on the drainer
– Person B takes plate 1 from the drainer and dries it, whilst person A is washing plate 2
– Person C takes plate 1 from person B and puts it away. Person B takes plate 2 from the drainer and dries it, whilst person A fetches plate 3 from the pile

This is referred to as a three-stage pipeline

Each plate takes the same amount of time as before to be processed, but three times as many plates can be processed in the same time period!


Pipelining MIPS instructions

It is no harder to pipeline MIPS instructions than it is to pipeline the washing up

We will consider a five-stage pipeline:
1) Fetch instruction from memory
2) Decode instruction and read registers (these can happen simultaneously in MIPS)
3) Execute the operation or calculate memory address
4) Access memory if required
5) Write result into a register if required

The important thing is that different instructions can be at different stages in the pipeline simultaneously

The instructions are allowed to overlap in the same way that plates in our washing-up analogy can be at different stages of the process
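The throughput gain from overlapping can be illustrated with a small timing model. This is an illustrative sketch, not part of the original lecture; it assumes every stage takes exactly one time unit:

```python
def sequential_time(n_items, n_stages):
    """Non-pipelined: each item passes through every stage
    before the next item is started."""
    return n_items * n_stages

def pipelined_time(n_items, n_stages):
    """Pipelined: the first item takes n_stages steps to emerge,
    then one further item completes on every subsequent step."""
    return n_stages + (n_items - 1)

# Washing up 30 plates with the three-stage pipeline (wash, dry, put away)
print(sequential_time(30, 3))  # 90 time units without pipelining
print(pipelined_time(30, 3))   # 32 time units with pipelining
```

Each plate still takes three time units of work; the speed-up comes entirely from the overlap, and the same arithmetic applies to the five-stage MIPS pipeline.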



How MIPS instructions are executed

Without pipelining:

[Diagram: lw, lw, add executed one after another — each instruction passes through IFetch, Decode, ALU, Data and Reg before the next one begins]

With pipelining:

[Diagram: lw, lw, add overlapped — each instruction begins its IFetch as soon as the previous instruction moves on to Decode]


Problems with pipelining

In code where instructions are executed sequentially, pipelining is a great way of increasing performance
– We can just keep putting new instructions in the pipeline

If we have branches though, there is a problem
– We can't start executing the instruction after the branch as we don't know which one it will be!
– The pipeline is said to stall

At a branch we have a choice of two possible instructions to execute next
– If the branch is not taken, the instruction we want is the next in the program sequence
– If the branch is taken, we must first compute the branch address before we know which instruction we must execute

We need some way of predicting which instruction goes next

Branch Prediction

There are two simple choices we could make with regard to branch prediction

We could assume that, by default, branches are not taken
– In this case, we allow the CPU to just execute the next instruction in the code
– The result of the branch arrives some time later
– If the branch is not taken, then we can just carry on, since we're already executing the right code
– If the branch is taken, then we've executed the wrong piece of code
– We must clear (flush) the pipeline of the incorrect instructions and start executing the right ones

The obvious alternative is to assume that branches are always taken
– This is not quite so simple, as we don't know where the next instruction is: we must first compute the branch address
– If we add specific hardware for computing branch addresses, then we can usually do this with only one cycle of delay to the pipeline
– We can then start executing the branch-target instruction whilst awaiting the result of the branch
– Of course, if our guess is wrong, we must flush the pipeline and fetch the right instruction
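The cost of guessing wrongly can be quantified with a standard back-of-the-envelope CPI model. The figures below are illustrative assumptions, not numbers from the lecture:

```python
def effective_cpi(base_cpi, branch_freq, mispredict_rate, penalty):
    """Average cycles per instruction when a fraction branch_freq of
    instructions are branches and each misprediction costs `penalty`
    extra cycles (the flushed pipeline stages)."""
    return base_cpi + branch_freq * mispredict_rate * penalty

# Assumed figures: 20% of instructions are branches, 3-cycle flush penalty
print(effective_cpi(1.0, 0.20, 1.0, 3))  # no prediction, always flush: ~1.6
print(effective_cpi(1.0, 0.20, 0.1, 3))  # a 90%-accurate predictor: ~1.06
```

Even a modest predictor recovers most of the lost throughput, which is why every serious pipelined CPU has one.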



Two approaches to branch prediction

Assume branches are never taken:

[Diagram: lw, beq, add — the add is fetched immediately after the beq; the result of the branch is known at the beq's ALU stage, saving two cycles when the guess is right]

Assume branches are always taken:

[Diagram: lw, beq, or — a pipeline bubble appears because the branch address is not known until the beq's ALU stage]


More sophisticated branch prediction

The all-or-none approach to branch prediction is rather simplistic

Some machines use dynamic branch prediction, which looks at the context in which a branch is used
– Example: a branch which breaks out of a loop when the termination condition is met is far more likely to not be taken than it is to be taken

Other approaches are statistical
– The CPU keeps a history for each branch and uses it to predict future behaviour
– This gets it right about 90% of the time

In delayed branching, the code is reordered so that the branch can be computed as early as possible
– We can then know the result of the branch in advance
– This only works if the branch does not depend on the preceding instruction
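A common form of the per-branch history scheme is a two-bit saturating counter. The sketch below is an illustration of the idea, not a description of any particular CPU: each branch address maps to a counter, and states in the upper half predict "taken".

```python
class TwoBitPredictor:
    """Dynamic branch prediction with one 2-bit saturating counter
    per branch: states 0-1 predict not-taken, states 2-3 predict taken."""
    def __init__(self):
        self.counters = {}  # branch address -> state 0..3

    def predict(self, addr):
        return self.counters.get(addr, 0) >= 2

    def update(self, addr, taken):
        state = self.counters.get(addr, 0)
        self.counters[addr] = min(state + 1, 3) if taken else max(state - 1, 0)

# A loop-closing branch: taken nine times, then not taken once at loop exit
bp = TwoBitPredictor()
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    correct += (bp.predict(0x400100) == taken)
    bp.update(0x400100, taken)
print(correct, "of", len(outcomes))  # prints: 7 of 10
```

The two-bit hysteresis means one surprising outcome (the loop exit) does not immediately flip the prediction for the next run of the loop.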



Delayed Branching

Consider the following code:

add $r4, $r5, $r6
beq $r1, $r2, 40
lw  $r3, 300($0)

The branch operation does not depend on the add
– We can therefore start execution of beq before we start add
– The result of the branch is then known in plenty of time
– The actual branch itself is not taken until the correct place in the code: the branch is delayed by one instruction

The compiler does this invisibly, keeping the desired branch behaviour, but making sure the machine does not stall on a branch
– It can typically fill about 50% of branch delay slots with useful instructions


Data Hazards

Branches are not the only kind of hazard that can cause pipeline stalls. Consider:

add $r1, $r2, $r3
sub $r4, $r5, $r1

The sub instruction depends on the result of the add, so we can't start the sub until the add has completed
– The pipeline stalls

This is very common, and instruction reordering is impractical here

The solution is to note that the result of the add is available after the third stage of the pipeline, but is not in the destination register until stage five
– We insert a short-cut in hardware that allows us to get the ALU result directly from the ALU at stage 3 of the pipeline
– We can then route this back in as an input for the subtract

This is known as forwarding or bypassing
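The forwarding decision can be sketched as a comparison between pipeline registers. This follows the usual textbook formulation rather than anything given explicitly in the lecture: if the instruction now in the EX/MEM register writes a register that the instruction entering EX reads, the ALU result is forwarded instead of the stale register-file value.

```python
def needs_forwarding(ex_mem_dest, id_ex_src1, id_ex_src2):
    """Return which ALU inputs must take the forwarded EX/MEM result.
    Register $0 is hard-wired to zero in MIPS and is never forwarded."""
    if ex_mem_dest is None or ex_mem_dest == 0:
        return (False, False)
    return (ex_mem_dest == id_ex_src1, ex_mem_dest == id_ex_src2)

# add $r1, $r2, $r3   (writes $r1, now in EX/MEM)
# sub $r4, $r5, $r1   (reads $r5 and $r1, now entering EX)
print(needs_forwarding(ex_mem_dest=1, id_ex_src1=5, id_ex_src2=1))
# -> (False, True): forward the add's result to the ALU's second input
```

With this short-cut in place, the add/sub pair from the example above runs back-to-back with no stall cycles at all.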

Implementing a pipeline

Pipelined CPUs are very complex and we won't study the detail too closely, but it is useful to consider how pipelining is implemented

In essence it is very simple. We split the datapath into pieces, each piece responsible for one portion of the pipeline
– Eg: the ALU executes stage 3 of the pipeline

We add so-called pipeline registers to the datapath, which means that during each clock cycle the processor only needs to push an instruction through one pipeline stage
– The pipeline registers store the results of intermediate pipeline stages; each one holds the intermediate state of a different instruction
– Every clock cycle, the data moves forward one stage in the pipeline

As a result, the CPU can be run at a much higher clock speed, since data has to pass through less combinational logic between registers

Pipelined datapaths are where microprogrammed control becomes useful
– Each successive microinstruction configures only the relevant stage of the pipeline


Pipelined datapath

In a very rough schematic, we might implement a pipelined datapath in the following way:

[Diagram: the single-cycle datapath split by pipeline registers between the stages, including the Data Memory stage]
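The role of the pipeline registers can be caricatured as a shift register of in-flight instructions: on every clock edge, everything moves forward one stage. This is a deliberately stylised toy model (real pipeline registers carry control signals and intermediate data, not instruction names):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def clock(pipeline, next_instr):
    """One clock cycle: shift every in-flight instruction forward one
    stage. pipeline[i] holds the instruction currently in STAGES[i]."""
    retired = pipeline[-1]                  # the instruction leaving WB
    return [next_instr] + pipeline[:-1], retired

pipeline = [None] * len(STAGES)
program = ["lw", "add", "sub", "or", "and"]
retired_order = []
for instr in program + [None] * len(STAGES):   # drain with bubbles
    pipeline, done = clock(pipeline, instr)
    if done:
        retired_order.append(done)

print(retired_order)  # instructions complete one per cycle, in order
```

One instruction completes per cycle once the pipeline is full, which is exactly why the clock period only needs to cover a single stage's worth of logic.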



Pipelined control is very hard


Superscalar machines

In recent years, as transistor density has increased exponentially, designers have been able to put more and more functional units on a chip

Superscalar machines have multiple pipelines which can execute multiple instructions in parallel
– With N pipelines, we can get up to a factor of N performance increase

This comes at a cost: control is very hard
– Instruction ordering is a problem: we can't execute instructions in parallel if they are co-dependent
– Dealing with hazards is very hard indeed

Techniques exist for resolving some of these issues
– Out-of-order execution allows instructions to be re-ordered dynamically if one pipeline stalls
– Speculative execution allows the CPU to continue to execute code whilst waiting for a hazard to be resolved

This is extremely complex and beyond the scope of this course


A superscalar pipeline

[Diagram: a fetch-and-decode unit fetches and decodes instructions and allocates them to pipelines; a reservation station for each pipeline holds instructions and their operands; execution units (including load/store units) carry out the operations; a commit unit commits the results to registers/memory]
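The instruction-ordering problem mentioned above boils down to a dependence check before issue. A minimal sketch of the idea (an illustration only, not a real issue-logic design): two adjacent instructions can issue in the same cycle only if the second one neither reads nor overwrites the first one's destination register.

```python
def can_dual_issue(first, second):
    """Each instruction is modelled as (dest_reg, {source_regs}).
    The pair may issue in parallel only if the second instruction
    neither reads nor overwrites the first one's destination."""
    dest1, _ = first
    dest2, srcs2 = second
    if dest1 in srcs2:      # read-after-write dependence
        return False
    if dest1 == dest2:      # write-after-write dependence
        return False
    return True

# add $r1, $r2, $r3 ; sub $r4, $r5, $r1  -> dependent, must serialise
print(can_dual_issue((1, {2, 3}), (4, {5, 1})))   # False
# add $r1, $r2, $r3 ; or  $r6, $r7, $r8  -> independent, can dual-issue
print(can_dual_issue((1, {2, 3}), (6, {7, 8})))   # True
```

Real superscalar control must make checks like this across every pairing of in-flight instructions, every cycle, which is why it is so hard to get right.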




Summary

In this lecture we have studied some techniques for improving the performance of a modern CPU
– Pipelining: breaking instructions up into smaller pieces
– Superscalar machines containing multiple pipelines

We've studied some of the problems that these can cause
– Branch hazards
– Data hazards

And the solutions to these problems
– Branch prediction
– Delayed branching
– Forwarding/bypassing

This concludes our study of computer architectures

Next lecture we will introduce some basic concepts from computer networking

