Lecture 15: Improving Performance

Dr Iain Styles, School of Computer Science, December 2006

The basic processor design (MIPS) that we have considered in this course is certainly functional, but is by no means optimal. There are a variety of techniques we can use to improve the performance of our machine.

We have already seen one method: caches.
– These can help to speed up memory accesses

Other techniques can be used to:
– Improve the processor's throughput of instructions
– Execute more than one instruction at a time
– Predict what the machine will do next

We will outline how these techniques work, and what benefits they bring to the machine and its performance.

Pipelining

The basic idea behind pipelining is that we don't always have to wait for something (an instruction) to finish completely before starting the next thing (another instruction).

Consider a domestic analogy: washing up. In this analogy, plates/cups/knives/forks take the role of instructions. In the washing-up process, we can identify four specific operations that must be done to each item:
1) Take the plate from the pile of dirty plates
2) Wash the plate and put it on the drainer
3) Take the plate from the drainer and dry it
4) Put the plate away in the appropriate cupboard

A non-pipelined approach to washing-up

In a non-pipelined approach, we would tackle the washing up in the following way:
– First we take the plate from the pile of dirty plates
– We wash it and place it on the drainer
– We then take the plate from the drainer and dry it
– Finally, we put the plate in the cupboard
– Then, and only then, do we go back to the next plate

This is a rather slow approach to doing the washing-up, yet it is exactly what we do in our non-pipelined processor:
– Fetch instruction
– Decode instruction
– Execute instruction
– Fetch next instruction

It is not difficult to conceive of a better scheme. It does introduce extra complexity (in the case of the washing up, some extra people), but can lead to tremendous performance benefits.

Pipelining the washing-up

In a pipelined approach to the washing-up, we would work in the following way:
– Person A takes plate 1 from the dirty pile, washes it, and puts it on the drainer
– Person B takes plate 1 from the drainer and dries it, whilst person A is washing plate 2
– Person C takes plate 1 from person B and puts it away; person B takes plate 2 from the drainer and dries it, whilst person A fetches plate 3 from the pile of plates

This is referred to as a three-stage pipeline. Each plate takes the same amount of time as before to be processed, but three times as many plates can be processed in the same time period!

Pipelining MIPS instructions

It is no harder to pipeline MIPS instructions than it is to pipeline the washing up. We will consider a five-stage pipeline:
1) Fetch instruction from memory
2) Decode instruction and read registers (these can happen simultaneously in MIPS)
3) Execute the operation or calculate a memory address
4) Access memory if required
5) Write the result into a register if required

The important thing is that different instructions can be at different stages in the pipeline simultaneously. The instructions are allowed to overlap in the same way that plates in our washing-up analogy can be at different stages of the process.
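To put numbers on the benefit, here is a minimal Python sketch of the cycle counts, under the idealised assumptions that every stage takes exactly one clock cycle and that nothing ever stalls:

    # Cycle counts for a five-stage pipeline vs. purely sequential execution.
    # Idealised: one cycle per stage, no stalls or hazards.

    STAGES = 5  # IFetch, Decode, Execute, Memory access, Register writeback

    def sequential_cycles(n_instructions: int) -> int:
        """Each instruction runs to completion before the next one starts."""
        return n_instructions * STAGES

    def pipelined_cycles(n_instructions: int) -> int:
        """Instructions overlap: once the pipeline is full, one completes per cycle."""
        return STAGES + (n_instructions - 1)

    for n in (1, 5, 100, 1000):
        seq, pipe = sequential_cycles(n), pipelined_cycles(n)
        print(f"{n:5d} instructions: {seq:5d} sequential, {pipe:5d} pipelined "
              f"(speedup {seq / pipe:.2f}x)")

For a long run of instructions the speedup approaches the number of pipeline stages, five in this case.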

How MIPS instructions are executed

Without pipelining, each instruction passes through all five stages before the next one begins:

    lw  : IFetch Decode ALU    Data   Reg
    lw  :                                 IFetch Decode ALU    Data   Reg
    add :                                                                IFetch Decode ALU    Data   Reg

With pipelining, the instructions overlap, each one stage behind its predecessor:

    lw  : IFetch Decode ALU    Data   Reg
    lw  :        IFetch Decode ALU    Data   Reg
    add :               IFetch Decode ALU    Data   Reg

Problems with pipelining

In code where instructions are executed sequentially, pipelining is a great way of increasing performance:
– We can just keep putting new instructions into the pipeline

If we have branches, though, there is a problem:
– We can't start executing the instruction after the branch, as we don't know which one it will be!
– The pipeline is said to stall

At a branch we have a choice of two possible instructions to execute next:
– If the branch is not taken, the instruction we want is the next in the program sequence
– If the branch is taken, we must first compute the branch address before we know which instruction we must execute

We need some way of predicting which instruction goes next.

Branch Prediction

There are two simple choices we could make with regard to branch prediction.

The first is to assume that, by default, branches are not taken:
– In this case, we allow the CPU to just execute the next instruction in the code
– The result of the branch arrives some time later
– If the branch is not taken, we can just carry on, since we're already executing the right code
– If the branch is taken, we've executed the wrong piece of code, and we must clear (flush) the pipeline of the incorrect instructions and start executing the right ones

The obvious alternative is to assume that branches are always taken. This is not quite so simple, as we don't know where the next instruction is:
– We must first compute the branch address
– If we add specific hardware for computing branch addresses, we can usually do this with only one cycle of delay to the pipeline
– We can then start executing the branch target whilst awaiting the result of the branch
– Of course, if our guess is wrong, we must flush the pipeline and fetch the right instruction
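The arithmetic of the "assume not taken" policy is easy to sketch in Python. The one-cycle-per-branch baseline and the two-cycle flush penalty below are illustrative assumptions (two cycles is what is at stake in the diagrams of the next section):

    # Cost of "assume not taken": a correct guess costs nothing extra,
    # a taken branch flushes the pipeline. All figures illustrative.

    def branch_cycles(n_branches: int, taken_fraction: float,
                      flush_penalty: int = 2) -> float:
        """One cycle per branch, plus the flush penalty for every branch
        that turns out to be taken (i.e. every wrong guess)."""
        return n_branches + n_branches * taken_fraction * flush_penalty

    # The policy is cheap for rarely taken branches and expensive for
    # loop-style branches that are almost always taken:
    for taken in (0.1, 0.5, 0.9):
        print(f"taken {taken:.0%}: {branch_cycles(1000, taken):.0f} cycles "
              f"for 1000 branches")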

Two approaches to branch prediction

Assume branches are never taken:

    lw  : IFetch Decode ALU    Data   Reg
    beq :        IFetch Decode ALU    Data   Reg
    add :               IFetch Decode ALU    Data   Reg

The result of the branch is known at the end of the beq's ALU stage; by fetching the add straight away rather than waiting, we save two cycles whenever the guess is right.

Assume branches are always taken:

    lw  : IFetch Decode ALU    Data   Reg
    beq :        IFetch Decode ALU    Data   Reg
    or  :               bubble IFetch Decode ALU    Data   Reg

The branch address is not known until the beq has been decoded, so there is a one-cycle pipeline bubble before the branch target (the or) can be fetched.

More sophisticated branch prediction

The all-or-none approach to branch prediction is rather simplistic. Some machines use dynamic branch prediction, which looks at the context in which a branch is used:
– Example: a branch which breaks out of a loop when the termination condition is met is far more likely not to be taken than it is to be taken

Other approaches are statistical:
– The CPU keeps a history for each branch and uses it to predict future behaviour
– This gets it right about 90% of the time

In delayed branching, the code is reordered so that the branch can be computed as early as possible:
– We can then know the result of the branch in advance
– This only works if the branch does not depend on the preceding instruction
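The slides do not name a particular history mechanism, so as an illustrative sketch, here is a two-bit saturating counter, one common way of keeping a per-branch history. On a loop branch that is taken nine times out of ten, it achieves exactly the sort of 90% accuracy quoted above:

    # A per-branch two-bit saturating counter (an assumed scheme, for
    # illustration). Counts 0-1 predict "not taken", 2-3 predict "taken";
    # it takes two wrong outcomes in a row to flip the prediction, so a
    # single odd iteration does not retrain the predictor.

    class TwoBitPredictor:
        def __init__(self) -> None:
            self.counter = 2  # start out weakly predicting "taken"

        def predict(self) -> bool:
            return self.counter >= 2

        def update(self, taken: bool) -> None:
            if taken:
                self.counter = min(3, self.counter + 1)
            else:
                self.counter = max(0, self.counter - 1)

    # A loop branch: taken nine times, then not taken once, repeated.
    history = ([True] * 9 + [False]) * 10
    predictor = TwoBitPredictor()
    correct = 0
    for taken in history:
        correct += predictor.predict() == taken
        predictor.update(taken)

    print(f"{correct}/{len(history)} predictions correct")  # 90/100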

Delayed Branching

Consider the following code:

    add $r4, $r5, $r6
    beq $r1, $r2, 40
    lw  $r3, 300($0)

– The branch operation does not depend on the add
– We can therefore start execution of the beq before we start the add
– The result of the branch is then known in plenty of time
– The actual branch itself is not taken until the correct place in the code: the branch is delayed by one instruction

The compiler does this invisibly, keeping the desired branch behaviour, but making sure the machine does not stall on a branch. Typically about 50% of branch delay slots can be filled with useful instructions.
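The dependence check behind this is simple to sketch. In the Python below, the tuple encoding of instructions (text, registers written, registers read) is purely illustrative; the rule is that an instruction may be hoisted past the branch only if the branch reads none of the registers it writes:

    # Can the instruction before a branch be moved into the delay slot?
    # Instructions are modelled as (text, registers_written, registers_read).

    def can_fill_delay_slot(prev_instr, branch) -> bool:
        """Hoisting is safe only if the branch does not read any register
        that the preceding instruction writes."""
        _, prev_writes, _ = prev_instr
        _, _, branch_reads = branch
        return not (prev_writes & branch_reads)

    add = ("add $r4, $r5, $r6", {"$r4"}, {"$r5", "$r6"})
    beq = ("beq $r1, $r2, 40", set(), {"$r1", "$r2"})

    if can_fill_delay_slot(add, beq):
        # The branch is computed first; the add fills its delay slot.
        print("reordered:", beq[0], "then", add[0])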

Data Hazards

Branches are not the only kind of hazard that can cause pipeline stalls. Consider:

    add $r1, $r2, $r3
    sub $r4, $r5, $r1

– The sub instruction depends on the result of the add
– So we can't start the sub until the add has completed: the pipeline stalls

This situation is very common, and instruction reordering is impractical. The solution is to note that the result of the add is available after the third stage of the pipeline, even though it does not reach the destination register until stage five. We insert a short-cut in hardware that allows us to take the result directly from the ALU at stage 3 of the pipeline, and route it back in as an input for the subtract. This is known as forwarding or bypassing.
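A minimal Python model of the forwarding decision may help; the register values and the read_operand helper are illustrative, not a description of real hardware:

    # Forwarding: if the operand register is the destination of the
    # instruction still in flight, take the value from the ALU output
    # instead of the (stale) register file.

    def read_operand(reg, regfile, alu_out, alu_dest):
        """Return the operand value, forwarding from the ALU if needed."""
        if reg == alu_dest:
            return alu_out      # short-cut: ALU output -> ALU input
        return regfile[reg]     # no hazard: read the register file

    regfile = {"$r1": 0, "$r2": 7, "$r3": 5, "$r5": 20}

    # add $r1, $r2, $r3 is in flight: its result exists at the ALU output
    # (end of stage 3) but has not yet been written back to $r1 (stage 5).
    alu_out, alu_dest = regfile["$r2"] + regfile["$r3"], "$r1"

    # sub $r4, $r5, $r1 can now proceed without stalling:
    r4 = regfile["$r5"] - read_operand("$r1", regfile, alu_out, alu_dest)
    print("sub result:", r4)  # 20 - 12 = 8, using the forwarded value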

Implementing a pipeline

Pipelined CPUs are very complex and we won't study the detail too closely, but it is useful to consider how pipelining is implemented. In essence it is very simple:
– We split the datapath into pieces, each piece responsible for one portion of the pipeline (eg: the ALU executes stage 3)
– We add so-called pipeline registers to the datapath, which means that during each clock cycle the processor only needs to push an instruction through one pipeline stage
– The pipeline registers store the results of intermediate pipeline stages
– As a result, the CPU can be run at a much higher clock speed, since data has to pass through less combinational logic between registers

Pipelined datapath

In a very rough schematic, we might implement a pipelined datapath in the following way:

(Diagram: the datapath units IFetch, Registers, ALU and Data Memory, with a pipeline register between each successive pair of stages.)

– Each pipeline register stores an intermediate stage of a different instruction
– Every clock cycle, the data moves forward one stage in the pipeline

Pipelined datapaths are where microprogrammed control becomes useful: each successive microinstruction configures only the relevant stage of the pipeline. Even so, pipelined control is very hard.
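A toy Python model of the pipeline registers makes the idea concrete. The representation (one latch per stage, shifted every cycle) is an illustrative sketch, not a real datapath:

    # Pipeline registers as a row of latches, one per stage. On every clock
    # edge each instruction advances exactly one stage, so the logic between
    # two registers is only one stage deep: this is what permits the higher
    # clock speed.

    STAGES = ["IFetch", "Decode", "Execute", "Memory", "Writeback"]

    def run(program):
        latches = [None] * len(STAGES)  # what each stage currently holds
        waiting = list(program)
        cycle = 0
        while waiting or any(latches):
            cycle += 1
            # Clock edge: everything shifts one stage; the oldest instruction
            # leaves Writeback and a new one (if any) enters IFetch.
            latches = [waiting.pop(0) if waiting else None] + latches[:-1]
            active = {stage: i for stage, i in zip(STAGES, latches) if i}
            print(f"cycle {cycle}: {active}")

    run(["lw A", "lw B", "add C"])  # completes in 5 + 3 - 1 = 7 cycles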

Superscalar machines

In recent years, as transistor density has increased exponentially, designers have been able to put more and more functional units on a chip. Superscalar machines have multiple pipelines which can execute multiple instructions in parallel: with N pipelines, we can get up to a factor of N performance increase.

This comes at a cost: control is very hard.
– Instruction ordering is a problem: we can't execute instructions in parallel if they are co-dependent
– Dealing with hazards is very hard indeed

Techniques exist for resolving some of these issues:
– Out-of-order execution allows instructions to be re-ordered dynamically if one pipeline stalls
– Speculative execution allows the CPU to continue to execute code whilst waiting for a hazard to be resolved

This is extremely complex and beyond the scope of this course.

A superscalar pipeline

(Diagram: a fetch-and-decode unit fetches and decodes instructions and allocates them to pipelines; a reservation station for each pipeline holds instructions and their operands; the pipelines themselves, here two ALUs and a load/store unit, execute the operations; a commit unit commits the results to registers/memory.)
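To see why co-dependent instructions block parallel issue, here is a minimal sketch of a dual-issue check; the tuple encoding of instructions is illustrative:

    # Two adjacent instructions may issue together only if the second one
    # neither reads nor writes a register written by the first.
    # Instructions are modelled as (text, registers_written, registers_read).

    def can_dual_issue(first, second) -> bool:
        f_writes = first[1]
        return not (f_writes & second[2]) and not (f_writes & second[1])

    i1 = ("add $r1, $r2, $r3", {"$r1"}, {"$r2", "$r3"})
    i2 = ("sub $r4, $r5, $r6", {"$r4"}, {"$r5", "$r6"})
    i3 = ("or  $r7, $r1, $r4", {"$r7"}, {"$r1", "$r4"})

    print(can_dual_issue(i1, i2))  # True: independent, can go in parallel
    print(can_dual_issue(i2, i3))  # False: the or reads $r4, written by the sub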

 

Conclusions

In this lecture we have studied some techniques for improving the performance of a modern CPU:
– Pipelining: breaking instructions up into smaller pieces
– Superscalar machines containing multiple pipelines

We've studied some of the problems that these can cause:
– Branch hazards
– Data hazards

And the solutions to these problems:
– Branch prediction
– Delayed branching
– Forwarding/bypassing

This concludes our study of computer architectures. Next lecture we will introduce some basic concepts from computer networking.


				