Lecture 15
Improving Performance
Dr Iain Styles, School of Computer Science
December 2006

Improving Performance
The basic processor design (MIPS) that we have considered in this course is certainly functional, but is by no means optimal.
There are a variety of techniques we can use to improve the performance of our machine.
We have already seen one method: caches
– These can help to speed up memory accesses
Other techniques can be used to
– Improve the processor's throughput of instructions
– Execute more than one instruction at a time
– Predict what the machine will do next
We will outline how these techniques work, and what benefits they bring to the machine and its performance.

Pipelining
The basic idea behind pipelining is that we don't always have to wait for something (an instruction) to finish completely before starting the next thing (another instruction).
Consider a domestic analogy: washing up.
In this analogy, plates/cups/knives/forks take the role of instructions.
In the washing-up process, we can identify four specific operations that must be done to each item:
1) Take the plate from the pile of dirty plates
2) Wash the plate and put it on the drainer
3) Take the plate from the drainer and dry it
4) Place the dried plate in the appropriate cupboard
In a non-pipelined approach, we would tackle the washing up in the following way:

A non-pipelined approach to washing-up
First we would take the plate from the pile of dirty plates.
We wash it and place it on the drainer.
We then take the plate from the drainer and dry it.
Finally, we put the plate in the cupboard.
Then, and only then, do we go back to the next plate.
This is a rather slow approach to doing the washing-up, yet this is exactly what we do in our non-pipelined processor
– Fetch instruction
– Decode instruction
– Execute instruction
– Fetch next instruction
It is not difficult to conceive of a better scheme.
It does introduce extra complexity (in the case of the washing up, some extra people), but can lead to tremendous performance benefits.

Pipelining the washing-up
In a pipelined approach to the washing-up, we would work in the following way:
Person A would take plate 1 from the dirty pile, wash it up, and put it on the drainer.
Person B would take plate 1 from the drainer and dry it, whilst person A is washing plate 2.
Person C takes plate 1 from person B and puts it away. Person B takes plate 2 from the drainer and dries it, whilst person A fetches plate 3 from the pile of plates.
Each plate takes the same amount of time as before to be processed, but once all three people are busy, up to three times as many plates can be processed in the same time period!
This is referred to as a three-stage pipeline.

Pipelining MIPS instructions
It is no harder to pipeline MIPS instructions than it is to pipeline the washing up.
We will consider a five-stage pipeline
1) Fetch instruction from memory
2) Decode instruction and read registers (can happen simultaneously in MIPS)
3) Execute the operation or calculate memory address
4) Access memory if required
5) Write result into a register if required
The important thing is that different instructions can be at different stages in the pipeline simultaneously.
The instructions are allowed to overlap in the same way that plates in our washing-up analogy can be at different stages of the process.

How MIPS instructions are executed
Without pipelining:
[Figure: lw, lw, add executed one after another; each instruction passes through IFetch, Decode, ALU, Data, Reg before the next one starts]
With pipelining:
[Figure: lw, lw, add overlapped; each instruction passes through IFetch, Decode, ALU, Data, Reg, starting one stage behind the previous instruction]

Problems with pipelining
In code where instructions are executed sequentially, pipelining is a great way of increasing performance
– We can just keep putting new instructions in the pipeline
If we have branches though, there is a problem
– We can't start executing the instruction after the branch as we don't know which one it will be!
– The pipeline is said to stall
At a branch we have a choice of two possible instructions to execute next
– If the branch is not taken, the instruction we want is the next in the program sequence
– If the branch is taken, we must first compute the branch address before we know which instruction we must execute
We need some way of predicting which instruction goes next.

Branch Prediction
There are two simple choices we could make with regard to branch prediction.
We could assume that, by default, branches are not taken.
In this case, we allow the CPU to just execute the next instruction in the code.
The result of the branch arrives some time later.
If the branch is not taken, then we can just carry on, since we're already executing the right code.
If the branch is taken, then we've executed the wrong piece of code.
We must clear (flush) the pipeline of the incorrect instructions and start executing the right ones.

Branch Prediction
The obvious alternative is to assume that branches are always taken.
This is not quite so simple, as we don't know where the next instruction is
– We first must compute the branch address
If we add specific hardware for computing branch addresses, then we can usually do this with only one cycle of delay to the pipeline.
We can then start executing the next instruction whilst awaiting the result of the branch.
Of course, if our guess is wrong, we must flush the pipeline and get the right instruction.

Two approaches to branch prediction
Assume branches are never taken
[Figure: pipeline diagram for lw, beq, add, each passing through IFetch, Decode, ALU, Data, Reg. Annotation: "Result of branch is known here: save two cycles"]
Assume branches are always taken
[Figure: pipeline diagram for lw, beq, or, with a pipeline bubble after the branch. Annotation: "Branch address not known until here"]

More sophisticated branch prediction
The all-or-none approach to branch prediction is rather simplistic.
Some machines use dynamic branch prediction which looks at the context in which a branch is used
– Example: a branch which breaks out of a loop when the termination condition is met is far more likely to not be taken than it is to be taken
Other approaches are statistical.
The CPU keeps a history for each branch and uses it to predict future behaviour
– Gets it right about 90% of the time
In delayed branching, the code is reordered so that the branch can be computed as early as possible
– Can then know the result of the branch in advance
– Only works if the branch does not depend on the preceding instruction

Delayed Branching
Consider the following code:
    add $r4, $r5, $r6
    beq $r1, $r2, 40
    lw $r3, 300($0)
The branch operation does not depend on the add.
We can therefore start execution of beq before we start add.
The result of the branch is then known in plenty of time.
The actual branch itself is not taken until the correct place in the code
– The branch is delayed by one instruction
The compiler does this invisibly, keeping the desired branch behaviour, but making sure the machine does not stall on a branch.
Can typically fill about 50% of branch delay slots with useful instructions.

Data Hazards
Branches are not the only kind of hazard that can cause pipeline stalls.
Consider
    add $r1, $r2, $r3
    sub $r4, $r5, $r1
The sub instruction depends on the result of the add.
So we can't start the sub until the add has completed
– The pipeline stalls
This kind of dependency is very common, and reordering instructions to avoid it is impractical.
The solution to this is to note that the result of the add is available after the third stage of the pipeline, but is not in the destination register until stage five.
We insert a short-cut in hardware that allows us to get the ALU result directly from the ALU at stage 3 of the pipeline.
We can then route this back in as an input for the subtract.
This is known as forwarding or bypassing.

Implementing a pipeline
Pipelined CPUs are very complex and we won't study the detail too closely.
But it is useful to consider how pipelining is implemented.
In essence it is very simple.
We split the datapath into pieces, each piece responsible for one portion of the pipeline
– E.g. the ALU executes stage 3 of the pipeline
We add so-called pipeline registers to the datapath, which means that during each clock cycle, the processor only needs to push an instruction through one pipeline stage.
The pipeline registers store the results of intermediate pipeline stages.
As a result of this, the CPU can be run at a much higher clock speed, since data has to pass through less combinational logic between registers.

Pipelined datapath
In a very rough schematic, we might implement a pipelined datapath in the following way:
[Figure: schematic of a pipelined datapath. Labelled blocks: IFetch, Registers, Data Memory]
Each pipeline register stores an intermediate stage of a different instruction.
Every clock cycle, the data moves forward one stage in the pipeline.
Pipelined datapaths are where microprogrammed control becomes useful.
Each successive microinstruction configures only the relevant stage of the pipeline.
Pipelined control is very hard.

Superscalar machine
In recent years, as transistor density has increased exponentially, designers have been able to put more and more functional units on a chip.
Superscalar machines have multiple pipelines which can execute multiple instructions in parallel.
With N pipelines, we can get up to a factor of N performance increase.
This comes at a cost: control is very hard
– Instruction ordering is a problem: can't execute instructions in parallel if they are co-dependent
– Dealing with hazards is very hard indeed
Techniques exist for resolving some of these issues
– Out-of-order execution allows instructions to be re-ordered dynamically if one pipeline stalls
– Speculative execution allows the CPU to continue to execute code whilst waiting for a hazard to be resolved
This is extremely complex and beyond the scope of this course.

A superscalar pipeline
[Figure: block diagram of a superscalar pipeline, with the following units from top to bottom]
– Ifetch and decode: fetches and decodes instructions, and allocates them to pipelines
– Reservation stations (one per pipeline): hold instructions and operands for each pipeline
– ALU, ALU, Load/store: execute the operations
– Commit Unit: commits the results to registers/memory

Conclusions
In this lecture we have studied some techniques for improving the performance of a modern CPU
– Pipelining: breaking instructions up into smaller pieces
– Superscalar machines containing multiple pipelines
We've studied some of the problems that this can cause
– Branch hazards
– Data hazards
And the solutions to these problems
– Branch prediction
– Delayed branching
– Forwarding/bypassing
This concludes our study of computer architectures.
Next lecture we will introduce some basic concepts from computer networking.
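
To make the benefit of pipelining concrete, the following is a minimal sketch in Python that counts clock cycles with and without pipelining. It is an illustration under simplifying assumptions, not a model of any real MIPS implementation: every stage is assumed to take exactly one clock cycle, a non-pipelined instruction is assumed to occupy all stages in turn, and stalls (from branches or data hazards) are simply added as extra cycles. The stage names are taken from the pipeline figures in this lecture; the function names are hypothetical.

```python
# Stage names as used in the lecture's pipeline diagrams.
STAGES = ["IFetch", "Decode", "ALU", "Data", "Reg"]


def unpipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed when each instruction must finish before the next starts.

    Assumption: every stage takes exactly one clock cycle.
    """
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=len(STAGES), stall_cycles=0):
    """Cycles needed when a new instruction enters the pipeline every cycle.

    The first instruction needs n_stages cycles to fill the pipeline;
    each later instruction completes one cycle after the previous one.
    stall_cycles is an assumed total of extra cycles lost to hazards.
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


def timing_diagram(instructions):
    """Return one text row per instruction; instruction i starts in cycle i."""
    width = max(len(stage) for stage in STAGES) + 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * i + STAGES  # shift each instruction right by one cycle
        body = "".join(f"{cell:<{width}}" for cell in cells).rstrip()
        rows.append(f"{name:<4}" + body)
    return rows


if __name__ == "__main__":
    program = ["lw", "lw", "add"]
    for row in timing_diagram(program):
        print(row)
    print("without pipelining:", unpipelined_cycles(len(program)), "cycles")
    print("with pipelining:   ", pipelined_cycles(len(program)), "cycles")
```

Under these assumptions, the three-instruction sequence lw, lw, add needs 15 cycles without pipelining and 7 with it. For the washing-up analogy, `unpipelined_cycles(100, 3)` gives 300 time steps for 100 plates, whereas `pipelined_cycles(100, 3)` gives 102, which is why the speed-up approaches, but never quite reaches, the number of stages.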