Pipeline Processor Performance Evaluation

 1.   What is processor
 2.   Processor Performance
 3.   Why use Pipeline
 4.   Concept and motivation
 5.   Design Considerations
 6.   Pipeline Implementations
 7.   Pipeline Evaluation
 8.   Source of reference
A processor is the logic circuitry that responds to and processes the basic instructions that drive a
computer. Basically, it is an electronic circuit that executes computer programs and contains a
processing unit and a control unit.

The performance or speed of a processor depends on the clock rate (generally given in multiples
of hertz) and the instructions per clock (IPC), which together determine the instructions
per second (IPS) that the CPU can perform. Many reported IPS values have represented “peak”
execution rates on artificial instruction sequences with few branches, whereas realistic workloads
consist of a mix of instructions and applications, some of which take longer to execute than
others. The performance of the memory hierarchy also greatly affects processor performance,
an issue barely considered in MIPS calculations. Because of these problems, various
standardized tests, often called “benchmarks” (such as SPECint), have been
developed to attempt to measure the real effective performance in commonly used applications.
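
The relationship between clock rate, IPC, and IPS can be written out as a quick calculation. The sketch below is illustrative only: the function names and the 3 GHz / 2 IPC figures are assumptions chosen for the example, not measurements of any real processor.

```python
def instructions_per_second(clock_hz, ipc):
    """IPS is the product of clock rate (Hz) and instructions per clock."""
    return clock_hz * ipc


def to_mips(ips):
    """Express an IPS figure in millions of instructions per second."""
    return ips / 1_000_000


# Hypothetical processor: 3 GHz clock sustaining 2 instructions per clock.
peak_ips = instructions_per_second(3_000_000_000, 2)
print(to_mips(peak_ips))  # 6000.0
```

A number computed this way is a peak rate. It says nothing about branches or the memory hierarchy, which is why benchmark suites are used instead.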

To achieve better performance, most modern processors (super-pipelined, superscalar RISC, and
VLIW processors) have many functional units on which several instructions can be executed
simultaneously. An instruction starts execution if its issue conditions are satisfied. If not, the
instruction is stalled until its conditions are satisfied. Such an interlock (pipeline) delay
interrupts the fetching of successor instructions (or demands nop instructions on some
MIPS processors).

In computing, a pipeline is a set of data processing elements connected in series, where the
output of one element is the input of the next one. The elements of a pipeline are often executed in
parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted
between elements.

Computer-related pipelines include:

      Instruction pipelines, such as the classic RISC pipeline, which are used in processors to
       allow overlapping execution of multiple instructions with the same circuitry. The
       circuitry is usually divided up into stages, including instruction decoding, arithmetic, and
       register fetching stages, wherein each stage processes one instruction at a time.
      Graphics pipelines, found in most graphics cards, which consist of multiple arithmetic
       units, or complete CPUs, that implement the various stages of common rendering
       operations (perspective projection, window clipping, color and light calculation,
       rendering, etc.).
      Software pipelines, where commands can be written so that the output of one operation is
       automatically fed to the next, following operation. The Unix system call pipe is a classic
       example of this concept, although other operating systems support pipes as well.
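
The software-pipeline idea, where each element's output becomes the next element's input, can be sketched with Python generators. This is a stand-in for illustration, not the Unix pipe call itself; the stage names (`read_words`, `to_upper`, `keep_long`) and the sample text are hypothetical.

```python
def read_words(lines):
    """Stage 1: split each input line into words."""
    for line in lines:
        yield from line.split()


def to_upper(words):
    """Stage 2: transform each word as it arrives."""
    for word in words:
        yield word.upper()


def keep_long(words, min_len=4):
    """Stage 3: pass on only words of at least min_len characters."""
    for word in words:
        if len(word) >= min_len:
            yield word


lines = ["the quick brown fox", "jumps over the lazy dog"]
# Chain the stages: each stage consumes the previous stage's output.
result = list(keep_long(to_upper(read_words(lines))))
print(result)  # ['QUICK', 'BROWN', 'JUMPS', 'OVER', 'LAZY']
```

Because generators are lazy, each word moves through all three stages before the next word is read, which is the time-sliced style of execution mentioned earlier.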

Pipelining is a natural concept in everyday life, e.g. on an assembly line. Consider the assembly
of a car: assume that certain steps in the assembly line are to install the engine, install the hood,
and install the wheels (in that order, with arbitrary interstitial steps). A car on the assembly line
can have only one of the three steps done at once. After the car has its engine installed, it moves
on to having its hood installed, leaving the engine installation facilities available for the next car.
The first car then moves on to wheel installation, the second car to hood installation, and a third
car begins to have its engine installed. If engine installation takes 20 minutes, hood installation
takes 5 minutes, and wheel installation takes 10 minutes, then finishing all three cars when only
one car can be assembled at once would take 105 minutes. On the other hand, using the assembly
line, the total time to complete all three is 75 minutes. At this point, additional cars will come off
the assembly line at 20-minute increments.
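
The 105-minute and 75-minute figures can be checked with a small simulation. This is a minimal sketch under stated assumptions: the function names are made up for the example, each station handles one car at a time, and a car may wait between stations until the next one is free.

```python
def sequential_time(stage_minutes, n_items):
    """Total time when only one item is assembled at a time."""
    return n_items * sum(stage_minutes)


def pipelined_time(stage_minutes, n_items):
    """Finish time of the last item when stations work on different items at once."""
    station_free_at = [0] * len(stage_minutes)
    finish = 0
    for _ in range(n_items):
        t = 0  # time at which this item is ready for its next station
        for s, minutes in enumerate(stage_minutes):
            start = max(t, station_free_at[s])  # wait for the item and the station
            t = start + minutes
            station_free_at[s] = t
        finish = t
    return finish


stages = [20, 5, 10]  # engine, hood, wheels (minutes)
print(sequential_time(stages, 3))  # 105
print(pipelined_time(stages, 3))   # 75
print(pipelined_time(stages, 4))   # 95, i.e. one more car 20 minutes later
```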

Linear and non-linear pipelines
A linear pipeline processor is a series of processing stages arranged linearly to perform
a specific function over a data stream. The basic uses of a linear pipeline are instruction
execution, arithmetic computation and memory access.
A non-linear pipeline (also called a dynamic pipeline) can be configured to perform various
functions at different times. A dynamic pipeline also has feed-forward or feedback connections.
One key aspect of pipeline design is balancing pipeline stages. Using the assembly line example,
we could have greater time savings if both the engine and wheels took only 15 minutes.
Although the system latency would still be 35 minutes, we would be able to output a new car
every 15 minutes. In other words, a pipelined process outputs finished items at a rate determined
by its slowest part. (Note that if the time taken to add the engine could not be reduced below 20
minutes, it would not make any difference to the stable output rate if all other components
increased their production time to 20 minutes.)
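
The balancing argument comes down to two quantities: latency is the sum of the stage times, and the steady-state output interval is the slowest stage time. The sketch below uses hypothetical helper names and the stage times from the assembly line example.

```python
def latency(stage_minutes):
    """Time for one item to pass through every stage."""
    return sum(stage_minutes)


def output_interval(stage_minutes):
    """Steady-state time between finished items: the slowest stage."""
    return max(stage_minutes)


cases = [
    ("original", [20, 5, 10]),
    ("balanced", [15, 5, 15]),
    ("all at 20", [20, 20, 20]),
]
for label, stages in cases:
    print(label, latency(stages), output_interval(stages))
# original 35 20
# balanced 35 15
# all at 20 60 20
```

The last case matches the parenthetical note: slowing the other stages to 20 minutes lengthens the latency but leaves the output rate unchanged.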
Another design consideration is the provision of adequate buffering between the pipeline stages
— especially when the processing times are irregular, or when data items may be created or
destroyed along the pipeline.
To observe the scheduling of a pipeline (be it static or dynamic), reservation tables are used.
Reservation table
A reservation table for a linear or static pipeline can be generated easily because the data flow
follows a linear stream and a static pipeline performs one specific operation. A dynamic
or non-linear pipeline follows a non-linear pattern, so multiple reservation tables can
be generated, one for each function.
The reservation table mainly displays the time-space flow of data through the pipeline for a
function. Different functions follow different paths through the reservation table.
The number of columns in a reservation table specifies the evaluation time of a given function.
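
A reservation table can be represented as one row per stage and one column per time step. The sketch below is illustrative: the helper names are assumptions, and the non-linear table is a made-up example of a function that revisits stages, not one taken from a real processor.

```python
def linear_reservation_table(n_stages):
    """In a linear pipeline, stage s is used only at time step s."""
    return [[t == s for t in range(n_stages)] for s in range(n_stages)]


def evaluation_time(table):
    """Number of columns, i.e. time steps needed to evaluate the function."""
    return len(table[0])


def render(table):
    """Draw the table with X for a used (stage, time) cell and . for an idle one."""
    return "\n".join(
        f"S{s + 1} " + " ".join("X" if used else "." for used in row)
        for s, row in enumerate(table)
    )


linear = linear_reservation_table(3)
print(render(linear))
# S1 X . .
# S2 . X .
# S3 . . X

# Hypothetical non-linear function: feedback sends data back through S2 and S1.
nonlinear = [
    [True, False, False, False, True],
    [False, True, False, True, False],
    [False, False, True, False, False],
]
print(evaluation_time(linear), evaluation_time(nonlinear))  # 3 5
```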


Buffered, synchronous pipelines
Conventional microprocessors are synchronous circuits that use buffered, synchronous pipelines.
In these pipelines, "pipeline registers" are inserted between pipeline stages and
are clocked synchronously. The time between clock signals is set to be greater than the
longest delay between pipeline stages, so that when the registers are clocked, the data
written to them is the final result of the previous stage.
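
The clocking rule above means the slowest stage limits the clock. As a rough sketch, assuming hypothetical stage delays in nanoseconds and a hypothetical fixed register overhead (neither comes from a real design):

```python
def min_clock_period_ns(stage_delays_ns, register_overhead_ns=0.0):
    """The clock period must cover the slowest stage plus any register overhead."""
    return max(stage_delays_ns) + register_overhead_ns


def max_clock_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


delays = [0.75, 1.0, 0.5, 0.875]  # assumed per-stage logic delays, ns
period = min_clock_period_ns(delays, register_overhead_ns=0.25)
print(period, max_clock_mhz(period))  # 1.25 800.0
```

Shortening the three faster stages would not raise the clock rate; only the 1.0 ns stage matters.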
Buffered, asynchronous pipelines
Asynchronous pipelines are used in asynchronous circuits, and have their pipeline registers
clocked asynchronously. Generally speaking, they use a request/acknowledge system, wherein
each stage can detect when it's "finished". When a stage is finished and the next stage has sent it
a "request" signal, the stage sends an "acknowledge" signal to the next stage, and a "request"
signal to the previous stage. When a stage receives an "acknowledge" signal, it clocks its input
registers, thus reading in the data from the previous stage.
The AMULET microprocessor is an example of a microprocessor that uses buffered,
asynchronous pipelines.
Unbuffered pipelines
Unbuffered pipelines, called "wave pipelines", do not have registers in-between pipeline stages.
Instead, the delays in the pipeline are "balanced" so that, for each stage, the difference between
the first stabilized output data and the last is minimized. Thus, data flows in "waves" through the
pipeline, and each wave is kept as short (synchronous) as possible.
The maximum rate that data can be fed into a wave pipeline is determined by the maximum
difference in delay between the first piece of data coming out of the pipe and the last piece of
data, for any given wave. If data is fed in faster than this, it is possible for waves of data to
interfere with each other.

All regular ChipGeek readers have undoubtedly read about the number of pipeline stages each
processor has. Their number and use are big factors in overall performance, and they can really
speed up or slow down certain types of code. But what is a pipeline and why is it useful?

The pipeline itself comprises a whole task that has been broken out into smaller sub-tasks. The
concept actually has its roots in mass production manufacturing plants, such as Ford Motor
Company. Henry Ford determined long ago that even though it took several hours to physically
build a car, he could actually produce a car a minute if he broke out all of the steps required to
put a car together into different physical stations on an assembly line. As such, one station was
responsible for putting in the engine, another for the tires, another for the seats, and so on.

Using this logic, when the car assembly line was initially turned on it still took several hours to
get the first car to come off the end and be finished, but since everything was being done in steps
or stages, the second car was right behind it and was almost completed when the first one rolled
off. This followed with the third, fourth, and so on. Thus the assembly line was formed, and mass
production became a reality.

In computers, the same basic logic applies, but rather than producing something physical on an
assembly line, it is the workload itself (required to carry out the task at hand) that gets broken
down into smaller stages, called the pipeline.

Consider a simple operation. Suppose the need exists to take two numbers and multiply them
together and then store the result. As humans, we would just look at the numbers and multiply
them (or, if they're too big, punch them into a calculator) and then write down the result. We
wouldn't give much thought to the process; we would just do it.
Computers aren't that smart; they have to be told exactly how to do everything. So, a
programmer would have to tell the computer where the first number was, where the second
number was, what operation to perform (a multiply), and then where to store the result.

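This logic can be broken down into the following (greatly simplified) steps, or stages, of the
pipeline: locate the first number, locate the second number, perform the multiply, and store the result.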

This pipeline has four stages. Now suppose that each of these logical operations took one clock
cycle to complete (which is fairly typical in modern computers). That would mean the completed
task of multiplying two numbers together would take four clock cycles. However,
because the stages can work at the same time (in parallel) rather than one after another, each
stage can be busy with a different task: while one task is in its third stage, the next task is
in its second, and so on. The task itself still physically takes four clock cycles, but after
each clock cycle the output of one task is “retired,” or completed, meaning that task is
done. And, since we're doing things in a pipeline, each task, taking four clock
cycles to complete, can actually appear to be retired at a rate of one per clock cycle.
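
The cycle counts behind this claim are easy to work out. The sketch below assumes an idealized four-stage pipeline with no stalls, where every stage takes exactly one clock; the function names are chosen for the example.

```python
def unpipelined_cycles(n_tasks, n_stages=4):
    """Each task runs all of its stages before the next task starts."""
    return n_tasks * n_stages


def pipelined_cycles(n_tasks, n_stages=4):
    """The first task needs n_stages clocks; each later task retires one clock after."""
    if n_tasks == 0:
        return 0
    return n_stages + (n_tasks - 1)


for n in (1, 4, 100):
    print(n, unpipelined_cycles(n), pipelined_cycles(n))
# 1 4 4
# 4 16 7
# 100 400 103
```

For 100 tasks the pipelined total is 103 clocks, so the average cost per task approaches one clock as the number of tasks grows.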

This concept can be visualized with colors added to the previous image and the stages broken out
for each clock. Imagine each color representing a stage involved in processing a computer
instruction, and that each takes four clock cycles to complete. The red, green, and dark blue
instructions would've had other stages above our block, and the yellow, purple, and brown
instructions would need additional clock cycles after our block to complete. But, as you can see,
even with all of this going on simultaneously, after every single clock cycle an instruction (which
actually took four clocks to execute) is completed! This is the big advantage of processing data
via a pipeline.

This may seem a little confusing, so try to look at it this way. There are four units, and in every
clock cycle each unit is doing something. You can visualize each unit doing its own bit of work
with the following breakout:
Every clock cycle, each unit has something to do. And because each sub-task is known to
take only one clock cycle, each unit can count on its input being ready when the next clock cycle
begins: by definition, the unit before it has to complete its work in
one clock cycle. If it doesn't, then the processor isn't working like it's supposed to (this is one
reason why you can only overclock CPUs so far and no further, even with great cooling). And
because all of those units are working together, four-step instructions (or tasks) can be completed at
a rate of one per clock.
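
What each unit is doing on each clock can also be tabulated in code. This is a sketch only: the unit names in `STAGES` are assumptions based on the multiply example, and the model ignores stalls.

```python
STAGES = ["load A", "load B", "multiply", "store"]  # assumed unit names


def occupancy(n_instructions, stages=STAGES):
    """One row per clock; each entry is the 1-based instruction in that unit, or None."""
    n_cycles = len(stages) + n_instructions - 1 if n_instructions else 0
    rows = []
    for cycle in range(n_cycles):
        row = []
        for unit in range(len(stages)):
            instr = cycle - unit  # instruction that entered the pipeline `unit` clocks ago
            row.append(instr + 1 if 0 <= instr < n_instructions else None)
        rows.append(row)
    return rows


for clock, row in enumerate(occupancy(5), start=1):
    cells = [f"I{i}" if i is not None else "--" for i in row]
    print(f"clock {clock}: " + "  ".join(cells))
```

From clock 4 onward the last column ("store") holds a different instruction on every clock, which is the one-per-clock completion rate.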

The advantages of this as a speed-up potential should be obvious, especially when you consider
how many stages modern processors have (from 8 in Itanium 2 all the way up to 31 in
Prescott!). The terms “Super Pipelined” and “Hyper Pipelined” have become commonplace to
describe the extent to which this breakout has been employed.

Below is the pipeline for the Itanium 2. Each stage represents something that IA64 can do, and
once everything gets rolling the Itanium 2 is able to process data really, really quickly. The
problem with IA64 is that the compiler or assembly language programmer has to be extremely
thorough in figuring out the best way to keep all of those pipeline stages filled all of the time,
because when they're not filled the Itanium's performance goes down significantly:

I was hoping to find an image showing Prescott's 31 stages, but I couldn't. The closest I found
was a black-and-white comparison of the P6 core (Pentium III) and the original P7 core
(Willamette). If anyone has a link showing Prescott's 31 stages, please let us know.
Here is an Opteron pipeline shown through actual logic units as they exist on the chip. This will
help you visualize how the logical stages shown above for Itanium 2 might relate to physical
units on the CPU die itself:

As you can see, there are different parts to the pipeline all working together, just like on an
assembly line. They all relate to one another to do some real quantity of work. Some of it is
front-end preparation, some of it is actual execution; and once everything is completed, parts are
dedicated to “retiring data” or putting it back wherever it needs to go (main memory/cache or
something called an internal register, which is like a super-fast cache inside of the processor
itself, or an external data port, etc.).
It's worth noting that the hyper-pipelined design of Intel's Netburst (used in Willamette through
Prescott) has been found to be dead-ended when pushed to its extreme 31-stage pipeline in
Prescott. The reason for this is a penalty that comes from mis-predicting where the computer
program will go next. If the processor guesses wrong, it has to refill the pipeline, and that takes
many clock cycles before any real work can start flowing again (just like how it takes several
hours to make the first car). Another penalty is extreme heat generation at the high clock rates
seen in Prescott-based P4s.

As a result, real-world experience has shown that there is a trade-off between how deep your
pipeline can be and how deep it should be given the type of processing you're doing. Even though
on paper it might seem a better idea to have a 50-stage pipeline with a 50GHz clock rate, a
designer cannot simply go and build it–even though it would allow extremely complex tasks to
be completed 50 billion times per second (though with GaAs chips on the way, that might now
be possible).

Chip designers can't do it because there are real-world constraints that mandate a happy medium
between that ideal and the real-world actual. The biggest factor is how the computer program
jumps around constantly, calling sub-routines or functions, going over if..else..endif branches,
looping, etc. The processor is constantly running the risk of guessing a branch wrong, and when
it does it must invalidate everything it “guessed” on in the pipeline and begin to refill it
completely, and that takes away time and lowers your performance.
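
A back-of-the-envelope model shows how the refill penalty grows with pipeline depth. Everything in this sketch is an assumption made for illustration: the refill cost is taken to equal the pipeline depth, and the branch fraction and misprediction rate are invented round numbers, not measurements of Prescott or any other chip.

```python
def effective_cpi(depth, branch_fraction=0.2, mispredict_rate=0.05, base_cpi=1.0):
    """Average clocks per instruction once misprediction refills are included."""
    refill_penalty = depth  # assumption: a wrong guess costs one full pipeline refill
    return base_cpi + branch_fraction * mispredict_rate * refill_penalty


for depth in (10, 20, 31):
    print(depth, round(effective_cpi(depth), 2))
# 10 1.1
# 20 1.2
# 31 1.31
```

Under these assumptions the 31-stage design spends about 31% more clocks per instruction than the one-per-clock ideal, so part of what its higher clock rate buys is lost to refills.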

The imposed limitations on pipeline depth are simply the side-effect of running code via the
facilities within a processor available to carry out the workload. A processor just can't do stuff
the way a person can. Everything inside a CPU has to be programmed exactly as it needs to be,
with absolutely no margin for error or guesswork. Any error–any error whatsoever, no matter
how small–means the processor becomes totally and completely useless; it might as well not
even exist.

I hope this article has been informative. It should've given you a way to visualize a processor
pipeline, understand why it is important to performance, and help you put together how it all
works. You should be able to see why designs like Prescott (which take the pipeline depth to an
extreme) often come at a real-world performance cost. You should also appreciate why slower-
clocked processors (such as Itanium 2 at 1.8GHz) are able to do more work than much higher
clocked processors (like Pentium 4 at 4GHz). It's exactly because of the number of pipeline
stages, coupled with the number of available units inside of the chip that can do things in parallel.

The pipeline allows things to be done in parallel, and that means that a CPU's logic units are kept
as busy as possible as often as possible to make sure that the instructions keep flying off the end
at the highest rate possible.

Keep in mind that there are several other factors that speed up processing: processor concepts
such as OoO (Out of Order) execution, speculative execution, the benefits of cache, etc.

Stay tuned to ChipGeek for coverage of those, and keep your inner-geek close by.

Also, for your reading pleasure, here are some other online articles relating to pipelines and
pipeline stages: Ars Technica on pipelining in general and Opteron's pipeline; some info
on Prescott's die; and a history of Intel chips and their pipeline depths. This closing graphic will
summarize the trend from the original 8086 through today's Pentium 4.
Source of reference:

