Department of Computer Science
University of the West Indies
Searching for Parallelism
Goal of the computer architect:
Identify potential opportunities for parallelism at
every possible level and exploit them, e.g.:
• Bit level
• Instruction level
• Processor level
More parallelism within each CPU
Pipelined CPUs (increase instruction throughput)
Superscalar CPUs (multiple functional units; multiple instruction issues per clock)
“Superpipelined” CPUs (deeper pipelines with shorter stages)
Multi-threaded CPUs that run multiple instruction streams (so when
one stream stalls on memory or I/O, another stream can make progress)
More CPUs (100 to 10,000)
Hardware support for shared memory, and for “locks” on memory.
Hardware support for memory consistency (because a remote write can
change “local memory” at any time)
Hardware support for data movement between memories
Some CPUs divide the fetch-decode-execute
cycle into smaller steps.
These smaller steps can often be executed in
parallel to increase throughput.
Such parallel execution is called instruction-level
parallelism, often abbreviated to ILP in the literature.
Let's say that we have decided to go into the increasingly
lucrative SUV manufacturing business. After some
intense research, we determine that there are five
stages in the SUV building process, as follows:
Stage 1: build the chassis.
Stage 2: drop the engine in the chassis.
Stage 3: put doors, a hood, and coverings on the chassis.
Stage 4: attach the wheels.
Stage 5: paint the SUV.
There are five skilled crews ready to work, one on each stage in the process.
Our big strategy is to have the factory run as follows:
1. Line up all five crews in a row, and we have the first crew start an SUV
at Stage 1.
2. After Stage 1 is complete, the SUV moves down the line to the next
stage and the next crew drops the engine in.
While the Stage 2 Crew is installing the engine in the chassis that the
Stage 1 Crew just built, the Stage 1 Crew (along with all of the rest of the
crews) is free to go play football, watch the big-screen plasma TV in the
break room, surf the 'net, etc.
3. Once the Stage 2 Crew is done, the SUV moves down to Stage 3 and
the Stage 3 Crew takes over while the Stage 2 Crew hits the break
room to party with everyone else.
The SUV moves on down the line through all five stages this way, with
only one crew working on one stage at any given time while the rest
of the crews are idle.
Once the completed SUV finishes Stage 5, the crew at Stage 1 then
starts on another SUV.
At this rate, it takes exactly five hours to finish a single SUV, and our
factory puts out one SUV every five hours (assuming one hour per stage).
How can we improve the production?
Add a second production line using 5 additional skilled crews.
This increases throughput to two SUVs every five hours.
Requires a lot more money to pay for the extra crews.
Doubles the inefficiency, with twice the number of crews in the
break room at one time.
Finally a smart consultant hits upon a clever idea:
why let workers spend four-fifths of their day in the break room, when
they could be doing useful work during that time?
The revised workflow is now as follows:
The Stage 1 crew builds a chassis. Once the chassis is complete, they
send it on to the Stage 2 crew.
The Stage 2 crew receives the chassis and begins dropping the engine
in, while the Stage 1 crew starts on a new chassis.
When both Stage 1 and Stage 2 crews are finished, the Stage 2 crew's
work advances to Stage 3, the Stage 1 crew's work advances to Stage 2,
and the Stage 1 crew starts on a new chassis.
As the assembly line begins to fill up with SUVs in various stages of
production, more of the crews are put to work simultaneously until
all of the crews are working on a different vehicle in a different
stage of production.
If we can keep the assembly line full, and keep all five crews working at
once, then we can produce one SUV every hour: a five-fold improvement
in SUV completion rate over the previous completion rate of one SUV
every five hours.
That, in a nutshell, is pipelining.
While the total amount of time that each individual SUV spends in
production has not changed from the original 5 hours, the rate at
which the factory as a whole completes SUVs has increased five-fold.
All stages in the pipeline are working simultaneously
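The arithmetic behind that five-fold speedup can be sketched in a few lines of Python (an illustrative sketch; the function names are my own, not from the slides):

```python
# Completion time for n SUVs, one SUV at a time (the original workflow).
def sequential_time(n_suvs, stages=5, hours_per_stage=1):
    # Every SUV occupies the factory alone for stages * hours_per_stage hours.
    return n_suvs * stages * hours_per_stage

# Completion time for n SUVs with the pipelined workflow.
def pipelined_time(n_suvs, stages=5, hours_per_stage=1):
    # Fill the line once (stages hours), then one SUV completes every hour.
    return (stages + (n_suvs - 1)) * hours_per_stage

print(sequential_time(20))  # 100 hours
print(pipelined_time(20))   # 24 hours
```

Note that the very first SUV still takes five hours either way; only the rate of completion improves, and the advantage grows with the number of SUVs.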
In the von Neumann model of execution an instruction
starts only after its predecessor completes.
[Diagram: instr 1 runs to completion before instr 2 begins.]
This is not a very efficient model of execution,
due to the von Neumann bottleneck (the memory wall).
Almost all processors today use instruction pipelines to allow
overlap of instructions (Pentium 4 has a 20 stage pipeline!!!).
The execution of an instruction is divided into stages; each
stage is performed by a separate part of the processor.
An instruction passes through five stages: F D E M W
F: Fetch instruction from cache or memory.
D: Decode instruction.
E: Execute. ALU operation or address calculation.
M: Memory access.
W: Write back result into register.
Each of these stages completes its operation in one cycle
(shorter than the cycle in the von Neumann model).
An instruction still takes the same time to execute.
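The resulting cycle-by-cycle schedule can be sketched as follows (the stage letters follow the F/D/E/M/W convention above; the code itself is illustrative, not from the slides):

```python
STAGES = ["F", "D", "E", "M", "W"]

def schedule(n_instrs):
    # Row i shows which stage instruction i occupies in each cycle ('.' = not in pipeline).
    n_cycles = n_instrs + len(STAGES) - 1
    rows = []
    for i in range(n_instrs):
        row = ["."] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i enters stage s at cycle i + s
        rows.append(" ".join(row))
    return rows

for line in schedule(3):
    print(line)
# F D E M W . .
# . F D E M W .
# . . F D E M W
```

Reading down any column shows that every stage is busy with a different instruction once the pipeline fills.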
[Figure: a 4-stage pipeline vs. a single-cycle design. White space in the diagram is hardware sitting idle; in the pipelined version 5 instructions are processed after 9 ns.]
The length of the slowest stage will determine the length
of all the stages in the pipeline.
If one stage takes considerably longer than the others,
then many cycles are wasted as the other functional units sit idle.
The shorter each pipeline stage, the faster the clock
can run. Hence deeper pipelines allow higher
overall clock frequency.
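The effect of an unbalanced stage can be made concrete with a small sketch (the latencies are assumed numbers, chosen for illustration):

```python
# The slowest stage sets the clock period for the whole pipeline.
def pipeline_clock_ns(stage_latencies_ns):
    return max(stage_latencies_ns)

balanced   = [2, 2, 2, 2, 2]   # ns per stage: 10 ns of total work
unbalanced = [1, 1, 6, 1, 1]   # the same 10 ns of work, badly split

print(pipeline_clock_ns(balanced))    # 2 ns clock -> up to 1 instruction per 2 ns
print(pipeline_clock_ns(unbalanced))  # 6 ns clock -> 1 instruction per 6 ns, fast stages idle
```

Both designs do the same total work per instruction, but the unbalanced one runs three times slower because every stage must wait for the 6 ns stage.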
10 stage pipeline
Stage 1: build the chassis.
Crew 1a: Fit the parts of the chassis together and spot-weld the joins.
Crew 1b: Fully weld all the parts of the chassis.
Stage 2: drop the engine in the chassis.
Crew 2a: Place the engine in the chassis and mount it in place.
Crew 2b: Connect the engine to the moving parts of the car.
Stage 3: put doors, a hood, and coverings on the chassis.
Crew 3a: Put the doors and hood on the chassis.
Crew 3b: Put the other coverings on the chassis.
Stage 4: attach the wheels.
Crew 4a: Attach the two front wheels.
Crew 4b: Attach the two rear wheels.
Stage 5: paint the SUV.
Crew 5a: Paint the sides of the SUV.
Crew 5b: Paint the top of the SUV.
A pipeline will only work at peak efficiency when all
stages are filled. The initial filling of a pipeline can
impact performance in the early stages of a program's execution.
Many pipeline flushes will have a negative impact on performance.
Pipeline being filled
In reality pipelining isn’t totally “free”.
Sometimes instructions get hung up in one pipeline stage for multiple cycles.
When this happens the pipeline is said to have stalled.
When an instruction stalls it backs up all instructions coming
behind it in the execution.
When it eventually exits the stalled stage then the gap ( called a
bubble ) created by the stall remains in the pipeline until the
instruction is executed fully.
Pipeline bubbles reduce the overall instruction throughput for an application.
Many of the architectural features associated with modern
processors are designed to avoid pipeline stalls due to:
Resource conflicts – two instructions requiring the same resource
at the same time.
Conditional branching – unknown branch target address.
One such feature is OOE – out-of-order execution.
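A rough model of how stall cycles eat into throughput (a sketch with assumed numbers, not a model of any particular processor):

```python
# Total cycles to run n instructions through an ideal pipeline, plus stalls.
def total_cycles(n_instrs, stages=5, stall_cycles=0):
    # Ideal: n + stages - 1 cycles; each stall cycle adds one bubble.
    return n_instrs + stages - 1 + stall_cycles

print(total_cycles(100))                   # 104 cycles: IPC close to 1
print(total_cycles(100, stall_cycles=30))  # 134 cycles: IPC drops to ~0.75
```

Each bubble is a cycle in which a stage does no useful work, so throughput falls in direct proportion to the number of stall cycles.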
Pentium (P5) = 5 stages
Pentium Pro, II, III (P6) = 10 stages (1 cycle ex)
Pentium 4 (NetBurst) = 20 stages (decode not counted; instructions come pre-decoded from the trace cache)
From “Pentium 4 (Partially) Previewed,” Microprocessor Report, 8/28/00
Almost all modern processors are superscalar, i.e.
they allow more than one instruction to be completed per clock cycle.
Superscalar computing is achieved by having multiple functional units.
With the increase of transistors per die, more functional units can be
included, e.g. two ALUs working in parallel as in the Pentium.
Hence more than one scalar (integer) operation can be performed
per clock cycle, which is why the term superscalar was introduced.
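Combining pipelining with superscalar issue can be sketched with ideal cycle counts (illustrative numbers; real machines rarely sustain full width):

```python
import math

# Ideal cycle count for n instructions on a width-wide superscalar 5-stage pipeline.
def cycles(n_instrs, width=1, stages=5):
    # Issue ceil(n / width) groups of instructions, plus pipeline fill time.
    return math.ceil(n_instrs / width) + stages - 1

print(cycles(100, width=1))  # 104 cycles: scalar pipeline
print(cycles(100, width=2))  # 54 cycles: two functional units roughly halve the time
```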
Let's assume we add two additional crews to Stage 2,
each building different engines
To illustrate both pipelining and superscalar parallel
execution in action, consider the following sequence of
three SUV orders sent out to the empty factory floor,
right when the shop opens up:
1. Extinction Turbo
2. Extinction Turbo
3. Extinction LE
Now let's follow these three cars through the assembly
line during the first four hours of the day.
Hour 1: The line is empty when the first Turbo enters it
and the Stage 1 Crew kicks into action.
Hour 2: The first Turbo moves on to Stage 2a, while the
second Turbo enters the line.
Hour 3: Both of the Turbos are in the line being worked
on when the LE enters the line.
Hour 4: Now all three cars are in the assembly line at different
stages. Notice that there are actually three cars in various versions
and stages of "Stage 2," all at the same time.
Single stage ALU
Multi-stage FPU unit
Dual execution units per stage
How can we guarantee no dependencies between instructions in a
pipeline ( and reduce pipeline stalls or bubbles )?
One way is to interleave execution of instructions from different
program threads on same pipeline. This is called multithreading.
What Is a Thread ?
Is an independent flow of control
Operates within a process with other threads
[Diagram: a mono-threaded process (Process A, one thread) vs. a multi-threaded process (Process B, with Thread 1 and Thread 2).]
Threads vs. Processes
Threads exist within, and use, the resources of their process.
A thread maintains its own stack and registers, scheduling
properties, set of pending and blocked signals.
Secondary Threads vs. Initial Threads
An initial thread is created automatically when a process is created.
Secondary threads are peers.
To realize potential program performance gains:
On a uniprocessor, multi-threaded processes provide for concurrency.
On a multiprocessor system, a process with multiple threads
provides potential parallelism.
Benefits of multithreaded programming
Compared to the cost of creating and managing a process, a
thread can be created and managed with much less operating
system overhead.
All threads within a process share the same address space.
Inter-thread communication is more efficient than inter-process communication.
1. Can be context switched more easily: only the registers and
PC must be saved, not the memory-management state.
2. Can run on different processors concurrently in an
SMP (symmetric multiprocessor) system.
3. Share CPU in a uniprocessor
4. May (will) require concurrency control programming
like mutex locks.
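The mutex-lock concurrency control mentioned above can be illustrated with a minimal Python sketch (the variable names and counts are my own; `threading.Lock` plays the role of the mutex):

```python
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    # Both threads share the process's address space, so they see the same counter.
    global counter
    for _ in range(increments):
        with lock:        # mutual exclusion: only one thread updates counter at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 200000 -- without the lock, updates could be lost
```

The read-modify-write on `counter` is not atomic; the lock is what guarantees every one of the 200,000 increments survives.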
Single thread of execution per task; other tasks wait.
Threads belonging to each process are given a fixed time-slice to execute.
When a time-slice is up, its context is saved to memory.
When the thread or process gains a new time slice its context is
reloaded and it can continue execution from the exact point it was in
when it was flushed from the CPU.
This is called context-switching.
Context switching for a process is more expensive than for a thread.
So to improve performance, cut down on context switches, or at least
constrain them to lightweight threads.
A solution to this problem is Symmetric Multi-Processing
(SMP), i.e. have two processors attached to a global shared memory.
Two processes can then execute simultaneously on two different processors.
Twice as much execution capacity, but equally twice as many
empty issue and execution slots.
Empty issue slots
Twice the number of pipeline bubbles
A technique employed in high performance architectures
to reduce the amount of wasted resources is time-slice
multithreading or superthreading.
Processors that exploit this technique are known as
multithreaded processors; they can execute more than one
thread at a time.
Only the instructions
belonging to one thread can
be in a stage at one time.
Fewer wasted slots
Lack of Pipeline Bubbles
Still a waste of execution slots
(due to memory latency or data dependencies)
An improvement on superthreading is to remove the restriction that
only one thread can have access to a pipeline stage during a
given clock cycle.
This is called Simultaneous Multithreading or HyperThreading.
Fewer wasted execution slots
Mixed thread instructions
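The difference in slot utilization can be sketched with assumed numbers (a 4-wide core where each thread can supply only 2 ready instructions per cycle; these figures are illustrative, not from the slides):

```python
WIDTH = 4              # issue slots per cycle (assumed)
READY_PER_THREAD = 2   # ready instructions each thread can supply per cycle (assumed)

# Superthreading: each cycle is dedicated to one thread, so the
# remaining slots in that cycle are wasted.
superthreading_used = min(WIDTH, READY_PER_THREAD)

# SMT / hyperthreading: ready instructions from both threads may be
# mixed in the same cycle, filling the slots the single thread could not.
smt_used = min(WIDTH, 2 * READY_PER_THREAD)

print(superthreading_used, "of", WIDTH, "slots used")  # 2 of 4 slots used
print(smt_used, "of", WIDTH, "slots used")             # 4 of 4 slots used
```

In this scenario, neither thread alone has enough instruction-level parallelism to fill the machine, but together they can.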
Hyperthreading is similar to a two-way SMP system on a single chip.
Instead of dual physical processing units, a hyperthreaded processor
has dual logical processing units.
Threads are scheduled to execute on any of the logical processors.
The main advantages of hyperthreading are:
1. Increased flexibility to fill execution slots
2. The cost of adding hyperthreading logic to the die is small, e.g. ~5% of
the die surface for an Intel Xeon processor.
3. Fewer cache coherency problems than SMP, but an
increased chance of cache conflicts.