Superscalar Processors
by
Sherri Sparks
Overview
1. What are superscalar processors?
2. Program Representation, Dependencies, & Parallel
Execution
3. Micro architecture of a typical superscalar processor
4. A look at 3 superscalar implementations
5. Conclusion: The future of superscalar processing
What are superscalars and how do they
differ from pipelines?
In simple pipelining, at most one instruction can be fetched
into the pipeline per clock cycle. This creates a
performance bottleneck.
Superscalar processors overcome the 1 instruction per clock
cycle limit of simple pipelines and possess the ability to fetch
multiple instructions during the same clock cycle. They also
employ advanced techniques like “branch prediction” to ensure
an uninterrupted stream of instructions.
Development & History of Superscalars
Pipelining was developed in the late 1950s and
became popular in the 1960s.
Examples of early pipelined architectures are the CDC
6600 and the IBM 360/91 (which used Tomasulo’s algorithm)
Superscalars appeared in the mid to late 1980s
Instruction Processing Model
Need to maintain software compatibility.
The assembly instruction set was the level chosen to maintain
compatibility because it did not affect existing software.
Need to maintain at least a semblance of a “sequential execution
model” for programmers who rely on the concept of sequential
execution in software design.
A superscalar processor may execute instructions out of order at the
hardware level, but execution must *appear* sequential at the
programming level.
Superscalar Implementation
Instruction fetch strategies that simultaneously fetch multiple instructions often by using
branch prediction techniques.
Methods for determining data dependencies and keeping track of register values during
execution
Methods for issuing multiple instructions in parallel
Resources for parallel execution of many instructions including multiple pipelined functional
units and memory hierarchies capable of simultaneously servicing multiple memory
references.
Methods for communicating data values through memory through load and store
instructions.
Methods for committing the process state in correct order. This is to maintain the outward
appearance of sequential execution.
From Sequential to Parallel…
Parallel execution often results in instructions completing non sequentially.
Speculative execution means that some instructions may be executed when
they would not have been executed at all according to the sequential model (i.e.
incorrect branch prediction).
To maintain the outward appearance of sequential execution for the
programmer, storage cannot be updated immediately. The results must be held
in temporary status until the storage is updated. Meanwhile, these temporary
results must be usable by dependent instructions.
When it is determined that the sequential model would have executed an
instruction, the temporary results are made permanent by updating the outward
state of the machine. This process is called “committing” the instruction.
Dependencies
Parallel execution must respect 2 types of dependencies
Control dependencies, which arise from incrementing or updating the
program counter in response to conditional branch instructions.
Data dependencies, which arise because instructions
may need to read or write the same register or memory
locations.
Overcoming Control Dependencies Example
Block 1:
L2: move r3,r7
    lw r8,(r3)
    add r3,r3,4
    lw r9,(r3)
    ble r8,r9,L3
Block 2:
    move r3,r7
    sw r9,(r3)
    add r3,r3,4
    sw r8,(r3)
    add r5,r5,1
Block 3:
L3: add r6,r6,1
    add r7,r7,4
    blt r6,r4,L2
Blocks are initiated into the “window of execution”.
Control Dependencies & Branch Prediction
To gain the most parallelism, control dependencies due to conditional
branches have to be overcome.
Branch prediction attempts to overcome this by predicting the outcome
of a branch and speculatively fetching and executing instructions from
the predicted path.
If the predicted path is correct, the speculative status of the instructions
is removed and they affect the state of the machine like any other
instruction.
If the predicted path is wrong, then recovery actions are taken so as not
to incorrectly modify the state of the machine.
Data Dependencies
Data dependencies occur because instructions may access the same
register or memory location
3 Types of data dependencies or “hazards”
RAW (“read after write”) : occurs because a later instruction can only read a
value after a previous instruction has written it.
WAR (“write after read”) : occurs when an instruction needs to write a new
value into a storage location but must wait until all preceding instructions
needing to read the old value have done so.
WAW (“write after write”) : occurs when multiple instructions update the
same storage location; it must appear that these updates occur in the proper
sequence.
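Data Dependency Example
move r3,r7
lw r8,(r3)
add r3,r3,4
lw r9,(r3)
ble r8,r9,L3
RAW: lw r8,(r3) reads r3 after move r3,r7 writes it.
WAR: add r3,r3,4 writes r3 after lw r8,(r3) reads it.
WAW: add r3,r3,4 writes r3 after move r3,r7 writes it.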
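The three hazard types can be checked mechanically by comparing the registers each instruction reads and writes. The sketch below is illustrative only: the Instr record and the hazards function are hypothetical names, the read/write sets are entered by hand, and the check is purely pairwise (it ignores any intervening writes between the two instructions).

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical record: instruction text plus the registers it reads and writes."""
    text: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def hazards(earlier: Instr, later: Instr) -> List[Tuple[str, str]]:
    """Return (hazard type, register) pairs between an earlier and a later instruction."""
    found = []
    # RAW: the later instruction reads what the earlier one writes.
    for reg in sorted(earlier.writes & later.reads):
        found.append(("RAW", reg))
    # WAR: the later instruction overwrites what the earlier one reads.
    for reg in sorted(earlier.reads & later.writes):
        found.append(("WAR", reg))
    # WAW: both instructions write the same location.
    for reg in sorted(earlier.writes & later.writes):
        found.append(("WAW", reg))
    return found


# The first three instructions of the example, with read/write sets entered by hand.
move = Instr("move r3,r7", frozenset({"r7"}), frozenset({"r3"}))
lw8 = Instr("lw r8,(r3)", frozenset({"r3"}), frozenset({"r8"}))
add = Instr("add r3,r3,4", frozenset({"r3"}), frozenset({"r3"}))

print(hazards(move, lw8))  # RAW on r3
print(hazards(lw8, add))   # WAR on r3
print(hazards(move, add))  # RAW and WAW on r3
```

Note that the move/add pair also shows a RAW on r3, because add reads r3 as well as writing it.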
Parallel Execution Method
1. Instructions are fetched using branch prediction to form a dynamic stream of
instructions
2. Instructions are examined for dependencies, and artificial dependencies are removed
3. Examined instructions are dispatched to the “window of execution”. (These
instructions are no longer in sequential order, but are ordered according to
their data dependencies.)
4. Instructions are issued from the window in an order determined by their
dependencies and hardware resource availability.
5. Following execution, instructions are put back into their sequential program
order and then “committed” so their results update the machine state.
Superscalar Microarchitecture
Parallel Execution Method Summarized in 5 phases:
1. Instruction Fetch & Branch Prediction
2. Decode & Register Dependence Analysis
3. Issue & Execution
4. Memory Operation Analysis & Execution
5. Instruction Reorder & Commit
Superscalar Microarchitecture
Instruction Fetch & Branch Prediction
Fetch phase must fetch multiple instructions per cycle from cache memory to
keep a steady feed of instructions going to the other stages.
The number of instructions fetched per cycle should match or be greater than
the peak instruction decode & execution rate (to allow for cache misses or
occasions where the max # of instructions can’t be fetched)
For conditional branches, fetch mechanism must be redirected to fetch
instructions from branch targets.
4 steps to processing conditional branch instructions
1. Recognizing that an instruction is a conditional branch
2. Determining the branch outcome (taken or not taken)
3. Computing the branch target
4. Transferring control by redirecting instruction fetch (as in the case of a taken branch)
Processing Conditional Branches
STEP 1: Recognizing Conditional Branches
Instruction decode information is held in the
instruction cache. These extra bits are used to
identify the basic instruction types.
Processing Conditional Branches
STEP 2: Determining Branch Outcome
Static Predictions (information determined from static binary). Ex: Certain
opcode types might result in more branches taken than others or a backwards
branch direction might be more likely in loops.
Predictions based on profiling information (execution statistics collected during a
previous run of the program).
Dynamic Predictions (information gathered during program execution about
past history of branch outcomes). Branch history outcomes are stored in a
“branch history table” or a “branch prediction table”.
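A dynamic predictor can be sketched as a table of small counters indexed by the branch address. The slides do not say how each table entry is encoded; the 2-bit saturating counter used here is a common scheme chosen as an assumption, and the class name, the table size, and the indexing function are all hypothetical.

```python
class BranchHistoryTable:
    """Sketch of a dynamic predictor: one 2-bit saturating counter per entry (assumed scheme)."""

    def __init__(self, entries: int = 512):
        self.entries = entries
        # Counter values 0..3; 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * entries  # start at "weakly not taken"

    def _index(self, pc: int) -> int:
        # Assumes 4-byte instructions, so the low 2 bits of the address are dropped.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_correct(bht: BranchHistoryTable, pc: int, outcomes) -> int:
    """Predict each outcome, then train the table with the real result."""
    correct = 0
    for taken in outcomes:
        if bht.predict(pc) == taken:
            correct += 1
        bht.update(pc, taken)
    return correct


# A loop-closing branch: taken nine times, then falls through once.
loop = [True] * 9 + [False]
bht = BranchHistoryTable()
first_pass = count_correct(bht, 0x400, loop)
second_pass = count_correct(bht, 0x400, loop)
print(first_pass, second_pass)
```

With these assumptions the first pass through the loop mispredicts twice (warm-up and loop exit), and later passes mispredict only the exit.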
Processing Conditional Branches
STEP 3: Computing Branch Targets
Branch targets are usually relative to the program counter and are
computed as:
branch target = program counter + offset
Finding target addresses can be sped up by having a “branch target
buffer” which holds the target address used the last time the branch was
executed.
Example: Branch Target Address Cache used in the PowerPC 604
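The target computation and the buffer lookup can be sketched together. This is a minimal illustration: BranchTargetBuffer is a hypothetical name, the buffer is modeled as an unbounded dictionary rather than a fixed-size cache, and the offset is added directly to the branch address (real instruction sets often scale the offset or add it to the address of the following instruction).

```python
from typing import Dict, Optional


def branch_target(program_counter: int, offset: int) -> int:
    """branch target = program counter + offset, as stated in the slide."""
    return program_counter + offset


class BranchTargetBuffer:
    """Remembers the target used the last time each branch executed."""

    def __init__(self) -> None:
        self.targets: Dict[int, int] = {}

    def lookup(self, pc: int) -> Optional[int]:
        # None means the branch has not been seen, so the target must be computed.
        return self.targets.get(pc)

    def record(self, pc: int, target: int) -> None:
        self.targets[pc] = target


btb = BranchTargetBuffer()
pc, offset = 0x1000, -0x20           # a backward branch, as at the end of a loop
miss = btb.lookup(pc)                # first encounter: nothing buffered yet
target = branch_target(pc, offset)   # compute the target the slow way
btb.record(pc, target)
hit = btb.lookup(pc)                 # next encounter: target is available at fetch time
print(miss, hex(target), hex(hit))
```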
Processing Conditional Branches
STEP 4: Transferring Control
Problem: There is often a delay in recognizing a branch, modifying the
program counter and fetching the target instructions.
Several Solutions:
1. Use the stockpiled instructions in the instruction buffer to mask the delay
2. Use a buffer that contains instructions from both “taken” and “not taken” branch
paths
3. Delayed Branches: The branch does not take effect until the instruction after the
branch. This allows the fetch of target instructions to overlap execution of the
instruction following the branch. Delayed branches also introduce assumptions about
pipeline structure and are therefore rarely used anymore.
Instruction Decoding, Renaming, & Dispatch
Instructions are removed from the fetch buffers,
decoded and examined for control and data
dependencies.
Instructions are dispatched to buffers associated
with hardware functional units for later issuing and
execution.
Instruction Decoding
The decode phase sets up “execution tuples” for
each instruction.
An “execution tuple” contains:
An operation to be executed
The identities of storage elements where input operands
will eventually reside
The locations where an instruction’s result must be placed
Register Renaming
Used to eliminate WAW and WAR dependencies. (RAW dependencies are true data dependencies and cannot be removed by renaming.)
2 Types:
Mapping Table: The physical register file is larger than the logical register
file, and a mapping table is used to associate logical registers with physical
registers. Physical registers are assigned from a “free list”.
Reorder Buffer: The physical and logical register files are the same size. There
is also a “reorder buffer” that contains 1 entry per active instruction and
maintains the sequential ordering of instructions. It is a circular queue
implemented in hardware. As instructions are dispatched they enter the
queue at the tail. As instructions complete, their results are inserted into their
assigned locations in the reorder buffer. When an instruction reaches the
head of the queue, its entry is removed and its result is placed in the register
file.
Register Renaming I
Before: add r3,r3,4          After: add R2,R1,4

Mapping Table:               Mapping Table:
r0  R8                       r0  R8
r1  R7                       r1  R7
r2  R5                       r2  R5
r3  R1                       r3  R2
r4  R9                       r4  R9

Free List: R2 R6 R13         Free List: R6 R13
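The mapping-table scheme can be sketched in a few lines. This is a minimal illustration assuming the table contents shown in the example; the Renamer class and its method names are hypothetical, and returning the old physical register to the free list is not modeled.

```python
from collections import deque


class Renamer:
    """Sketch of renaming with a logical-to-physical mapping table and a free list."""

    def __init__(self, mapping, free_list):
        self.mapping = dict(mapping)       # logical register -> physical register
        self.free_list = deque(free_list)  # unassigned physical registers

    def rename(self, op, dest, sources):
        # Sources are translated first, using the mapping as it stands
        # before this instruction's destination is remapped.
        # Anything not in the table (such as the immediate 4) passes through.
        new_sources = [self.mapping.get(s, s) for s in sources]
        # The destination gets a fresh physical register from the free list.
        new_dest = self.free_list.popleft()
        self.mapping[dest] = new_dest
        return (op, new_dest, new_sources)


renamer = Renamer(
    mapping={"r0": "R8", "r1": "R7", "r2": "R5", "r3": "R1", "r4": "R9"},
    free_list=["R2", "R6", "R13"],
)
renamed = renamer.rename("add", "r3", ["r3", 4])
print(renamed)                   # the add now writes R2 and reads R1
print(renamer.mapping["r3"])     # r3 now maps to R2
print(list(renamer.free_list))   # R2 has left the free list
```

Because every write gets a new physical register, later writers of r3 no longer conflict with earlier readers or writers of r3.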
Register Renaming II
(using a reorder buffer)
Before: add r3,r3,4          After: add r3,rob6,4 (result assigned to rob8)

Mapping Table:               Mapping Table:
r0  r0                       r0  r0
r1  r1                       r1  r1
r2  r2                       r2  r2
r3  rob6                     r3  rob8
r4  r4                       r4  r4

Reorder Buffer (partial):    Reorder Buffer (partial):
entry:  7   6   ...  0       entry:  8   7   6   ...  0
dest:       r3               dest:   r3      r3
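The reorder buffer behaviour described earlier (enter at the tail, complete in any order, leave from the head) can be sketched as a queue. This is an illustration under assumptions: the ReorderBuffer class and its method names are hypothetical, a Python deque stands in for the hardware circular queue, and tags are simple increasing integers rather than buffer slot numbers.

```python
from collections import deque


class ReorderBuffer:
    """Sketch of a reorder buffer: one entry per active instruction, committed in order."""

    def __init__(self, size: int):
        self.size = size
        self.entries = deque()  # head is on the left, tail on the right
        self.next_tag = 0

    def dispatch(self, dest: str) -> int:
        """Allocate an entry at the tail and return its tag."""
        if len(self.entries) >= self.size:
            raise RuntimeError("reorder buffer full")
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})
        return tag

    def complete(self, tag: int, value: int) -> None:
        """Record a result; instructions may complete in any order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"] = value
                entry["done"] = True
                return
        raise KeyError(tag)

    def commit(self, register_file: dict) -> list:
        """Retire finished entries from the head only, preserving program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            register_file[entry["dest"]] = entry["value"]
            committed.append(entry["tag"])
        return committed


rob = ReorderBuffer(size=4)
registers = {}
first = rob.dispatch("r3")
second = rob.dispatch("r8")
rob.complete(second, 42)                 # the younger instruction finishes first
blocked = rob.commit(registers)          # nothing commits: the head is not done
rob.complete(first, 7)
in_order = rob.commit(registers)         # both commit, oldest first
print(blocked, in_order, registers)
```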
Instruction Issuing & Parallel Execution
Instruction issuing is defined as the run-time
checking for availability of data and resources.
Constraints on instruction issue:
Availability of physical resources like instruction units,
interconnect, and register file
Organization of buffers holding execution tuples
Single Queue Method
If there is no out of order issuing, operand availability can be
managed via reservation bits assigned to each register.
A register is reserved when an instruction modifying the register
issues.
The reservation is cleared when the instruction completes.
An instruction may issue if there are no reservations on its
operands.
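The reservation-bit scheme can be sketched as a set of reserved register names. This is a minimal illustration: RegisterScoreboard and its methods are hypothetical names, and treating the destination register as one of the checked operands (so that two writers to the same register do not overlap) is an assumption.

```python
class RegisterScoreboard:
    """Sketch of in-order issue using one reservation bit per register."""

    def __init__(self) -> None:
        self.reserved = set()  # registers with a pending write

    def can_issue(self, dest: str, sources) -> bool:
        # Assumption: the destination counts as an operand for this check.
        if dest in self.reserved:
            return False
        return not any(src in self.reserved for src in sources)

    def issue(self, dest: str, sources) -> bool:
        if not self.can_issue(dest, sources):
            return False
        self.reserved.add(dest)  # reserve the register being modified
        return True

    def complete(self, dest: str) -> None:
        self.reserved.discard(dest)  # clear the reservation


scoreboard = RegisterScoreboard()
issued_load = scoreboard.issue("r8", ["r3"])          # lw r8,(r3)
stalled = not scoreboard.can_issue("r9", ["r8"])      # a consumer of r8 must wait
scoreboard.complete("r8")
released = scoreboard.can_issue("r9", ["r8"])         # r8 is available again
print(issued_load, stalled, released)
```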
Multiple Queue Method
There are multiple queues organized according to
instruction type.
Instructions issue from individual queues in
sequential order.
Individual queues may issue out of order with
respect to one another.
Reservation Stations
Instructions issue out of order
Reservation stations hold information about
source operands for an operation.
When all operands are present, the instruction may issue.
Reservation stations may be partitioned according to instruction
type or pooled into a single large block.
Reservation station entry fields:
Operation | Source 1 | Data 1 | Valid 1 | Source 2 | Data 2 | Valid 2 | Destination
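A reservation station entry with these fields can be sketched as a small record that captures operand values as they are produced. This is illustrative only: the ReservationStation class, the capture method, and the use of register names as source tags are assumptions, not details given in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    """Sketch of one entry: an operation waiting for its two source operands."""
    operation: str
    source1: str              # tag naming the producer of operand 1
    source2: str              # tag naming the producer of operand 2
    destination: str
    data1: Optional[int] = None
    valid1: bool = False
    data2: Optional[int] = None
    valid2: bool = False

    def capture(self, tag: str, value: int) -> None:
        """Latch a broadcast result if this entry is waiting on that tag."""
        if not self.valid1 and self.source1 == tag:
            self.data1, self.valid1 = value, True
        if not self.valid2 and self.source2 == tag:
            self.data2, self.valid2 = value, True

    def ready(self) -> bool:
        """The instruction may issue once both operands are present."""
        return self.valid1 and self.valid2


station = ReservationStation("add", source1="R1", source2="R5", destination="R2")
before = station.ready()
station.capture("R1", 10)
halfway = station.ready()
station.capture("R5", 4)
after = station.ready()
print(before, halfway, after, station.data1, station.data2)
```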
Memory Operation Analysis & Execution
To reduce latency, memory hierarchies are used & may contain primary
and secondary caches.
Address translation to physical addresses is sped up by a
“translation lookaside buffer”, which caches the translations of recently
accessed pages.
“Multiported” memory hierarchy is used to allow multiple memory
requests to be serviced simultaneously. Multiporting is achieved by
having multiple memory banks or making multiple serial requests during
the same cycle.
“Store address buffers” are used to make sure memory operations don’t
violate hazard conditions. Store address buffers contain the addresses
of all pending store operations.
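The hazard check can be sketched as a lookup of a load address against the pending store addresses. This is a simplified illustration: StoreAddressBuffer and its methods are hypothetical names, addresses are compared exactly (no partial overlap or access size), and a conflicting load simply waits rather than having the store data forwarded to it.

```python
class StoreAddressBuffer:
    """Sketch: addresses of stores that have not yet written memory."""

    def __init__(self) -> None:
        self.pending = []  # kept in program order

    def add_store(self, address: int) -> None:
        self.pending.append(address)

    def retire_store(self, address: int) -> None:
        # Removes the oldest pending store to this address.
        self.pending.remove(address)

    def load_may_proceed(self, address: int) -> bool:
        # A load that matches a pending store would read stale data if it went ahead.
        return address not in self.pending


buffer = StoreAddressBuffer()
buffer.add_store(0x2000)
independent = buffer.load_may_proceed(0x3000)   # different address: no hazard
conflicting = buffer.load_may_proceed(0x2000)   # same address: must wait
buffer.retire_store(0x2000)
after_retire = buffer.load_may_proceed(0x2000)  # the store has written memory
print(independent, conflicting, after_retire)
```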
Memory Hazard Detection
Instruction Reorder & Commit
When an instruction is “committed”, its result is allowed to modify the logical
state of the machine.
The purpose of the commit phase is to maintain the illusion of a sequential
execution model.
2 methods
1. The state of the machine is saved in a history buffer. Instructions update the state of
the machine as they execute and when there is a problem, the state of the machine
can be recovered from the history buffer. The commit phase gets rid of the history state
that’s no longer needed.
2. The state of the machine is separated into a physical state and a logical state. The
physical state is updated immediately as instructions complete. The logical state is
updated in sequential order as the speculative status of instructions is cleared. The
speculative state is maintained in a reorder buffer and during the commit phase, the
result of an operation is moved from the reorder buffer to a logical register or memory.
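The first method (history buffer) can be sketched as an undo log of old register values. This is a minimal illustration under assumptions: HistoryBuffer and its methods are hypothetical names, instructions carry an increasing sequence number, register values are integers, and writes are logged in program order.

```python
class HistoryBuffer:
    """Sketch of method 1: update state eagerly, keep old values for recovery."""

    def __init__(self, register_file: dict):
        self.register_file = register_file
        self.log = []  # (sequence number, register, old value), oldest first

    def write(self, seq: int, reg: str, value: int) -> None:
        # Save the value being overwritten, then update the machine state at once.
        self.log.append((seq, reg, self.register_file.get(reg)))
        self.register_file[reg] = value

    def commit(self, seq: int) -> None:
        # History for instructions up to seq is no longer needed.
        self.log = [entry for entry in self.log if entry[0] > seq]

    def rollback(self, seq: int) -> None:
        # Undo every write made by instructions younger than seq, newest first.
        while self.log and self.log[-1][0] > seq:
            _, reg, old_value = self.log.pop()
            if old_value is None:
                self.register_file.pop(reg, None)  # register had no prior value
            else:
                self.register_file[reg] = old_value


registers = {"r3": 1, "r5": 0}
history = HistoryBuffer(registers)
history.write(1, "r3", 5)
history.write(2, "r5", 9)    # speculative, past a predicted branch
history.write(3, "r3", 8)    # speculative
history.commit(1)            # instruction 1 is known to be on the correct path
history.rollback(1)          # the branch was mispredicted: undo 2 and 3
print(registers)
```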
The Role of Software
Superscalars can be made more efficient if
parallelism in software can be increased.
1. By increasing the likelihood that a group of instructions
can be issued simultaneously
2. By decreasing the likelihood that an instruction has to
wait for the result of a previous instruction
A Look At 3 Superscalar Processors
1. MIPS R10000
2. DEC Alpha 21164
3. AMD K5
MIPS R10000
“Typical” superscalar processor
Able to fetch 4 instructions at a time
Uses predecode to generate bits to assist with branch prediction (512 entry
prediction table)
Resume cache is used to fetch “not taken” instructions and has space to handle
4 branch predictions at a time
Register renaming uses a physical register file 2x the size of the logical register
file. Physical registers are allocated from a free list
3 instruction queues – memory, integer, and floating point
5 functional units (an address adder, 2 integer ALUs, a floating point multiplier /
divider / square rooter, & floating point adder)
Supports on-chip primary data cache (32 KB, 2 way set associative) and an
off-chip secondary cache.
Uses reorder buffer mechanism to maintain machine state during exceptions.
Instructions are committed 4 at a time
Alpha 21164
Simple superscalar that forgoes the advantage of dynamic
scheduling in favor of a high clock rate
4 Instructions at a time are fetched from an 8K instruction cache
2 instruction buffers that issue instructions in program order
Branches are predicted using a history table associated with the
instruction cache
Uses the single queue method of instruction issuing
4 functional units (2 ALUs, a floating point adder, and a floating
point multiplier)
2 level cache memory (primary 8K cache & secondary 96K 3-way
set associative cache)
Sequential machine state is maintained during interrupts
because instructions are not issued out of order
The pipeline functions as a simple reorder buffer since
instructions in the pipeline are maintained in sequential order
Alpha 21164 Superscalar Organization
AMD-K5
Implements the complex Intel x86 instruction set
Uses 5 pre-decode bits for decoding variable length instructions
Instructions are fetched from the instruction cache at a rate of 16 bytes
/ cycle & placed in a 16 element queue.
Branch prediction is integrated with the instruction cache. There is 1
prediction entry per cache line.
Due to instruction set complexity, 2 cycles are required to decode
1. Instructions are converted to ROPs (simple RISC-like operations)
2. ROPs read operand data & are dispatched to functional unit reservation
stations
There are 6 functional units: 2 integer ALUs, 1 floating point unit, 2
load/ store units & a branch unit.
Up to 4 ROPs can be issued per clock cycle
Has an 8K data cache with 4 banks. Dual load/ stores are allowed to
different banks.
16 entry reorder buffer maintains machine state when there is an
exception and recovers from incorrect branch predictions
AMD K5 Superscalar Organization
The Future of Superscalar Processing
Superscalar design = performance gain
BUT increasing hardware parallelism may be a case of diminishing
returns.
1. There are limits to instruction level parallelism in programs that can be
exploited.
2. Simultaneously issuing more instructions increases complexity and
requires more cross checking. This will eventually affect the clock rate.
3. There is a widening gap between processor and memory performance
4. Many believe that the 8-way superscalar is the limit and that we will reach
this limit within 2 years.
Some believe VLIW will replace superscalars because it offers these advantages
1. Because software is responsible for creating the execution schedule, the
instruction window that can be examined for parallelism is
larger than what a superscalar can examine in hardware
2. Since there is no dependence checking by the processor, VLIW hardware
is simpler to implement and may allow a faster clock.
Reference
James E. Smith and Gurindar S. Sohi, “The Microarchitecture of
Superscalar Processors,” Proceedings of the IEEE, vol. 83, no. 12, December 1995.