PC Processor Microarchitecture by changcheng2



                      TABLE OF CONTENTS

   •   Introduction

   •   Building a Framework for Comparison

   •   What Does a Computer Really Do?

   •   The Memory Subsystem

   •   Exploiting ILP Through Pipelining

   •   Exploiting ILP Via Superscalar Processing

   •   Exploiting Data-Level Parallelism Via SIMD

   •   Where Should Designers Focus The Effort?

   •   A Closer Look At Branch Prediction

   •   Speculative, Out-of-Order Execution Gets a New Name

   •   Analyzing Some Real Microprocessors: P4

   •   Pentium 4's Cache Organization

   •   Pentium 4's Trace Cache

   •   The Execution Engine Runs Out Of Order

   •   AMD Athlon Microarchitecture

   •   AMD Athlon Scheduler, Data Access

   •   Centaur C3 Microarchitecture

   •   Overall Conclusions

   •   List of References


Isn't it interesting that new high-tech products seem so complicated, yet only a few years later we talk about how much
simpler the old stuff was? This is certainly true for microprocessors. As soon as we finally figure out all the new features
and feel comfortable giving advice to our family and friends, we're confronted with details about a brand-new processor
that promises to obsolete our expertise on the "old" generation. Gone are the simple and familiar diagrams of the past,
replaced by arcane drawings and cryptic buzzwords. For a PC technology enthusiast, this is like discovering a new world
to be explored and conquered. While many areas will seem strange and unusual, much of the landscape resembles places
we've traveled before. This article is meant to serve as a faithful companion for this journey, providing a guidebook of the
many wondrous new discoveries we're sure to encounter.

An Objective Tutorial and Analysis of PC Microarchitecture
The goal of this article is to give the reader some tools for understanding the internals of modern PC microprocessors. In
the article "PC Motherboard Technology", we developed some tools for analyzing a modern PC motherboard. This article
takes us one step deeper, zooming into the complex world inside the PC processor itself. The internal design of a
processor is called the "microarchitecture". Each CPU vendor uses slightly different techniques for getting the most out of
their design, while meeting their unique performance, power, and cost goals. The marketing departments from these
companies will often highlight microarchitectural features when promoting their newest CPUs, but it's often difficult for us
PC technology enthusiasts to figure out what it really means.

What is needed is an objective comparison of the design features for all the CPU vendors, and that's the goal of this article.
We'll walk through the features of the latest x86 32-bit desktop CPUs from Intel, AMD, and VIA (Centaur). Since the
Transmeta "Crusoe" processor is mostly targeted at the mobile market, we'll analyze its microarchitecture in another
article. It will also be the task for another article to thoroughly explore Apple's PowerPC G4 microprocessor, and many of
the analytical tools learned here will apply to all high-end processors.
Building a Framework for Comparison

Before we can dive right into the block diagram of a modern CPU, we need to develop some analytical tools for
understanding how these features affect the operation of the PC system. We also need to develop a common framework
for comparison. As you'll soon see, that is no easy task. There are some radical differences in architecture between these
vendors, and it's difficult to make direct comparisons. As it turns out, the best way to understand and compare these new
CPUs is to go back to basic computer architectural concepts and show how each vendor has solved the common problems
faced in modern computer design. In our last section, we'll gaze into the future of PC microarchitecture and make a few
predictions.

Let's Not Lose Sight of What Really Matters
There is one issue that should be stressed right up front. We should never lose sight of the real objective in computer
design. All that really matters is how well the CPU helps the PC run your software. A PC is a computer system, and subtle
differences in CPU microarchitecture may not be noticeable when you're running your favorite computer program. We
learned this in our article on motherboard technology, since a well-balanced PC needs to remove all the bottlenecks (and
meet the cost goals of the user). The CPU designers are turning to more and more elaborate techniques to squeeze extra
performance out of these machines, so it's still really interesting to peek in on the raging battle for even a few percent
better system performance.
For a PC technology enthusiast, it's just downright fascinating how these CPU architects mix clever engineering tricks with
brute-force design techniques to take advantage of the enormous number of transistors available on the latest chips.
What Does a Computer Really Do?

It's easy to get buried too deeply in the complexities of these modern machines, but to really understand the design
choices, let's think again about the fundamental operation of a computer. A computer is nothing more than a machine that
reads a coded instruction, decodes the instruction, and executes it. If the instruction needs to load or store some data, the
computer figures out the location for the data and moves it. That's it; that's all a computer does. We can break this
operation into a series of stages:

The 5 Computer Operation Stages
   Stage 1    Instruction Access (IA)
   Stage 2    Instruction Decode (ID)
   Stage 3    Execution (EX)
   Stage 4    Data Access (DA)
   Stage 5    Store (write back) Results (WB)

Some computer architects may re-arrange, combine, or break up the stages, but every computer microarchitecture does
these five things. We can use this framework to build on as we work our way up to even the most complicated CPUs.
For those of you who eat this stuff for breakfast and are anxious to jump ahead, remember that we haven't yet talked about
pipelines. These stages could all be completely processed for a single instruction before starting the next one. If you think
about that idea for a moment, you'll realize that almost all the complexity comes when we start improving on that limitation.
Don't worry; the discussion will quickly ramp up in complexity, and some readers might appreciate a quick refresher. Let's
see what happens in each of these stages:
Instruction Access
A coded instruction is read from the memory subsystem at an address that is determined by a program counter (PC). In
our analysis, we'll treat memory as something that hangs off to the side of our CPU "execution core", as we show in the
figure below. Some architects like to view memory and the system bus as an integral part of the microarchitecture, and
we'll show how the memory subsystem interacts with the rest of the machine.

Instruction Decode
The coded instruction is converted into control information for the logic circuits of the machine. Each "operation code
(Opcode)" represents a different instruction and causes the machine to behave in different ways. Embedded in the Opcode
(or stored in later bytes of the instruction) can be address information or "immediate" data to be processed. The address
information can represent a new address that might need to be loaded into the PC (a branch address) or the address can
represent a memory location for data (loads and stores). If the instruction needs data from a register, it is usually brought
in during this stage.
Execution
This is the stage where the machine does whatever operation was directed by the instruction. This could be a math
operation (multiply, add, etc.) or it might be a data movement operation. If the instruction deals with data in memory, the
processor must calculate an "Effective Address (EA)". This is the actual location of the data in the memory subsystem
(ignoring virtual memory issues for now), based on calculating address offsets or resolving indirect memory references (A
simple example of indirection would be registers that house an address, rather than data).
Data Access
In this stage, instructions that need data from memory will present the Effective Address to the memory subsystem and
receive back the data. If the instruction was a store, then the data will be saved in memory. Our simple model for
comparison gets a bit frayed in this stage, and we'll explain in a moment what we mean.
Write Back
Once the processor has executed the instruction, perhaps having been forced to wait for a data load to complete, any new
data is written back to the destination register (if the instruction type requires it).
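
To make the five stages concrete, here is a minimal Python sketch of a non-pipelined load/store machine. The tuple-based
instruction format, the register numbering, and the run() helper are all invented for illustration and do not
correspond to any real instruction set.

```python
# Toy, non-pipelined machine: each instruction finishes all five stages
# before the next one starts. The instruction format (opcode, dest, a, b)
# is a made-up illustration, not a real ISA.

def run(program, data_memory, num_regs=4):
    regs = [0] * num_regs
    pc = 0                                    # program counter
    while pc < len(program):
        inst = program[pc]                    # Stage 1: Instruction Access
        op, dest, a, b = inst                 # Stage 2: Instruction Decode
        pc += 1
        result = None
        if op == "add":                       # Stage 3: Execution
            result = regs[a] + regs[b]
        elif op in ("load", "store"):
            ea = regs[a] + b                  # Effective Address = base + offset
        else:
            raise ValueError(f"unknown opcode {op!r}")
        if op == "load":                      # Stage 4: Data Access
            result = data_memory[ea]
        elif op == "store":
            data_memory[ea] = regs[dest]
        if result is not None:                # Stage 5: Write Back
            regs[dest] = result
    return regs

memory = {100: 7, 104: 35, 108: 0}
program = [
    ("load", 1, 0, 100),    # r1 = mem[r0 + 100]
    ("load", 2, 0, 104),    # r2 = mem[r0 + 104]
    ("add", 3, 1, 2),       # r3 = r1 + r2
    ("store", 3, 0, 108),   # mem[r0 + 108] = r3
]
final_regs = run(program, memory)
```

Note that only the loads and the add reach the write-back step with a result, while the store finishes its work in the
data access step.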
Was There a Question From the Back of the Room?
Some of the x86 experts in the audience are going to point out the numerous special cases for the way a processor must
deal with an instruction set designed in the 1970s. Our five-stage model isn't so simple when it must deal with all the
addressing modes of an x86. A big issue is the fact that the x86 is what is called a "register-memory" architecture where
even ALU (Arithmetic Logic Unit) instructions can access memory. This is contrasted with RISC (Reduced Instruction Set
Computing) architectures that only allow Load and Store instructions to move data (register-register or more commonly
called Load/Store architectures).
The reason we can focus on the Load/Store architecture to describe what happens in each stage of a computer is that
modern x86 processors translate their native CISC (Complex Instruction Set Computing) instructions into RISC
instructions (with some exceptions). By translating the instructions, most of the special cases are turned into extra RISC
instructions and can be more efficiently processed. RISC instructions are much easier for the hardware to optimize and run
at higher clock rates. This internal translation to RISC is one of the ways that x86 processors were able to deal with the
threat that higher-performance RISC chips would take over the desktop in the early 1990s. We'll talk about instruction
translation more when we dig into the details of some specific processors, at which point we'll also show several ways in
which our model is dramatically modified.
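
As a rough illustration of the translation idea, the sketch below splits a register-memory instruction into
load/store-style operations. The mnemonics, the bracket notation for memory operands, and the temporary register names
(tmp0, tmp1) are assumptions made for this example. Real processors use their own internal operation formats.

```python
# Hypothetical translation of one register-memory instruction into simpler
# load/store-style operations. Mnemonics and the temporary register names
# "tmp0" and "tmp1" are invented for this sketch.

def translate(op, dest, src):
    """Split an ALU instruction with a memory operand into simpler ops.

    Operands are strings; a memory operand is written as "[address]".
    """
    def is_mem(operand):
        return operand.startswith("[") and operand.endswith("]")

    ops = []
    if is_mem(src):                       # e.g. add eax, [x]
        ops.append(("load", "tmp0", src))
        src = "tmp0"
    if is_mem(dest):                      # e.g. add [x], eax
        ops.append(("load", "tmp1", dest))
        ops.append((op, "tmp1", src))
        ops.append(("store", dest, "tmp1"))
    else:
        ops.append((op, dest, src))
    return ops

reg_form = translate("add", "eax", "ebx")          # stays a single operation
mem_form = translate("add", "[counter]", "eax")    # becomes load, add, store
```

The register-only form passes through unchanged, while the memory-destination form expands into three operations that
each fit the five-stage load/store model.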
To the questioner in the back of the room, there will be several things we're going to have to gloss over (and simplify) in
order to keep this article from getting as long as a computer textbook. If you really want to dig into details, check out the list
of references at the end of this article.

The Memory Subsystem

The memory subsystem plays a big part in the microarchitecture of a CPU. Notice that both the Instruction Access stage
and the Data Access stage of our simple processor must get to memory. This memory can be split into separate sections
for instructions and data, allowing each stage to have a dedicated (hence faster) port to memory.
This is called a "Harvard Architecture", a term from work at Harvard University in the 1940s that has been extended to also
refer to architectures with separate instruction and data caches--even though main memory (and sometimes L2 cache) is
"unified". For some background on cache design, you can refer to the memory hierarchy discussion in the article, "PC
Motherboard Technology". That article also covers the system bus interface, an important part of the PC CPU design that
is tailored to support the internal microarchitecture.
Virtual Memory: Making Life Easier for the Programmer and Tougher for the Hardware Designer
To make life simpler for the programmer, most addresses are "virtual addresses" that allow the software designer to
pretend to have a large, linear block of memory. These virtual addresses are translated into "physical addresses" that refer
to the actual addresses of the memory in the computer. In almost all x86 chips, the caches contain memory data that is
addressed with physical addresses. Before the cache is accessed, any virtual addresses are translated in a "Translation
Look-aside Buffer (TLB)". A TLB is like a cache of recently-used virtual address blocks (pages), responding back with the
physical address page that corresponds to the virtual address presented by the CPU core. If the virtual address isn't in one
of the pages stored by the TLB (a TLB miss), then the TLB must be updated from a bigger table stored in main memory--a
huge performance hit (especially if the page isn't in main memory and must be loaded from disk). Some CPUs have
multiple levels of TLBs, similar to the notion of cache memory hierarchy. The size and structure of the TLBs and caches
will be important during our CPU comparisons later, but we'll focus mainly on the CPU core for our analysis.
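
The TLB lookup described above can be sketched as a small cache sitting in front of a larger page table. The 4 KB page
size, the four-entry capacity, and the least-recently-used replacement policy are assumptions chosen for this example,
not details of any particular CPU.

```python
# Minimal TLB model: a small cache of virtual-page -> physical-page
# mappings in front of a larger page table. The 4 KB page size, the
# capacity, and the LRU replacement policy are assumptions for this sketch.
from collections import OrderedDict

PAGE_SIZE = 4096

class TLB:
    def __init__(self, page_table, capacity=4):
        self.page_table = page_table      # full table, "in main memory"
        self.capacity = capacity
        self.entries = OrderedDict()      # recently used translations
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        vpage, offset = divmod(virtual_address, PAGE_SIZE)
        if vpage in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpage)       # mark as most recent
        else:
            self.misses += 1                      # slow path: consult the table
            if vpage not in self.page_table:
                raise KeyError(f"page fault on virtual page {vpage}")
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[vpage] = self.page_table[vpage]
        return self.entries[vpage] * PAGE_SIZE + offset

tlb = TLB(page_table={0: 9, 1: 3, 2: 7})
first = tlb.translate(0x1234)    # virtual page 1, offset 0x234: TLB miss
second = tlb.translate(0x1238)   # same page again: TLB hit
```

Only the page number is translated. The offset within the page passes through unchanged, which is why one TLB entry
covers every address in its page.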

Exploiting ILP Through Pipelining

Instead of waiting until an instruction has completed all five stages of our model machine, we could start a new instruction
as soon as the first instruction has cleared stage 1. Notice that we can now have five instructions progressing through our
"pipeline" at the same time. Essentially, we're processing five instructions in parallel, referred to as "Instruction-Level
Parallelism (ILP)". If it took five clock cycles to completely execute an instruction before we pipelined the machine, we're
now able to execute a new instruction every single clock. We made our computer five times faster, just with this "simple"
change.

Let's Just Think About This a Minute
We'll use a bunch of computer engineering terms in a moment, since we've got to keep that person in the back of the room
happy. Before doing that, take a step back and think about what we did to the machine. (Even experienced engineers forget
to do that sometimes.) Suddenly, memory fetches have to occur five times faster than before. This implies that the memory
system and cache must now run five times as fast, even though each instruction still takes five cycles to completely execute.
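
The arithmetic behind the "five times faster" claim can be written out as a short sketch. It assumes an ideal pipeline
with no stalls and equal-length stages. The function names are made up for this example.

```python
def cycles_unpipelined(n_instructions, n_stages=5):
    # Every instruction occupies the whole machine for n_stages cycles.
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages=5):
    # The first instruction needs n_stages cycles to fill the pipe; after
    # that, one instruction completes per cycle (assuming no stalls).
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

n = 1000
speedup = cycles_unpipelined(n) / cycles_pipelined(n)   # approaches 5 as n grows
```

A single instruction still takes five cycles from start to finish. The gain is in throughput, and it only approaches
the full factor of five for long instruction streams.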

We've also made a huge assumption that each stage was taking exactly the same amount of time, since that's the rule that
our pipeline clock is enforcing. What about the assumption that the processor was even going to run the next four
instructions in that order? We (usually) won't even know until the execute stage whether we need to branch to some other
instruction address. Hey, what would happen if the sequence of instructions called for the processor to load some data
from memory and then try to perform a math operation using that data in the next instruction? The math operation would
likely be delayed, due to memory latency slowing down the process.
They're Called Pipeline Hazards
What we're describing are called "pipeline hazards", and their effects can get really ugly. There are three types of hazards
that can cause our pipeline to come to a screeching halt--or cause nasty errors if we don't put in extra hardware to detect
them. The first hazard is a "data hazard", such as the problem of trying to use data before it's available (a "data
dependency"). Another type is a "control hazard" where the pipeline contains instructions that come after a branch. A
"structural hazard" is caused by resource conflicts where an instruction sequence can cause multiple instructions to need
the same processor resource during a given clock cycle. We'd have a structural hazard if we tried to use the same memory
port for both instructions and data.
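
As a rough sketch of what hazard-detection logic has to look for, the code below scans a short instruction sequence for
data and control hazards. The instruction format and the two-instruction look-ahead window are assumptions made for this
example, and structural hazards are not modeled.

```python
# Each instruction is (opcode, destination register, source registers).
# The format is invented for this sketch.

def find_hazards(program, window=2):
    """Report hazards between each instruction and the next `window` ones."""
    hazards = []
    for i, (op, dest, _) in enumerate(program):
        for j in range(i + 1, min(i + 1 + window, len(program))):
            later_sources = program[j][2]
            if dest is not None and dest in later_sources:
                hazards.append((i, j, "data"))       # result needed too soon
            if op == "branch":
                hazards.append((i, j, "control"))    # fetched past a branch
    return hazards

program = [
    ("load",   "r1", ("r0",)),
    ("add",    "r2", ("r1", "r1")),   # needs r1 from the load
    ("branch", None, ("r2",)),        # needs r2 from the add
    ("sub",    "r3", ("r0", "r0")),   # fetched before the branch resolves
]
hazards = find_hazards(program)
```

Each reported tuple names the earlier instruction, the later instruction, and the kind of hazard between them.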
Modern Pipelines Can Have a Lot of Stall Cycles
There are ways to reduce the chances of a pipeline hazard occurring, and we'll discuss some of the ways that CPU
architects deal with the various cases. In a practical sense, there will always be some hazards that will cause the pipeline
to stall. One way to describe the situation is to say that an instruction will "block" part of the pipe (something modern
implementations help minimize). When the pipe stalls, every (blocked) instruction behind the stalled stage will have to wait,
while the instructions fetched earlier can continue on their way. This opens up a gap (a "pipeline bubble") between blocked
instructions and the instructions proceeding down the pipeline in front of the blocked instructions.
When the blocked instruction restarts, the bubble will continue down the pipeline. For some hazards, like the control
hazard caused by a (mispredicted) branch instruction, the following instructions in the pipeline need to be killed, since they
aren't supposed to execute. If the branch target address isn't in the instruction cache, the pipeline can stall for a large
number of clock cycles. The stall would be extended by the latency of accesses to the L2 cache or, worse, accesses to
main memory. Stalls due to branches are a serious problem, and this is one of the two major areas where designers have
focused their energy (and transistor budget). The other major area, not surprisingly, is when the pipeline goes to memory
to load data. Most of our analysis will focus in on these 2 latency-induced problems.
Design Tricks To Reduce Data Hazards
For some data hazards, one commonly-used solution is to forward result data from a completed instruction straight to
another instruction yet to execute in the pipeline (data "forwarding", though sometimes called "bypassing"). This is much
faster than writing out the data and forcing the other instruction to read it back in. Our case of a math operation needing
data from a previous memory load instruction would seem to be a good candidate for this technique. The data loaded from
memory into a register can also be forwarded straight to the ALU execute stage, instead of going all the way through the register
write-back stage. An instruction in the write-back stage could forward data straight to an instruction in the execute stage.
Why wait 2 cycles? Why not forward straight from the data access stage? In reality, the data load stage is far from
instantaneous and suffers from the same memory latency risk as instruction fetches. The figure below shows how this can
occur. What if the data is not in the cache? There would be a huge pipeline bubble. As it turns out, data access is even
more challenging than an instruction fetch, since we don't know the memory address until we've calculated the Effective
Address. While instructions are usually accessed sequentially, allowing several cache lines to be prefetched from the
instruction cache (and main memory) into a fast local buffer near the execution core, data accesses don't always have
such nice "locality of reference".
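
The benefit of forwarding can be estimated with a small sketch of the five-stage model. It assumes one instruction
enters the pipeline per cycle, registers are read in the decode stage and written in the write-back stage, a value
produced in one cycle is usable in the next, and every load hits in the cache. The function name and parameters are
invented for this example.

```python
# Stage index within the five-stage model (0-based).
IA, ID, EX, DA, WB = range(5)

def stalls_needed(producer_op, distance, forwarding):
    """Bubbles needed before a dependent instruction `distance` slots later.

    Assumes one instruction issues per cycle, registers are read in ID and
    written in WB, and a value produced in one cycle is usable the next.
    """
    if forwarding:
        ready_stage = DA if producer_op == "load" else EX
        consume_stage = EX        # value is fed straight to the ALU input
    else:
        ready_stage = WB          # must reach the register file first
        consume_stage = ID        # and be read from it during decode
    # The consumer reaches consume_stage at cycle distance + consume_stage;
    # that must be strictly after the producer's ready_stage cycle.
    return max(0, ready_stage + 1 - (distance + consume_stage))

no_fwd = stalls_needed("add", distance=1, forwarding=False)
alu_fwd = stalls_needed("add", distance=1, forwarding=True)
load_fwd = stalls_needed("load", distance=1, forwarding=True)
```

Under these assumptions, forwarding removes the bubbles between back-to-back ALU operations, but a load followed
immediately by a dependent instruction still costs a bubble, and far more if the load misses the cache.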

The Limits of Pipelining
If five stages made us run up to five times faster, why not chop up the work into a bunch more stages? Who cares about
pipeline hazards when it gives the marketing folks some really high peak performance numbers to brag about? Well, every
x86 processor we'll analyze has a lot more than five stages. Originally called "super-pipelining" until Intel (for no obvious
reason) decided to rename it "hyper-pipelining" in their Pentium 4 design, this technique breaks up various processing
stages into multiple clock cycles.
This also has the architectural benefit of giving better granularity to operations, so there should be fewer cases where a
fast operation waits around while slow operations throttle the clock rate. With some of the clever design techniques we'll
examine, the pipeline hazards can be managed, and clock rates can be cranked into the stratosphere. The real limit isn't
an architectural issue, but is related to the way digital circuits clock data between pipeline stages.
To pipeline an operation, each new stage of the pipeline must store information passed to it from a prior stage, since each
stage will (usually) contain information for a different instruction. This staged data is held in a storage device (usually a
"latch"). As you chop up a task into smaller and smaller pipeline stages, the overhead time it takes to clock data into the
latch ("set-up and hold" times and allowance for clock "skew" between circuits) becomes a significant percentage of the
entire clock period. At some point, there is no time left in the clock cycle to do any real work. There are some exotic circuit
tricks that can help, but it would burn a lot of power - not a good trade-off for chips that already exceed 70 watts in some
cases.
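
The effect of latch overhead can be shown with a simple model in which the clock period is an equal share of the logic
delay plus a fixed per-stage overhead. The model and the numbers used in the usage lines (10 ns of logic, 0.2 ns of
overhead) are illustrative assumptions, not measurements of any real chip.

```python
def clock_period_ns(total_logic_ns, n_stages, latch_overhead_ns):
    # Each stage does an equal share of the logic plus a fixed latch cost
    # (set-up and hold times, allowance for clock skew).
    return total_logic_ns / n_stages + latch_overhead_ns

def clock_speedup(total_logic_ns, n_stages, latch_overhead_ns):
    # Clock-rate gain relative to doing all the logic in a single stage.
    base = clock_period_ns(total_logic_ns, 1, latch_overhead_ns)
    return base / clock_period_ns(total_logic_ns, n_stages, latch_overhead_ns)

def overhead_fraction(total_logic_ns, n_stages, latch_overhead_ns):
    # Share of each clock period spent on latching rather than real work.
    return latch_overhead_ns / clock_period_ns(total_logic_ns, n_stages, latch_overhead_ns)

five_stage = clock_speedup(10.0, 5, 0.2)
hundred_stage = clock_speedup(10.0, 100, 0.2)
wasted = overhead_fraction(10.0, 100, 0.2)
```

In this model the clock speedup can never exceed the single-stage period divided by the latch overhead, no matter how
many stages are added, and the fraction of each cycle lost to latching keeps growing.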
Exploiting ILP Via Superscalar Processing

While our simple machine doesn't have any serious structural hazards, that's only because it is a "single-issue"
architecture. Only a single instruction can be executed during a clock cycle. In a "superscalar" architecture, extra compute
resources are added to achieve another dimension of instruction-level parallelism. The original Pentium provided 2
separate pipelines that Intel called the U and V pipelines. In theory, each pipeline could be working simultaneously on 2
different sets of instructions.

With a multi-issue processor (where multiple instructions can be dispatched each clock cycle to multiple pipelines in the
single processor), we can have even more data hazards, since an operation in one pipeline could depend on data that is in
another pipeline. The control hazards can get worse, since our "instruction fetch bandwidth" rises (doubled in a 2-issue
machine, for example). A (mispredicted) branch instruction could cause both pipelines to need instructions flushed.
Issue Restrictions Limit How Often Parallelism Can Be Achieved
In practice, a superscalar machine has lots of "issue restrictions" that limit what each
pipeline is capable of processing. This structural hazard limited how often both the U
and V pipe of the Pentium could simultaneously execute 2 instructions. The limitations
are caused by the cost of duplicating all the hardware for each pipeline, so the
designers focus instead on exploiting parallelism in as many cases as practical.
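
The effect of issue restrictions can be sketched with a toy two-pipe, in-order machine. The pairing rules below are
invented for illustration and are not the actual Pentium pairing rules. The instruction format and function names are
also assumptions.

```python
# Hypothetical pairing rules for a two-pipe in-order machine, loosely in
# the spirit of the U and V pipes described above. These rules are
# invented for illustration and are NOT the real Pentium pairing rules.
SIMPLE_OPS = {"add", "sub", "mov", "and", "or"}

def can_pair(first, second):
    """first/second are (opcode, dest, sources) tuples."""
    op1, dest1, _ = first
    op2, dest2, sources2 = second
    if op1 not in SIMPLE_OPS or op2 not in SIMPLE_OPS:
        return False                    # complex ops need a whole pipe
    if dest1 in sources2:
        return False                    # second needs the first's result
    if dest1 == dest2:
        return False                    # both write the same register
    return True

def count_issue_cycles(program):
    cycles, i = 0, 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            i += 2                      # both pipes busy this cycle
        else:
            i += 1                      # second pipe sits idle
        cycles += 1
    return cycles

program = [
    ("mov", "r1", ("r0",)),
    ("mov", "r2", ("r0",)),             # pairs with the first mov
    ("add", "r3", ("r1", "r2")),
    ("sub", "r4", ("r3", "r1")),        # depends on the add: cannot pair
    ("mul", "r5", ("r4", "r4")),        # not a "simple" op: issues alone
]
issue_cycles = count_issue_cycles(program)
```

Even with two pipes, this five-instruction sequence needs four issue cycles rather than the ideal three, because only
the first two instructions satisfy the pairing rules.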

Combining Superscalar with Super-Pipelining to Get the Best of Both
Another approach to superscalar is to duplicate portions of the pipeline. This becomes much easier in the new
architectures that don't require instructions to proceed at the same rate through the pipeline (or even in the original
program order). An obvious stage for exploiting superscalar design techniques is the execute stage, since PCs process
three different types of data. There are integer operations, floating-point operations and now "media" operations. We know
all about integer and floating-point. A media instruction processes graphics, sound or video data (as well as
communications data). The instruction sets now include MMX, 3DNow!, Enhanced 3DNow!, SSE, and SSE2 media
instructions. The execute stage could attempt to simultaneously process all three types of instructions, as long as there is
enough hardware to avoid structural hazards.
In practice, there are several structural hazards that require issue restrictions. Each new execution resource could also
have its own pipeline. Many floating-point instructions and media instructions require multiple clocks and aren't fully
pipelined in some implementations. We'll clear up any confusion when we analyze some real processors later. For now, it's
only important to understand the fundamentals of superscalar design and realize that modern architectures include
combinations of multiple pipelines running simultaneously.
Exploiting Data-Level Parallelism Via SIMD

We'll talk more about this later, but the new focus on media instructions has allowed CPU designers to recognize the
inherent parallelism in the way data is processed. The same operation is often performed on independent data sets, such
as multiplying data stored in a vector or a matrix. A single instruction is repeated over and over for multiple pieces of data.
We can design special hardware to do this more efficiently, and we call this a "Single Instruction Multiple Data (SIMD)"
computing model.
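
The SIMD idea can be sketched by treating one 64-bit integer as four independent 16-bit lanes, similar in spirit to a
packed-add media instruction. The lane layout and the helper names are assumptions for this example. Real hardware
processes the lanes in parallel, whereas this Python loop only models the result.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1

def pack(values):
    """Pack four 16-bit values into one 64-bit integer, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word

def unpack(word):
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]

def packed_add(a, b):
    """Add each lane independently; carries never cross lane boundaries."""
    result = 0
    for i in range(LANES):
        shift = i * LANE_BITS
        lane_sum = ((a >> shift) & LANE_MASK) + ((b >> shift) & LANE_MASK)
        result |= (lane_sum & LANE_MASK) << shift   # wrap around on overflow
    return result

total = unpack(packed_add(pack([1, 2, 3, 65535]), pack([10, 20, 30, 1])))
```

One call to packed_add stands in for a single instruction that produces four results at once. The last lane wraps
around to zero instead of carrying into a neighbor, which is what keeps the lanes independent.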

More Pressure on the Memory System
Once again, take a step back and think about the implications before that person in the back of the room gets us to dive
into implementation details. With some intuitive analysis, we can observe that we've once again put tremendous pressure
on our memory subsystem. A single instruction coming down our pipeline(s) could force multiple data load and store
operations. Thinking a bit further about the nature of media processing, some of the streaming media types (like video)
have critical timing constraints, and the streams can last for a long time (i.e. as a viewer of video, you expect a continuous
flow of the video stream over time, preferably without choppiness or interruptions). Our data caches may not do us much
good, since the data may only get processed once before the next chunk of data wants to replace it (data caches are most
effective when the same data is accessed over and over). Thus the CPU architects have some new challenges to solve.

Where Should Designers Focus The Effort?

By now, you've likely come to realize that every CPU vendor is trying to solve similar problems. They're all trying to take a
1970s instruction set and do as much parallel processing as possible, but they're forced to deal with the limitations of both
the instruction set and the nature of memory systems. There is a practical limit to how many instructions can be processed
in parallel, and it gets more and more difficult for the hardware to "dynamically" schedule instructions around any possible
instruction blockage. The compilers are getting better at "statically" scheduling, based on the limited information available
at compile time. However, the hardware is being pushed to the limits in an attempt to look as far ahead in the instruction
stream as possible in the search for non-blocking instructions.

It's All About Memory Latency
As we've shown, there are 2 stages of our computer model where the designers can get the most return on their efforts.
These are Instruction Fetch and Data Access, and both can cause an enormous performance loss if not handled properly.
The problem is caused by the fact that our pipelines are now running at over one GHz, and it can take over 100 pipeline
cycles to get something from main memory. The key to solving the problem is to make sure that the required instructions or
data aren't sitting in main memory when you need them, but instead, are already in a buffer inside your pipeline--or at least
sitting in an upper level of your cache hierarchy.
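
The cost of going to main memory can be estimated with the usual average-access-time arithmetic. In the sketch below,
only the 100-cycle main memory figure comes from the text above. The hit times and miss rates are illustrative
assumptions.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    # Every access pays the hit time; a fraction also pays the miss penalty.
    return hit_cycles + miss_rate * miss_penalty_cycles

def two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, memory_cycles):
    # An L1 miss costs whatever an average L2 access costs.
    l2_average = average_access_cycles(l2_hit, l2_miss_rate, memory_cycles)
    return average_access_cycles(l1_hit, l1_miss_rate, l2_average)

# Assumed figures: 2-cycle L1, 10-cycle L2, 100-cycle main memory.
with_caches = two_level(2, 0.05, 10, 0.20, 100)
always_memory = average_access_cycles(100, 0.0, 0)
```

With these assumed miss rates the average access costs only a few cycles, compared with 100 cycles if every access
went to main memory, which is why keeping instructions and data high in the cache hierarchy matters so much.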

Branch Prediction Can Solve the Problem With I-Fetch Latency
If we could predict with 100% certainty which direction a program branch is going (forward or backward in the instruction
stream), then we could make sure that the instructions following the branch instruction are in the correct sequence in the
pipeline. That's not possible, but improvement in the branch predictor can have a dramatic performance gain for these
modern, deeply-pipelined architectures. We'll analyze some branch prediction approaches later.
Data Memory Latency is Much Tougher to Handle
One way to deal with data latency is to have "non-blocking loads" so that other memory operations can proceed while
we're waiting for the data for a specific instruction to come back from the memory system. Every x86 architecture does this
now. Still, if the data is sitting in main memory when the load is being executed, the chip's performance will take a severe
hit. The key is to pre-fetch blocks of data before they're needed, and special instructions have been added to directly allow
the software to deal with the limited locality of data.
There are also some ways that the pipeline can help by buffering up load requests and using intelligent data pre-fetching
techniques based on the processor's knowledge of the instruction stream. We'll analyze some of the vendor solutions to
the problem of data access.
A Closer Look At Branch Prediction
The person in the back of the room will be happy to hear that things are about to get more complicated. We're now going to
explore some of the recent innovations in CPU microarchitecture, starting with branch prediction. All the easy techniques
have already been implemented. To get better prediction accuracy, microprocessor designers are combining multiple
predictors and inventing clever new algorithms.

There really are three different kinds of branches:

   •   Forward conditional branches - based on a run-time condition, the PC (Program Counter) is changed to point to an
       address forward in the instruction stream.
   •   Backward conditional branches - the PC is changed to point backward in the instruction stream. The branch is
       based on some condition, such as branching backwards to the beginning of a program loop when a test at the end
       of the loop states the loop should be executed again.
   •   Unconditional branches - this includes jumps, procedure calls and returns that have no specific condition. For
       example, an unconditional jump instruction might be coded in assembly language as simply "jmp", and the
       instruction stream must immediately be directed to the target location pointed to by the jump instruction, whereas a
       conditional jump that might be coded as "jmpne" would redirect the instruction stream only if the result of a
       comparison of two values in a previous "compare" instruction shows the values to not be equal. (The segmented
       addressing scheme used by the x86 architecture adds extra complexity, since jumps can be either "near" (within a
       segment) or "far" (outside the segment). Each type has different effects on branch prediction algorithms.)

Using Branch Statistics for Static Prediction
Forward branches dominate backward branches by about 4 to 1 (whether conditional or not). About 60% of the forward
conditional branches are taken, while approximately 85% of the backward conditional branches are taken (because of the
prevalence of program loops). Just knowing this data about average code behavior, we could optimize our architecture for
the common cases. A "Static Predictor" can just look at the offset (distance forward or backward from current PC) for
conditional branches as soon as the instruction is decoded. Backward branches will be predicted to be taken, since that is
the most common case. The accuracy of the static predictor will depend on the type of code being executed, as well as the
coding style used by the programmer. These statistics were derived from the SPEC suite of benchmarks, and many PC
software workloads will favor slightly different static behavior.
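The static rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the addresses are made up, and predicting forward branches as not taken is an assumption of the sketch (the text only says that backward branches are predicted taken).

```python
def static_predict_taken(branch_pc: int, target_pc: int) -> bool:
    """Predict a conditional branch from the sign of its offset alone."""
    offset = target_pc - branch_pc
    # Backward branch (negative offset): usually the bottom of a loop,
    # so predict taken.
    # Forward branch: predicted not taken here. That choice is an
    # assumption of this sketch, not something the statistics dictate.
    return offset < 0


# A loop-closing branch at 0x1040 that jumps back to 0x1000.
loop_prediction = static_predict_taken(0x1040, 0x1000)
# A forward skip from 0x1040 to 0x1080.
skip_prediction = static_predict_taken(0x1040, 0x1080)
```

Note that the predictor needs nothing but the decoded offset, which is why it can make its guess as soon as the instruction is decoded.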
Dynamic Branch Prediction with a Branch History Buffer (BHB)
To refine our branch prediction, we could create a buffer that is indexed by the low-order address bits of recent branch
instructions. In this BHB (sometimes called a "Branch History Table (BHT)"), for each branch instruction, we'd store a bit
that indicates whether the branch was recently taken. A simple way to implement a dynamic branch predictor would be to
check the BHB for every branch instruction. If the BHB's prediction bit indicates the branch should be taken, then the
pipeline can go ahead and start fetching instructions from the new address (once it computes the target address).
By the time the branch instruction works its way down the pipeline and actually causes a branch, the correct
instructions are already in the pipeline. If the BHB was wrong, a "misprediction" has occurred, and we'll have to flush out
the incorrectly fetched instructions and invert the BHB prediction bit.
Refining Our BHB by Storing More Bits
It turns out that a single bit in the BHB will be wrong twice for a loop--once on the first pass of the loop and once at the end
of the loop. We can get better prediction accuracy by using more bits to create a "saturating counter" that is incremented
on a taken branch and decremented on an untaken branch. It turns out that a 2-bit predictor does about as well as you
could get with more bits, achieving anywhere from 82% to 99% prediction accuracy with a table of 4096 entries. This size
of table is at the point of diminishing returns for 2-bit entries, so there isn't much point in storing more. Since we're only
indexing by the lower address bits, notice that 2 different branch addresses might have the same low-order bits and could
point to the same place in our table--one reason not to let the table get too small.
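To make the difference between a 1-bit and a 2-bit entry concrete, here is a minimal sketch of a BHB built from saturating counters. The class and method names, the "weakly taken" starting state, and the 10-iteration loop are assumptions chosen for illustration; only the 4096-entry table and the increment/decrement rule come from the discussion above.

```python
class BranchHistoryBuffer:
    """Table of saturating counters indexed by low-order address bits."""

    def __init__(self, entries=4096, bits=2):
        self.entries = entries
        self.max_count = (1 << bits) - 1      # 1 for 1-bit, 3 for 2-bit
        self.threshold = 1 << (bits - 1)      # predict taken at or above this
        # Assumption: every counter starts in the weakest "taken" state.
        self.counters = [self.threshold] * entries

    def index(self, branch_addr):
        # Only the low-order bits select an entry, so two branches whose
        # addresses differ by a multiple of the table size share a counter.
        return branch_addr % self.entries

    def predict(self, branch_addr):
        return self.counters[self.index(branch_addr)] >= self.threshold

    def update(self, branch_addr, taken):
        i = self.index(branch_addr)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, self.max_count)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)


def count_mispredictions(bhb, branch_addr, outcomes):
    misses = 0
    for taken in outcomes:
        if bhb.predict(branch_addr) != taken:
            misses += 1
        bhb.update(branch_addr, taken)
    return misses


# A loop branch: taken 9 times, then falls through; the loop runs 3 times.
loop_outcomes = ([True] * 9 + [False]) * 3
one_bit_misses = count_mispredictions(BranchHistoryBuffer(bits=1), 0x400123, loop_outcomes)
two_bit_misses = count_mispredictions(BranchHistoryBuffer(bits=2), 0x400123, loop_outcomes)
```

With these particular starting conditions the 1-bit table mispredicts twice on every re-entry of the loop (loop exit, then the first iteration of the next run), while the 2-bit counter only misses the loop exit.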
Two-Level Predictors and the GShare Algorithm
There is a further refinement we can make to our BHB by correlating the behavior of other branches. Often called a "Global
History Counter", this "two-level predictor" allows the behavior of other branches to also update the predictor bits for a
particular branch instruction and achieve slightly better overall prediction accuracy. One implementation is called the
"GShare algorithm". This approach uses a "Global Branch History Register" (a register that stores the global result of
recent branches) that gets "hashed" with bits from the address of the branch being predicted. The resulting value is used
as an index into the BHB where the prediction entry at that location is used to dynamically predict the branch direction. Yes,
this is complicated stuff, but it's being used in several modern processors.
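A small sketch may take some of the mystery out of it. The version below uses XOR as the hash, which is the usual choice for GShare; the 12-bit history length, the class name, and the 2-bit counters starting at "weakly taken" are assumptions made for the example and do not describe any particular processor.

```python
class GSharePredictor:
    """Two-level predictor: global history XOR branch address -> counter."""

    def __init__(self, history_bits=12):
        self.mask = (1 << history_bits) - 1
        self.history = 0                       # Global Branch History Register
        self.counters = [2] * (1 << history_bits)  # 2-bit counters, weakly taken

    def _index(self, branch_addr):
        # "Hash" the global history with the low-order address bits.
        return (branch_addr ^ self.history) & self.mask

    def predict(self, branch_addr):
        return self.counters[self._index(branch_addr)] >= 2

    def update(self, branch_addr, taken):
        i = self._index(branch_addr)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
        # Shift the newest outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def late_mispredictions(predictor, branch_addr, outcomes, warmup):
    """Count mispredictions after the first `warmup` branches."""
    misses = 0
    for n, taken in enumerate(outcomes):
        if n >= warmup and predictor.predict(branch_addr) != taken:
            misses += 1
        predictor.update(branch_addr, taken)
    return misses


# A branch that strictly alternates taken / not taken.
alternating = [True, False] * 100
gshare_misses = late_mispredictions(GSharePredictor(), 0x400123, alternating, warmup=100)
```

An alternating branch defeats a plain 2-bit counter about half the time, but here the history register distinguishes "last time was taken" from "last time was not taken", so each case trains its own counter and the pattern is predicted correctly once the history has filled.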
Using a Branch Target Buffer (BTB) to Further Reduce the Branch Penalty
In addition to a large BHB, most predictors also include a buffer that stores the actual target address of taken branches
(along with optional prediction bits). This table allows the CPU to look to see if an instruction is a branch and start fetching
at the target address early on in the pipeline processing. By storing the instruction address and the target address, even
before the processor decodes the instruction, it can know that it is a branch. The figure below shows an implementation of
a BTB. A large BTB can completely remove most branch penalties (for correctly-predicted branches) if the CPU looks far
enough ahead to make sure the target instructions are pre-fetched.
Using a Return Address Buffer to Predict the Return from a Subroutine
One technique for dealing with the unconditional branch at the end of a subroutine is to create a buffer of
the most recent return addresses. There are usually some subroutines that get called quite often in a program, and a
return address buffer can make sure that the correct instructions are in the pipeline after the return instruction.
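Both structures are simple enough to sketch. In the illustration below the BTB is modeled as an unbounded dictionary and the return address buffer as a small stack; real hardware uses fixed-size, tagged tables, and the depth of 8 and all names here are assumptions for the example.

```python
class BranchTargetBuffer:
    """Maps the address of a branch instruction to its last taken target."""

    def __init__(self):
        self.targets = {}

    def lookup(self, fetch_addr):
        # A hit tells the fetch stage, before decode, that this address
        # holds a branch and where to fetch next. A miss returns None.
        return self.targets.get(fetch_addr)

    def record_taken(self, branch_addr, target_addr):
        self.targets[branch_addr] = target_addr


class ReturnAddressBuffer:
    """Small stack of the most recent return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # buffer full: forget the oldest entry
        self.stack.append(return_addr)

    def predict_return(self):
        # The most recent call is the one that returns first.
        return self.stack.pop() if self.stack else None


btb = BranchTargetBuffer()
first_lookup = btb.lookup(0x2000)      # branch not seen yet
btb.record_taken(0x2000, 0x3000)
second_lookup = btb.lookup(0x2000)     # now redirects fetch to 0x3000

rab = ReturnAddressBuffer()
rab.on_call(0x1004)                    # outer call
rab.on_call(0x2008)                    # nested call
inner_return = rab.predict_return()
outer_return = rab.predict_return()
```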

Speculative, Out-of-Order Execution Gets a New Name

While RISC chips used the same terms as the rest of the computer engineering community, the Intel marketing department
decided that the average consumer wouldn't like the idea of a computer that "speculates" or runs programs "out of order".
A nice warm-and-fuzzy term was coined for the P6 architecture, and "Dynamic Execution" was added to our list of
non-descriptive buzzwords.

Both AMD and Intel use a microarchitecture that, after decoding into simpler RISC instructions, tosses the instructions into
a big hopper and allows them to execute in whatever order best matches the available compute resources. Once the
instructions have finished executing out of order, results get "committed" in the original program order. The term
"speculation" refers to instructions being speculatively fetched, decoded and executed.
A useful analogy can be drawn to the stock market investor who "speculates" that a stock will go up in value and justify an
investment. For a microprocessor speculating on instructions in advance, if the speculation turns out to be incorrect, those
instructions are eliminated before any machine state changes are committed (written to processor registers or memory).
Once Again, Let's Take a Step Back and Try Some More Intuitive Analysis
By now that person in the back of the room has finally gotten used to these short pauses to look at the big picture. In this
case, we just made a huge change to our machine, and it's hard to easily conceptualize. We've completely scrambled the
notion of how instructions flow down a one-way pipeline. One thing that becomes obvious is the need for darn good branch
prediction. All that speculation becomes wasted memory bandwidth, execution time, and power if we end up taking a
branch we didn't expect. Following our stock investor analogy, if the value doesn't go up, then the investment was wasted
and could have been more productively used elsewhere. In fact, the speculation could make us worse off.
The need to wait before committing completed instructions to registers or memory
should probably be obvious, since we could end up with incorrect program behavior and
incorrect data--then have to try to unwind everything when a branch misprediction (or an
exception) comes along. The real power of this approach would seem to be realized by
having lots of superscalar stages, since we can reorder the instructions to better match
the issue restrictions of multiple compute resources. OK, enough speculation, let's dig
into the details:
Register Renaming Creates Virtual Registers
If you're going to have speculative instructions operating out of order, then you can't have them all trying to change the
same registers. You need to create a "register alias table (RAT)" that renames and maps the eight x86 registers to a much
larger set of temporary internal register storage locations, permitting multiple instances of any of the original eight registers.
An instruction will load and store values using these temporary registers, while the RAT keeps track of what the latest
known values are for the actual x86 registers. Once the instructions are completed and re-ordered so that we know the
register state is correct, then the temporary registers are committed back to the real x86 registers.
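A toy version of the renaming step might look like the sketch below. The pool of 40 physical registers, the class name, and the three-instruction example are assumptions for illustration, and the sketch never returns registers to the free list at commit time as real hardware must.

```python
X86_REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


class RegisterAliasTable:
    """Maps the eight x86 registers onto a larger pool of physical registers."""

    def __init__(self, num_physical=40):
        # Initially architectural register i lives in physical register i.
        self.mapping = {reg: i for i, reg in enumerate(X86_REGS)}
        self.free_list = list(range(len(X86_REGS), num_physical))

    def rename(self, dest, sources):
        # Sources read whichever physical register holds the latest value.
        src_phys = [self.mapping[s] for s in sources]
        # The destination gets a fresh physical register, so an older
        # in-flight instruction that still needs the previous value of
        # `dest` is not disturbed.
        new_phys = self.free_list.pop(0)
        self.mapping[dest] = new_phys
        return new_phys, src_phys


rat = RegisterAliasTable()
first = rat.rename("eax", ["ebx", "ecx"])   # eax = ebx + ecx
second = rat.rename("edx", ["eax"])         # edx = eax   (reads the new eax)
third = rat.rename("eax", ["esi"])          # eax = esi   (a second eax instance)
```

The first and third instructions both write eax, yet they land in different physical registers, so the third can execute before the second has read its operand.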
The Reorder Buffer (ROB) Helps Keep Instructions in Order
After an instruction is decoded, it's allowed to execute out of order as soon as the operands (data) become available. A
special Reorder Buffer is created to keep track of instruction status, such as when the operands become available for
execution, or when the instruction has completed execution and results can be "committed" or "retired" to architectural
registers or memory in the original program order. These instructions use the renamed register set and are "dispatched" to
the execution units as resources become available, perhaps spending some time in "reservation stations" that operate as
instruction queues at the front of various execution units. After an instruction has finished executing, it can be "retired" by
the ROB. However, the state still isn't committed until all the older instructions (with respect to program order) have been
retired first.
A neat thing about using register renaming, reservation stations, and the ROB is that a result from a completed instruction
can be forwarded directly to the renamed register of a new instruction. Many potential data dependencies go away
completely, and the pipelines are kept moving.
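The in-order retirement rule is the heart of the ROB, and it can be sketched with a simple queue. The class, the dictionary-based entries, and the three-instruction example are hypothetical; the sketch tracks only completion status, not operands or results.

```python
from collections import deque


class ReorderBuffer:
    """Instructions enter in program order and leave in program order."""

    def __init__(self):
        self.entries = deque()        # oldest instruction on the left

    def allocate(self, name):
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def retire_ready(self):
        # Only the oldest instruction may retire, and only once it has
        # finished executing. Younger finished instructions must wait.
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
load, add, store = (rob.allocate(n) for n in ("load", "add", "store"))

store["done"] = True                  # youngest instruction finishes first
after_store = rob.retire_ready()      # nothing can retire yet
load["done"] = True
after_load = rob.retire_ready()       # the oldest instruction retires
add["done"] = True
after_add = rob.retire_ready()        # add retires, which unblocks store
```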
Load and Store Buffering Tries to Hide Data Access Latency
In the same way that instructions are executed as soon as resources become available, a load or a store instruction can
get an early start by using this speculative approach. Obviously, the stores can't actually get sent all the way to memory
until we're sure the data really should be changed (requiring we maintain program order). Instead, the stores are buffered,
retired, and committed in order. The loads are a more interesting case, since they are directly affected by memory latency,
the other key problem we highlighted earlier. The hardware will speculatively execute the load instruction, calculating the
Effective Address out of order. Depending on the implementation, it may even allow out-of-order cache access, as long as
the loads don't access the same address as a previous store instruction still in the processor pipeline, but not yet
committed. If in fact the load instruction needs the results of a previous store that has completed but is still in the machine,
the store data can get forwarded directly to the load instruction (saving the memory load time).
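Store-to-load forwarding can be sketched as a search of the buffered stores before going to memory. The function name, the tuple layout of the store buffer, and the example addresses are assumptions; the sketch also ignores partial overlaps and access sizes, which real hardware has to handle.

```python
def execute_load(address, store_buffer, memory):
    """Return (value, source) for a load.

    store_buffer holds (address, value) pairs for older stores that have
    executed but not yet been committed, oldest first.
    """
    # Search youngest-first so the load sees the most recent older store.
    for store_addr, value in reversed(store_buffer):
        if store_addr == address:
            return value, "forwarded"
    # No pending store to this address: safe to read the cache or memory.
    return memory.get(address, 0), "memory"


memory = {0x100: 1, 0x200: 7}
pending_stores = [(0x100, 5), (0x100, 9)]     # two uncommitted stores to 0x100

forwarded = execute_load(0x100, pending_stores, memory)
from_memory = execute_load(0x200, pending_stores, memory)
```

Memory still holds the stale value 1 at address 0x100, but the load receives 9 from the youngest buffered store without waiting for the commit.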

Analyzing Some Real Microprocessors: P4

We've come to the end of our tutorial on processor microarchitecture. Hopefully, we've given you enough analytical tools
so that you're now ready to dig into the details of some real products. There are a few common microarchitectural features
(like instruction translation) that we decided would be easier to explain as we show some real implementations. We'll also
look a bit deeper at the arcane science of branch prediction. Let's now take an objective look at the Intel P4, AMD Athlon,
and VIA/Centaur C3. We'll then do some more big-picture analysis and gaze forward to predict the future of PC

Intel Pentium 4 Microarchitecture
Intel is vigorously promoting the Pentium 4 as the preferred desktop processor, so we'll focus our Intel analysis on this
microarchitecture. We'll make a few comparisons to previous processor generations, but our goal is to gain a detailed
understanding of how the Pentium 4 meets its design goals. We'll leave it as an "exercise for the reader" to apply your new
analytical tools to the Pentium III. The Pentium 4 is the first x86 chip to use some newer microarchitectural innovations,
offering us an opportunity to explore some of these new approaches to dealing with the 2 key latency-induced challenges
in CPU design.
We should point out that our analysis only covers the "Willamette" version of the P4, while the forthcoming "Northwood" will
move to a .13 micron process geometry and make slight changes to the microarchitecture (most likely improving the
memory subsystem). We'll update this article when we get more information on Northwood.
The NetBurst™ Moniker Describes a Collection of Design Features
What's the point of introducing a new product without adding a new Intel buzzword? In this case, the name doesn't refer to
a single architectural improvement, but is really meant to serve as a name for this family of microprocessors. The NetBurst
design changes include a deeper pipeline, new bus architecture, more execution resources, and changes to the memory
subsystem. The figure below shows a block diagram of the Pentium 4, and we'll take a look at each major section.
Deeply Pipelined for Higher Clock Rate
The Pentium 4 has a whopping 20-stage pipeline when processing a branch misprediction. The figure below shows how
this pipeline compares to the 10 stages of the Pentium III. The most interesting thing about the Pentium 4 pipe is that Intel
has dedicated 2 stages for driving data across the chip. This is fascinating proof that the limiting factor in modern IC design
has become the time it takes to transmit a signal across the wire connections on the chip. To understand why it's
fascinating, consider that it wasn't so long ago that designers only worried about the speed of transistors, and the time it
took to traverse such a short piece of metal was considered essentially instantaneous. Now we're moving from aluminum
to copper, largely because copper's lower resistance lets signals propagate faster. (I can see that person in the back of the room is still with us
and is nodding in agreement.) This is fascinating stuff, and Intel is probably the first vendor to design a pipeline with "Drive"

What About All Those Problems with Long Pipelines?
Well, Intel has to work especially hard to make sure they avoid pipeline hazards. If that long pipeline needs to be flushed
very often, then the performance will be much lower than other designs. We should remind ourselves that the longer
pipeline actually results in less work being done on each clock cycle. That's the whole point of super-pipelining (or
hyper-pipelining, if you prefer), since doing less work in a clock cycle is what allows the clock cycle time to be shortened.
The pipeline has to run at a higher frequency just to do the same amount of work as a shorter pipeline. All other things
being equal, you'd expect the Pentium 4 to have less performance than parts with shorter pipelines at the same frequency.
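A rough back-of-the-envelope model shows how quickly pipeline flushes add up. Every number below is a hypothetical input, not a measurement: the sketch assumes an ideal base of one cycle per instruction, that 20% of instructions are branches, that 5% of those are mispredicted, and that the flush penalty roughly equals the pipeline depth.

```python
def cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average CPI once branch misprediction flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


# Hypothetical workload parameters (assumptions, not measured data).
BRANCH_FRACTION = 0.20
MISPREDICT_RATE = 0.05

# Assumption: the flush penalty is about as long as the pipeline.
short_pipe_cpi = cycles_per_instruction(1.0, BRANCH_FRACTION, MISPREDICT_RATE, 10)
long_pipe_cpi = cycles_per_instruction(1.0, BRANCH_FRACTION, MISPREDICT_RATE, 20)

# How much faster the deeper pipeline must be clocked just to break even.
break_even_clock_ratio = long_pipe_cpi / short_pipe_cpi
```

Under these assumptions the 20-stage design needs about 1.2 cycles per instruction against 1.1 for the 10-stage design, so it has to run roughly 9% faster before it pulls ahead. A worse misprediction rate widens that gap.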
Searching for Even More Instruction-Level Parallelism
As we learned, there is another thing to realize about long pipelines (besides being able to run at the high clock rates that
motivate uninformed buyers). Longer pipelines allow more instructions to be in process at the same time. The compiler
(static scheduler) and the hardware (dynamic scheduler) must keep the faster and deeper pipeline fed with the instructions
and data it needs during a larger instruction "window". The machine is going to have to search even further to find
instructions that can execute in parallel. As you'll see, the Pentium 4 can have an incredible 126 instructions in-flight as it
searches further and further ahead in the instruction stream for something to work on while waiting for data or resource
dependencies to clear.

Pentium 4's Cache Organization

Cache Organization in the Memory Hierarchy
As we described in our article on motherboard technology, there is usually a trade-off between cache size and speed. This
is mostly because of the extra capacitive loading on the signals that drive the larger SRAM arrays. Refer again to the block
diagram of the Pentium 4. Intel has chosen to keep the L1 caches rather small so that they can reduce the latency of cache
accesses. Even a data cache hit will take 2 cycles to complete (6 cycles for floating-point data). We'll talk about the L1
caches in a moment, but further down the hierarchy we find that the L2 cache is an 8-way, unified (includes both instruction
and data), 256KB cache with a 128B line size.

The 8-way structure means it has 8 sets of tags, providing about the same cache miss rate as a "fully-associative" cache
(as good as it gets). This makes the 256KB cache more effective than its size indicates, since the miss rate of this cache is
approximately 60% of the miss rate for a direct-mapped (1-way) cache of the same size.
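The cache geometry quoted above determines how an address is carved up. The sketch below works that out for a 256KB, 8-way cache with 128-byte lines; it assumes the conventional offset/index/tag split, which Intel's documents may or may not follow exactly, and the function name and sample address are invented for the example.

```python
CACHE_BYTES = 256 * 1024
WAYS = 8
LINE_BYTES = 128

# Each set holds one line per way, so the set count falls out directly.
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)

OFFSET_BITS = LINE_BYTES.bit_length() - 1    # bits that select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1       # bits that select the set


def split_address(addr):
    """Split an address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    set_index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, offset


tag, set_index, offset = split_address(0x12345678)
```

A lookup picks one of the 256 sets from the index bits and then compares the tag against all 8 ways of that set at once, which is where the extra access time of a highly associative cache comes from.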

The downside is that an 8-way cache will be slower to access. Intel states that the load
latency is 7 cycles (this reflects the time it takes an L2 cache line to be fully retrieved to
either the L1 data cache or the x86 instruction prefetch/decode buffers), but the cache is
able to transfer new data every 2 cycles (which is the effective throughput assuming
multiple concurrent cache transfers are initiated). Again, notice that the L2 cache is
shared between instruction fetches and data accesses (unified).
System Bus Architecture is Matched to Memory Hierarchy Organization
One interesting change for the L2 cache is to make the line size 128 bytes, instead of
the familiar 32 bytes. The larger line size can slightly improve the hit rate (in some cases), but requires a longer latency for
cache line refills from the system bus. This is where the new Pentium 4 bus comes into play. Using a 100MHz clock and
transferring data four times on each bus clock (which Intel calls a 400MHz data rate), the 64-bit system bus can bring in 32
bytes on each bus clock cycle. This translates to a bandwidth of 3.2 GB/sec.
Filling an L2 cache line requires four bus cycles (the same number of cycles the P6 bus needs for a 32-byte line). Note that the
system bus protocol has a 64-byte access length (matching the line size of the L1 cache) and requires 2 main memory
request operations to fill an L2 cache line. However, the faster bus only helps overcome the latency of getting the extra
data into the CPU from the North Bridge. The longer line size still causes a longer latency before getting all the burst data
from main memory. In fact, some analysts note that P4 systems have about 19% more memory latency than Pentium III
systems (measured in nanoseconds for the demand word of a cache refill). Smart pre-fetching is critical or else the P4 will
end up with less performance on many applications.
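The bus figures quoted in this section can be checked with a few lines of arithmetic. All the inputs come from the text above; only the variable names are invented.

```python
BUS_CLOCK_HZ = 100_000_000     # 100MHz bus clock
TRANSFERS_PER_CLOCK = 4        # "quad-pumped" data transfers
BUS_WIDTH_BYTES = 8            # 64-bit data bus
L2_LINE_BYTES = 128
BUS_ACCESS_BYTES = 64          # access length of the bus protocol

bytes_per_clock = TRANSFERS_PER_CLOCK * BUS_WIDTH_BYTES
bandwidth_bytes_per_sec = BUS_CLOCK_HZ * bytes_per_clock

# Bus clocks and memory requests needed to fill one L2 line.
clocks_per_l2_line = L2_LINE_BYTES // bytes_per_clock
requests_per_l2_line = L2_LINE_BYTES // BUS_ACCESS_BYTES
```

The results are 32 bytes per bus clock, 3.2 GB/sec, four bus clocks per L2 line, and two memory requests per L2 line. These are peak transfer figures; they say nothing about the latency before the first data arrives.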
Pre-Fetching Hardware Can Help if Data Accesses Follow a Regular Pattern
The L2 cache has pre-fetch hardware to request the next 2 cache lines (256 bytes) beyond the current access location.
This pre-fetch logic has some intelligence to allow it to monitor the history of cache misses and try to avoid unnecessary
pre-fetches (that waste bandwidth and cache space). We'll talk more about the pre-fetcher later, but let's take a quick
pause for some of our patented intuitive analysis. We've described the problem of dealing with streaming media types (like
video) that don't spend much time in the cache. The hardware pre-fetch logic should easily notice the pattern of cache
misses and then pre-load data, leading to much better performance on these types of applications.
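The address arithmetic behind a "next 2 lines" pre-fetch is easy to sketch. The function below only computes which lines would be requested for a given access; it does not model the miss-history logic that decides whether to pre-fetch at all, and the function name and sample address are assumptions.

```python
LINE_BYTES = 128       # L2 line size
PREFETCH_LINES = 2     # lines requested beyond the current access


def prefetch_candidates(access_addr):
    """Return the starting addresses of the next lines to pre-fetch."""
    # Round the access down to the start of its own cache line.
    line_base = access_addr - (access_addr % LINE_BYTES)
    return [line_base + LINE_BYTES * n for n in range(1, PREFETCH_LINES + 1)]


# An access at 0x1234 falls in the line starting at 0x1200.
candidates = prefetch_candidates(0x1234)
```

For a program that streams through memory in ascending order, the two candidate lines (256 bytes in total) are exactly what it will ask for next, which is why this scheme suits video and similar workloads.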
Designing for Data Cache Hits
Intel boasts of "new algorithms" to allow faster access to the 8KB, four-way, L1 data cache. They are most likely referring
to the fact that the Pentium 4 speculatively processes load instructions as if they always hit in the L1 data cache (and data
TLB). By optimizing for this case, there aren't any extra cycles burned while cache tags are checked for a miss. The load
instruction is sent on its merry way down the pipeline; if a cache miss delays the load, the processor passes temporarily
incorrect data to dependent instructions that assumed the data arrived in 2 cycles. Once the hardware discovers the L1
data cache miss and brings in the actual data from the rest of the memory hierarchy, the machine must "replay" any
instructions that had data dependencies and grabbed the wrong data.
It's unclear how efficient this approach will be, since it obviously depends on the load pattern for the applications. The
worst case would be an application that constantly loads data that is scattered around memory, while attempting to
immediately perform an operation on each new data value. The hardware pre-fetch logic would (perhaps mercifully) never
"trigger", and the pipeline would be constantly restarting instructions.
Again, the Pentium 4 design seems to have been optimized for the case of streaming media (just as Intel claims), since
these algorithms are much more regular and demand high performance. The designers probably hope that the
pathological worst case only occurs for code that doesn't need high performance. When the L1 data cache does have a
miss, it has a "fat pipe" (32 bytes wide) to the L2 cache, allowing each 64-byte cache line to be refilled in 2 clocks. However,
there is a 7-cycle latency before the L2 data starts arriving, as we mentioned previously. The Pentium 4 can have up to
four L1 data cache misses in process.
Pentium 4's Trace Cache

The Trace Cache Depends on Good Branch Prediction
Instead of a classic L1 instruction cache, the Pentium 4 designers felt confident enough in their branch prediction
algorithms to implement a trace cache. Rather than storing standard x86 instructions, the trace cache stores the
instructions after they've already been decoded into RISC-style instructions. Intel calls them "µops" (micro-ops) and stores
6 µops for each "trace line". The trace cache can house up to 12K µops. Since the instructions have already been decoded,
hardware knows about any branches and fetches instructions that follow the branch. As we learned, it's the conditional
branches that could really cause a problem, since we won't know if we're wrong until the branch condition check in
Arithmetic Logic Unit 0 (ALU0) of the execution core. By then, our trace cache could have pre-fetched and decoded a lot of
instructions we don't need. The pipeline could also allow several out-of-order instructions to proceed if the branch
instruction was forced to wait for ALU0.

Hopefully, the alternative branch address is somewhere in the trace cache. Otherwise, we'll have to pay those 7 cycles of
latency to get the proper instructions from the L2 cache (pity us if it's not there either, as the L2 cache would need to get
the instructions from main memory) plus the time to decode the fetched x86 instructions. Intel's reference to the 20-stage
P4 pipeline actually starts with the trace cache, and does not include the cycles for instruction or data fetches from system
memory or L2 cache.
The Trace Cache has Several Advantages
If our predictors work well, then the trace cache is able to provide (the correct) three µops per cycle to the execution
scheduler. Since the trace cache is (hopefully) only storing instructions that actually get executed, then it makes more
efficient use of the limited cache space. Since the branch target instruction has already been decoded and fetched in
execution order, there isn't any extra latency for branches. The person in the back of the room just reminded us of an
interesting point. We never mentioned a TLB check for the trace cache, because it does not use one. So, the Pentium 4
isn't so complicated after all. Most of you correctly observed that this cache uses virtual addressing, so there isn't any need
to convert to physical addresses until we access the L2 cache. Intel documents don't give the size of the instruction TLB for
the L2 cache.
Pentium 4 Decoder Relies on Trace Cache to Buffer µops
The Pentium 4 decoder can only convert a single x86 instruction on each clock, fewer than other architectures. However,
since the µops are cached in the trace cache (and hopefully reused), the decode bandwidth is probably adequate to match
the instruction issue rate (three µops/cycle). If an x86 instruction requires more than four µops, then the decoder fetches
µops directly from a µops "Read-Only Memory (ROM)". All x86 processor architectures use some sort of ROM for
infrequently used instructions or multi-cycle string operations.
The Execution Engine Runs Out Of Order

For an out-of-order machine, the main design goal is to provide enough parallel compute resources to make it worth all the
extra complexity. In this case, the machine is working to schedule instructions for 7 different parallel units, shown in the
figure below. Two of these units dispatch loads and stores (the Data Access stage of our original computer model). The
other processing tasks use multiple schedulers and are dispatched through the 2 Exec Ports. Each port could have a fast
ALU operation scheduled every half cycle, though other µops get scheduled every cycle. The figure below shows what
each port can dispatch.

Notice the numerous issue restrictions (structural hazards). If you were to have just fast ALU µops on both Exec Ports and
a simultaneous Load and Store dispatch, then a total of 6 µops/cycle (four double-speed ALU instructions, a Load, and a
Store) can be dispatched to execution units. The performance of the execution engine will depend on the type of program
and how well the schedulers can align µops to match the execution resources.
Retiring Instructions in Order and Updating the Branch Predictors
The Reorder Buffer can retire three µops/cycle, matching the instruction issue rate. There are some subtle differences in
the way the Pentium 4 ROB and register renaming are implemented compared to other processors like the Pentium III, but
the operation is very similar. As we've shown, a key to performance is to avoid mispredicted branches. As instructions are
retired from the ROB, the final branch addresses are used to update the Branch Target Buffer and Branch History Buffer.
In case some of you have finally figured out modern branch predictors, Intel has chosen to rename the combination of a
BTB and a BHB. Intel calls the combination a "Branch Target Buffer (BTB)", ensuring extra confusion for our new students
of computer microarchitecture.
Branch Prediction Uses a Combination of Schemes
While there isn't much public information about how the Pentium 4 does branch prediction, Intel likely uses a two-level
predictor and combines it with information from the Static Prediction we discussed earlier. The design also includes a
Return Address Buffer of some undisclosed size. The specific algorithms are part of the "secret sauce" that processor vendors guard
closely. In the past, we've seen various patent filings describing algorithmic mechanisms used in branch predictors and
other processor subsystems. The patent details shed more light on their implementations than processor vendors would
otherwise choose to disclose publicly.
Branch Hints Can Allow Faster Performance on a Known Data Set
The Pentium 4 also allows software-directed branch hints to be passed as prefixes to branch instructions. These branch
hints allow the software to override the Static Predictor and can be a powerful tool. This is particularly true if the program is
compiled and executed with special features enabled to collect information about program flow. The information from the
prior run can be fed back to the compiler to create a new executable with Branch Hints that avoid the earlier mispredictions.
There is some potential for marketing abuse of this feature, since benchmarks that use a repeatable data set can be
optimized to avoid performance-killing branch mispredictions.
Support for New Media Instructions
The Pentium 4 has retained the earlier x86 instruction extensions (MMX and SSE) and added 144 new instructions they
call SSE2. It will be the task for another article to give a complete analysis and comparison of the x86 instruction
extensions and execution resources. However, as we've noted several times, the Pentium 4 is tuned for performance on
streaming media applications.
Poor Thermal Management Can Limit Performance
One potentially troubling feature of the Pentium 4 is the "Thermal Monitor" that can be enabled to slow the internal clock
rate to half speed (or less, depending on the setting) when the die temperature exceeds a certain value. On a 1.5 GHz
Pentium 4 (Willamette), this temperature currently equates to 54.7 Watts of power (according to Intel's Thermal Design
Guide and P4 datasheet). This is almost certainly a limitation of the package and heat sink, but the maximum power
dissipation of a 1.5 GHz part is currently about 73 Watts.
Intel would argue that this maximum would never be reached, but it is quite possible that demanding applications will
cause a poorly-cooled CPU to exceed the current thermal cut-off point - losing performance at a time when you need it the
most. As Intel moves to lower voltages in a more advanced manufacturing process, these limits will be less of a problem -
at current clock rates. As higher clock rate parts are introduced, the potential performance loss will again be an issue.
Certainly, the Thermal Monitor is a good feature for ensuring that parts don't destroy themselves. It also is a clever solution
to the problem of turning on fans quickly enough to match the high thermal ramp rates. The concerns may only arise for
low-cost, inadequate heatsinks and fans. Customers may appreciate the system stability this feature offers, but not the
uncertainty about whether they're getting all the performance they paid for. We've heard from one of Intel's competitors
that certain Dell and HP Pentium 4 systems they tested do not enable this clock slow-down feature. This is actually a good
thing if Dell and HP are confident about their thermal solution. We plan to write a separate report on our testing of this
feature soon.
Overall Conclusions About the Pentium 4
The large number of complex new features in this processor has required a lot of explanation. Clearly, this is a design that
is intended to scale to dramatically higher clock rates. Only at higher clock rates does the benefit of the microarchitecture
become realized. It is also likely that the designers were forced to make painful trade-offs in the sizes for the on-chip
memory hierarchy. With a microarchitecture so sensitive to cache misses, it will be critical to increase the size of these
memories as transistor budgets increase. With good thermal management, higher clock rates and bigger caches, this chip
should compete well in desktop systems in the future, while doing very well today with streaming media, memory
bandwidth-intensive applications, and functions that use SSE2 instructions.
AMD Athlon Microarchitecture

The Athlon architecture is more similar to our earlier analysis of speculative, out-of-order machines. This similarity is partly
due to the (comforting) maturity of the architecture, but it should be noted that the original design of the Athlon
microarchitecture emphasized performance above other factors. The more aggressive initial design approach keeps the
architecture sustainable while minor optimizations are implemented for clock speed or die cost.

AMD will soon ship a new version of Athlon, code-named "Palomino" and possibly sporting bigger caches and subtle
changes to the microarchitecture. For this article, we examine "Thunderbird", the design introduced in June 2000.
Parallel Compute Resources Benefit From Out-of-Order Approach
The extra complexity of creating an out-of-order machine is wasted if there aren't parallel compute resources available for
taking advantage of those exposed instructions. Here is where Athlon really shines. The microarchitecture can execute up to 9
RISC instructions simultaneously (what AMD calls "Ops").
The figure below shows the block diagram of Athlon. Note the extra resources for standard floating-point Ops, likely
explaining why this processor does so well on FP-intensive programs. (Well, that person in the back of the room is still with
us.) Yes, indeed the comparative analysis gets more complex if we include the P4's SSE2 instructions for SIMD
floating-point, but we'll have to leave that analysis for another article. The current Athlon architecture will certainly have
higher performance for applications that don't have high data-level parallelism.

Cache Architecture Emphasizes Size to Achieve High Hit Rate
Note that AMD has chosen to implement large L1 caches. The L1 instruction and data caches are each 2-way, 64KB
caches. The L1 instruction cache has a line-size of 64 bytes with a 64-byte sequential pre-fetch. The L1 data cache
provides a second data port to avoid structural hazards caused by the superscalar design. The L2 cache is a 16-way,
256KB unified cache, backed up by the fast EV6 bus we discussed in the motherboard article.
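To make these cache numbers concrete, here is a minimal Python sketch (ours, not AMD's) of how the stated sizes translate into sets and address fields. The function names are hypothetical, and the 64-byte line size used for the L2 cache is an assumption, since only the L1 instruction cache line size is given above.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 for power-of-two sizes
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# L1 instruction cache from the text: 64KB, 2-way, 64-byte lines.
l1_sets, l1_offset_bits, l1_index_bits = cache_geometry(64 * 1024, 2, 64)

# L2 from the text: 256KB, 16-way. The 64-byte line size is ASSUMED here.
l2_sets, _, _ = cache_geometry(256 * 1024, 16, 64)

print("L1 I-cache sets:", l1_sets)
print("L2 sets (assuming 64-byte lines):", l2_sets)
print("fields for 0x12345678:", split_address(0x12345678, l1_offset_bits, l1_index_bits))
```

Under these assumptions the L1 instruction cache works out to 512 sets of two lines each, so any two addresses that share the same 9 index bits compete for just two slots.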
If we take a step back and think about differences between P4 and Athlon memory hierarchies, we can make a few
observations. Intel's documentation states that their 12K trace cache will have the same hit rate as an "8K to 16K byte
conventional instruction cache". By that measure, the Athlon will have much better hit rates, though hits will have longer
latency for decoding instructions. An L1 miss is much worse for the P4's longer pipeline, though smart pre-fetching can
overcome this limitation. Remember, at these high clock rates, it doesn't take long to drain an instruction cache. It will
eventually come down to the accuracy of the branch predictor, but the Pentium 4 will still need a bigger trace cache to
match Athlon instruction fetch effectiveness.
Pre-Decoding Uses Extra Cache Bits
To deal with the complexities of the x86 instruction set, AMD does some early decoding of x86 instructions as they are
fetched into the L1 instruction cache. These extra bits help mark the beginning and end of the variable-length instructions,
as well as identify branches for the pre-fetcher (and predictor). These extra bits and early (partial) decoding give some of
the benefits of a trace cache, though there is still latency for the completion of the decoding.
Final Decoding Follows 2 Different Paths
Figure 9 shows the decode pipeline for the Athlon. Notice that it matches the flow of our original computer model, breaking
up the Instruction Access and Decode stages into 6 pipeline stages. AMD uses a fixed-length instruction format called a
"MacroOp", containing one or more Ops. The instruction scheduler will turn MacroOps into Ops as it dispatches to the
execution units. The "DirectPath Decoder" generates MacroOps that take one or two Ops. The "VectorPath Decoder"
handles longer instructions by fetching their MacroOps from ROM. Notice in the figure below that the Athlon can supply three
MacroOps/cycle to the instruction decoder (the IDEC stage), and later they'll enter the instruction scheduler, equating to a
maximum of 6 Ops/cycle decode bandwidth. Note that the actual decode performance depends on the type of instructions.
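The decode bandwidth arithmetic is easy to check with a few lines of Python. This is only a rough sketch: the function name is hypothetical, it models DirectPath instructions only (each MacroOp carrying one or two Ops), and it ignores VectorPath instructions and any other real-world decode restrictions.

```python
def decode_ops_per_cycle(ops_per_macroop, macroops_per_cycle=3):
    """Average Ops decoded per cycle for a stream of DirectPath MacroOps.

    ops_per_macroop: list with one entry per MacroOp, each 1 or 2.
    """
    if not ops_per_macroop:
        return 0.0
    cycles = 0
    total_ops = 0
    # The decoder accepts up to three MacroOps per cycle (from the text).
    for start in range(0, len(ops_per_macroop), macroops_per_cycle):
        group = ops_per_macroop[start:start + macroops_per_cycle]
        total_ops += sum(group)
        cycles += 1
    return total_ops / cycles


best_case = decode_ops_per_cycle([2, 2, 2, 2, 2, 2])    # every MacroOp has 2 Ops
simple_code = decode_ops_per_cycle([1, 1, 1, 1, 1, 1])  # every MacroOp has 1 Op
mixed_code = decode_ops_per_cycle([1, 2, 1, 2, 1, 2])
print(best_case, simple_code, mixed_code)
```

The 6 Ops/cycle peak is only reached when every MacroOp carries two Ops, which is why the actual rate depends on the instruction mix.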
AMD Athlon Scheduler, Data Access

Integer Scheduler Dispatches Ops to 6 Execution Units
The figure below shows how pipeline stage 7 buffers up to 18 MacroOps that are dispatched as Ops to the integer
execution units. This buffer (a reservation station) is where instructions wait for operands (including data from memory) to become
available before executing out of order. As you'll recall, there is a Reorder Buffer that keeps track of instruction status,
operands, and results, ensuring the instructions are retired and committed in program order. Note that Integer Multiply
instructions require more compute resources and force extra issue restrictions.
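The split between out-of-order execution and in-order retirement can be sketched in a few lines of Python. This is a toy model under our own assumptions, not the Athlon's actual Reorder Buffer: the class and method names are hypothetical, and it tracks only a "done" flag per instruction.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: entries are kept in program order."""

    def __init__(self):
        self.entries = deque()

    def dispatch(self, name):
        """Add an instruction in program order and return its entry."""
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def retire(self):
        """Retire finished instructions from the head only, in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
load = rob.dispatch("load")   # oldest instruction, waiting on memory
add = rob.dispatch("add")
mul = rob.dispatch("mul")

mul["done"] = True            # the youngest instruction finishes first
first = rob.retire()          # nothing retires: the load is still pending

load["done"] = True
second = rob.retire()         # only the load retires; the add is not done yet

add["done"] = True
third = rob.retire()          # the add and the already-finished mul retire together
print(first, second, third)
```

Even though the multiply finished first, it cannot be committed until everything ahead of it in program order has completed.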

Data Access Forces Instructions to Wait
Even for an out-of-order machine, our original computer model still holds up well. Notice in the figure below that loads and
stores will use the "Address Generation Units (AGUs)" to calculate the Effective Address (cycle 9 ADDGEN stage) and
access the data cache (cycle 10 DC ACC). In cycle 11, the data cache sends back a hit/miss response (and potentially the
data). If another instruction is waiting in the scheduler for this data, the data is forwarded. Cache misses will cause the
instructions to wait. There is a separate 44-entry Load/Store Unit (LSU) that manages these instructions.

Floating Point Instructions Have Their Own Scheduler and Pipeline
The Athlon can simultaneously process three types of floating-point instructions (FADD, FMUL, and FSTORE), as shown
in the figure below. The floating-point units are "fully pipelined", so that new FP instructions can start while other
instructions haven't yet completed. MMX/3DNow! instructions can be executed in the FADD and FMUL pipelines. The FP
instructions execute out of order, and each of the three pipelines has several different execution units. There are some
issue restrictions that apply to these pipelines. The performance of the Athlon's fully-pipelined FP units allows it to
consistently outperform the Pentium III at similar clock speeds, and a 1.33GHz Athlon even performs better than a 1.5GHz
Pentium 4 in some FP benchmarks. However, we haven't seen enough SSE2-optimized applications to draw a definitive
conclusion about applications that may benefit from SSE2.

Branch Prediction Logic is a Combination of the Latest Methods
There is a 2048-entry Branch Target Buffer that caches the predicted target address. This works in concert with a Global
History Table that uses a "bimodal counter" to predict whether branches are taken. If the prediction is correct, then there is
a single-cycle delay to change the instruction fetcher to the new address. (Note that the P4 trace cache doesn't have any
predicted-branch-taken delays). If the predictor is wrong, then the minimum delay is 10 cycles. There is also a 12-entry
Return Address Buffer.
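For readers who want to see the moving parts, here is a minimal Python sketch of a target buffer paired with saturating counters. It is an illustration under stated assumptions, not AMD's design: the 2048 entries come from the text, but the direct-mapped indexing, the 2-bit counter width, the initial counter state, and all names are our own choices, and the global history is left out entirely.

```python
BTB_ENTRIES = 2048  # entry count from the text; direct-mapped indexing is ASSUMED


class BranchPredictor:
    def __init__(self):
        self.btb = {}       # index -> last seen target address
        self.counters = {}  # index -> 2-bit saturating counter (0..3), ASSUMED width

    def _index(self, pc):
        return pc % BTB_ENTRIES

    def predict(self, pc):
        """Return the predicted target if predicted taken, else None."""
        i = self._index(pc)
        predicted_taken = self.counters.get(i, 1) >= 2
        return self.btb.get(i) if predicted_taken else None

    def update(self, pc, taken, target):
        """Train the counter with the real outcome and remember the target."""
        i = self._index(pc)
        count = self.counters.get(i, 1)
        self.counters[i] = min(3, count + 1) if taken else max(0, count - 1)
        if taken:
            self.btb[i] = target


def run_loop(predictor, pc, target, iterations):
    """Count mispredictions for a loop branch taken on all but the last pass."""
    mispredicts = 0
    for i in range(iterations):
        taken = i < iterations - 1
        guess = predictor.predict(pc)
        if (guess is not None) != taken:
            mispredicts += 1
        predictor.update(pc, taken, target)
    return mispredicts


predictor = BranchPredictor()
cold = run_loop(predictor, pc=0x4000, target=0x3FF0, iterations=10)
warm = run_loop(predictor, pc=0x4000, target=0x3FF0, iterations=10)
print(cold, warm)
```

The saturating counter is what keeps a single loop exit from flipping the prediction: on the second run through the loop, only the final exit is mispredicted.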
Overall Conclusions About the Athlon Microarchitecture
To prevent this article from becoming interminably long, we have to gloss over many features of the Athlon architecture,
and undoubtedly several features will change as new versions are introduced. The main conclusion is that Athlon is a more
traditional, speculative, out-of-order machine and requires fewer pipeline stages than the Pentium 4. At the same clock
rate, Athlon should perform better than Pentium 4 on many of today's mainstream applications. The actual comparison
ratio would depend on how well the P4's SSE2 instructions are being used, how well the P4's branch predictors and
pre-fetchers are working, and how well the system/memory bus is being utilized. Memory bandwidth-intensive applications
favor the P4 today. There is a lot of room for optimizing code to match the microarchitecture, and both AMD and Intel are
working with software developers to tune the applications. We look forward to seeing what enhancements AMD delivers
with Palomino.
Centaur C3 Microarchitecture

Even though VIA/Centaur doesn't have the same market share as Intel and AMD, they have an experienced design team
and some interesting architectural innovations. This architecture also makes a nice contrast with the Intel and AMD
approaches, since Centaur has been able to stay with an in-order pipeline and still achieve good performance. The
Centaur chips use the same P6 system bus and Socket 370 motherboards.

A great cost advantage for C3 is its diminutive size: only 52 sq mm in its 0.18-micron process. This compares to 120 sq mm
for Athlon and 217 sq mm for P4. Also, the fastest C3 today at 800MHz consumes a very modest 17.4 watts max at 1.9V,
with typical power measured at 10.4 watts. This is much more energy-efficient than Athlon and P4.
Improving the Memory Subsystem to Solve the Key Problems
There are some philosophical differences of opinion on how best to spend the limited transistor budget, especially for
architectures specifically designed for lower cost and power. Intel and AMD are battling for the high-end where the fastest
CPUs command a price premium. They can tolerate the expense of larger die sizes and more thermally-effective packages
and heat sinks. However, when the goal of maximum performance drops to a number 2 or 3 slot behind power and cost,
then different design choices are made.
Up until now, Intel and AMD have made slight modifications to their high-performance architectures to address these other
markets. As the markets bifurcate further, AMD and Intel may introduce parts with microarchitectures that are more
optimized for power and cost.
Centaur Uses Cache Design to Directly Deal with Latency
VIA (Centaur) has made early design choices to target the low-cost markets. Centaur has stressed the value of optimizing
the memory subsystem to solve the key problems of memory latency. If you're constraining your die size to reduce cost,
then many processor designers feel it's often a better trade-off to use those transistors in the memory subsystem.
Centaur's chip architects believe that their large L1 caches (four-way, 64KB each) give them a better performance return
than if they had used the die area (and design time) to more aggressively reschedule instructions in the pipeline. If latency
is the key problem, then clever cache design is a direct way to address it. The figure below shows the block diagram of the
Centaur processor. The Cyrix name has recently been dropped, and this product is marketed as the "VIA C3" (internally
referred to as C5B).

Decoupling the Pipeline to Reduce Instruction Blockage
Even with a pipeline that processes instructions in-order, it is possible to solve many of the key design problems by
allowing the different pipeline stages to process groups of instructions. At various stages of the pipeline, instructions are
queued up while waiting for resources to become available. In such a "decoupled architecture", an in-order machine like the
Centaur C3 processor will have the same performance as the out-of-order approach we've described, as long as no
instructions block the pipeline. If a block occurs at a later stage of the pipeline, the in-order machine continues to fill queues
earlier in the pipeline while waiting for dependencies to clear. It can then proceed again at full speed to drain the queues.
This is somewhat analogous to the reservation stations in the out-of-order architectures. As Centaur continues to refine
their architecture, they plan to further decouple the pipeline by adding queues to other stages and execution units.
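A small simulation shows why the queues matter. The Python sketch below is our own toy model, not Centaur's pipeline: it has just two stages (a front end that delivers one instruction per cycle and a back end that consumes one per cycle), and the stall cycles are made-up numbers standing in for an instruction cache miss and a data cache miss.

```python
def run_pipeline(n_instructions, front_stalls, back_stalls, queue_depth):
    """Return the cycles needed to finish n_instructions in a two-stage,
    in-order pipeline with a queue of queue_depth entries between stages."""
    produced = consumed = queued = cycle = 0
    while consumed < n_instructions:
        # Back end: take the oldest queued instruction unless it is stalled.
        if cycle not in back_stalls and queued > 0:
            queued -= 1
            consumed += 1
        # Front end: deliver the next instruction if there is room in the queue.
        if (cycle not in front_stalls and produced < n_instructions
                and queued < queue_depth):
            queued += 1
            produced += 1
        cycle += 1
    return cycle


front_stalls = {10, 11, 12}  # hypothetical front-end stall (e.g. I-cache miss)
back_stalls = {4, 5, 6}      # hypothetical back-end stall (e.g. D-cache miss)

coupled = run_pipeline(20, front_stalls, back_stalls, queue_depth=1)
decoupled = run_pipeline(20, front_stalls, back_stalls, queue_depth=4)
print(coupled, decoupled)
```

With a single-entry latch, both stalls are paid in full. With a four-entry queue, the front end keeps filling the queue during the back-end stall, and the back end later drains it while the front end is stalled, so the second stall is hidden.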
Super-Pipelining an In-Order Microarchitecture
The 12 stages of the C3 pipeline are shown on the right-hand side of the block diagram in figure 13. By now, you're
probably able to easily identify what happens in each stage. Instructions are fetched from the large I-cache and then
pre-decoded (without needing extra pre-decode bits stored in the cache). The decoder works by first translating x86
instructions into an interim x86 format and placing them into a five-deep buffer, at which point enough is known about
branches to enable static prediction.
From this buffer, the interim instructions are translated into micro-instructions, either directly or from a microcode ROM.
The micro-instructions are queued again before passing through the final decoder where they also receive any data from
registers. From there, the instructions are dispatched to the appropriate execution unit, unless they require access to the
data cache.
Note that this pipeline has the Data Access stages before execution, much different from our computer model. We'll talk
about the implications in a moment. The floating-point units are not designed for the highest performance, since they run at
half the pipeline frequency and are not fully pipelined (a new FP instruction starts every other cycle). After the execution
stage, all instructions proceed through a "Store-Branch" stage before the result registers are updated in the final pipeline
stage. Note that the C3 supports MMX and 3DNow! instructions.
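The cost of an FP unit that is not fully pipelined can be estimated with simple arithmetic. In this hedged Python sketch, the 4-cycle latency is a made-up figure (the text gives no latency), and we model only the "new instruction every other cycle" restriction for a run of independent FP instructions.

```python
def fp_cycles(n_instructions, latency, issue_interval):
    """Cycles to finish n independent FP instructions.

    latency: cycles for one instruction from start to result (ASSUMED value).
    issue_interval: cycles between instruction starts (1 = fully pipelined).
    """
    if n_instructions == 0:
        return 0
    return latency + (n_instructions - 1) * issue_interval


fully_pipelined = fp_cycles(100, latency=4, issue_interval=1)
every_other_cycle = fp_cycles(100, latency=4, issue_interval=2)
print(fully_pipelined, every_other_cycle)
```

For long runs of independent FP work, the issue interval dominates the latency, so a unit that starts an instruction every other cycle delivers roughly half the throughput.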
Breaking Our Simple Load/Store Computer Model
During the Store-Branch stage, a couple of interesting things occur. If a branch instruction is incorrectly predicted, the new
target address is sent to the I-cache in this stage. The other operation is to move Store data into a store buffer. Since an
instruction has to pass through this pipeline stage anyway, Centaur was able to directly implement the common Load-ALU
and Load-ALU-Store instructions as single micro-instructions that execute in a single cycle (with data required to be loaded
before the execute stage).
This completely removes the extra Load and Store instructions from the instruction stream (as found in other current x86
processors following internal RISC principles), speeding up execution time for these operations. No other modern x86
processor has this interesting twist to the microarchitecture. It also has the unfortunate side effect of complicating our
original, simple model of a computer pipeline, since this is a register-memory operation.
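The difference is easiest to see by counting internal operations for a single x86 instruction that adds a register to a memory location. The sketch below is purely illustrative: the function names and the internal mnemonics are invented for this example and are not the real micro-instruction formats of any processor.

```python
def risc_style_ops(mem, reg):
    """Internal RISC-style translation: separate load, ALU and store operations."""
    return [
        f"load  tmp, [{mem}]",
        f"add   tmp, tmp, {reg}",
        f"store [{mem}], tmp",
    ]


def c3_style_ops(mem, reg):
    """C3-style translation: one Load-ALU-Store micro-instruction."""
    return [f"add   [{mem}], {reg}"]


risc_ops = risc_style_ops("counter", "eax")
c3_ops = c3_style_ops("counter", "eax")
print(len(risc_ops), "operations versus", len(c3_ops))
```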
A Sophisticated Branch Prediction Mechanism
Since the C3 pipeline is fairly deep (P4's pipeline has changed our perspective), good branch prediction becomes quite
important. (That person in the back of the room is going to love this discussion, since Centaur uses every trick and invents
some more.) Centaur takes the interesting approach of directly calculating the target for unconditional branches that use a
displacement value (to an offset address). The designers decided that including a special adder early in the pipeline was
better than relying on a Branch Target Buffer for these instructions (about 95% of all branches). Obviously, directly
calculating the address will always give the correct target address, whereas the BTB may not always contain the target address.
For conditional branches, Centaur used the G-Share algorithm we described earlier. This uses a 13-bit Global Branch
History that is XOR'd with the branch instruction address (an exclusive-OR of each pair of bits returns a 1 if ONLY one
input bit is a 1). The result indexes into the Branch History Buffer to look up the prediction of the branch. Centaur also uses
the "agrees-mode" enhancement to encode a (single) bit that indicates whether the table look-up agrees with the static
predictor. They also have another 4K-entry table that selects which predictor (simple or history-based) to use for a
particular branch (based on the previous behavior of the branch). Basically, Centaur uses a static predictor and two
different dynamic predictors, as well as a predictor to select which type of dynamic predictor to use. To that person in the
back of the room, if you'd like to know more, check out Centaur's patent filings. A future ExtremeTech article will focus
specifically on branch prediction methods.
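The core of the G-Share lookup fits in a few lines of Python. This is a minimal sketch under our own assumptions, not Centaur's implementation: the 13-bit history and the XOR indexing come from the text, while the 2-bit counters, their initial state, and all names are hypothetical. The agrees-mode bit, the static predictor, and the selector table are deliberately left out.

```python
HISTORY_BITS = 13                # from the text
TABLE_SIZE = 1 << HISTORY_BITS   # 8192 entries
MASK = TABLE_SIZE - 1


class GShare:
    def __init__(self):
        self.history = 0                # global history of recent branch outcomes
        self.table = [1] * TABLE_SIZE   # 2-bit counters (ASSUMED), weakly not-taken

    def _index(self, pc):
        # XOR the branch address with the global history, keep the low 13 bits.
        return (pc ^ self.history) & MASK

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the real outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & MASK


# A branch that alternates taken / not-taken defeats a simple per-branch
# counter, but the history makes the pattern predictable.
predictor = GShare()
late_mispredicts = 0
for i in range(200):
    taken = (i % 2 == 0)
    if i >= 100 and predictor.predict(0x1234) != taken:
        late_mispredicts += 1
    predictor.update(0x1234, taken)
print("mispredictions after warm-up:", late_mispredicts)
```

Because the history differs depending on whether the previous outcome was taken, the same branch lands on two different counters, and each one learns its half of the pattern.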
Overall Conclusions About the Centaur Architecture
This microarchitecture has some interesting innovations that are made possible by staying with an in-order pipeline and
focusing on low-cost, single-processor systems. While these microarchitectural features are interesting, our analysis
doesn't draw any conclusions about performance (except to note the half-speed FP unit). The performance will depend on
the type of applications, and a CPU that is optimized for cost should really be viewed at the system level. If cost is a
primary concern, then the entire system needs to be configured with the minimum hardware required to acceptably run the
applications you care about. Stay tuned to ExtremeTech for benchmarks of these budget PCs.
Overall Conclusions

This ends our journey through the strange world inside modern CPUs. We started from basic concepts and went very rapidly
through a lot of complicated stuff. We hope you didn't have too much trouble digesting it all at one sitting. As we stated at
the very beginning, the details about microarchitecture are only interesting to CPU architects and hard-core PC technology
enthusiasts. As you've learned, the designers have made several trade-offs, and they've been forced to optimize for
certain types of applications. If those applications are important to you, then check out the appropriate benchmarks running
on real systems. In that way, the CPU microarchitecture can be analyzed in the context of the entire PC system.

The Future of PC Microarchitectures
It used to be easy to forecast the sort of microarchitectural features coming to PC processors. All one had to do was look at
high-end RISC chips or large computer systems. Well, most of the high-end design techniques have already made their
way into the PC processor world, and to go forward will require new innovation by the PC CPU vendors.
Teaching an Old Dog New Tricks
One interesting trend is to return to older approaches that were not previously viable for the mainstream. The most
noteworthy example is "Very Long Instruction Word (VLIW)" architectures. This is what is referred to as an "exposed
pipeline" where the compiler must specifically encode separate instructions for each parallel operation in advance of
execution. This is much different than forcing the processor to dynamically schedule instructions while it is running.
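As a rough illustration of the compiler's job, the Python sketch below packs operations into fixed-width bundles ahead of time. It is a simple greedy scheduler under our own assumptions (every operation takes one cycle, only true data dependencies matter, and all names are hypothetical), not the algorithm of any real VLIW compiler.

```python
def schedule_vliw(ops, deps, width):
    """Pack ops into bundles of at most `width` parallel operations.

    ops: operation names in program order.
    deps: dict mapping an op to the set of earlier ops whose results it needs.
    """
    bundle_of = {}
    bundles = []
    for op in ops:
        # An op must go in a later bundle than everything it depends on.
        earliest = 0
        for producer in deps.get(op, ()):
            earliest = max(earliest, bundle_of[producer] + 1)
        # Find the first bundle at or after `earliest` with a free slot.
        slot = earliest
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(op)
        bundle_of[op] = slot
    return bundles


ops = ["a", "b", "c", "d", "e"]
deps = {"c": {"a", "b"}, "d": {"c"}}   # e is independent of the others
bundles = schedule_vliw(ops, deps, width=2)
print(bundles)
```

In this example five operations fit into three bundles. All of that parallelism was found before the program ran, so the hardware simply executes each bundle as encoded.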
The key enabler is that compiler technology has improved dramatically, and a VLIW architecture makes the compiler do
more of the work for discovering and exploiting instruction-level-parallelism. Transmeta has implemented an internal VLIW
architecture for their low-power Crusoe CPUs, counting on their software morphing technology to exploit the parallel
architecture. Intel's new 64-bit "Itanium" architecture uses a version of VLIW, but it has been slow to get to market. It will be
several years before enough interesting desktop applications can be ported to Itanium and make it a mainstream desktop
AMD Plans to Hammer Its Way into the High End of the Market
Instead of counting on new compilers and the willingness of software developers to support a radically-new architecture
(like Itanium), AMD is evolving the x86 instruction set to support full 64-bit processing. With a 64-bit architecture, the
"Hammer" series of processors will be better at working on very large problems that require more addressing space
(servers and workstations). There will also be a performance gain for some applications, but the real focus will be support
for large, multi-processor systems. Eventually, the Hammer family could make its way down into the mainstream desktop.
Still Some Features to Copy From RISC
Some new RISC chips have an interesting and exciting feature that hasn't yet made its way into the PC space. Called
"Simultaneous Multithreading (SMT)", this approach duplicates all the registers and swaps register sets whenever a
"thread" comes to a long-latency operation. A thread is just an independent instruction sequence, whether explicitly
defined in a single program or part of a completely different process. This is how multi-processing works with advanced
operating systems, dispatching threads to different processors. Imagine that future CPUs may take thousands of pipeline
cycles for a main memory load.
In an SMT machine, rather than have a processor sit idle while waiting for data from memory, it could just "context switch"
to a different register set and run code from the different thread. The more sets of registers, the more simultaneous threads
the CPU could switch between. It is rumored that Intel's new Xeon processor, based on the P4 core, actually has SMT
capability built in but not yet enabled.
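A back-of-the-envelope model shows how extra register sets could hide memory latency. The Python sketch below rests on several simplifying assumptions of our own: every thread alternates between a fixed burst of computation and a fixed memory wait, switching register sets costs nothing, and the cycle counts are made-up numbers chosen to match the "thousands of cycles" scenario above.

```python
def utilization(compute_cycles, memory_cycles, register_sets):
    """Fraction of cycles the execution units stay busy.

    Each thread computes for compute_cycles, then waits memory_cycles.
    With several register sets, other threads run during the wait.
    """
    single_thread = compute_cycles / (compute_cycles + memory_cycles)
    return min(1.0, register_sets * single_thread)


# Hypothetical workload: 100 cycles of work per 2000-cycle memory load.
one_thread = utilization(100, 2000, register_sets=1)
four_threads = utilization(100, 2000, register_sets=4)
many_threads = utilization(100, 2000, register_sets=25)
print(round(one_thread, 3), round(four_threads, 3), many_threads)
```

With one thread the processor in this model is busy less than 5% of the time. Each added register set raises that figure until the execution units are saturated.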
Integration and a Change in Focus
Most of the recent architectural innovation has been directed at performing better on media-oriented tasks. Instead of just
adding instructions for media processing, why not create a media processor that can also handle x86 instructions? A
media processor is a class of CPU that is optimized for processing multiple streams of timing-critical media data.
The shift in focus from "standard" x86 processing will become even more likely as CPUs are more tightly-integrated with
graphics, video, sound and communications subsystems. It's unlikely that vendors would market their products as
x86-compatible media processors, rather than just advanced x86 processors, but the shift in design focus is already
Getting Comfortable with Complexity
In all too short a time, even these forthcoming technologies will seem like simple designs. We'll soon find it humorous that
we thought a GHz processor was a fast chip. We'll eventually consider it quaint that most computers used only a single
processor, since we could be working on machines with hundreds of CPUs on a chip. Someday we might be forced to pore
through complicated descriptions of the physics of optical processing. We can easily imagine down the road that some
people will long for the simple days when our computers could send data with metal traces on the chips or circuit boards.
In closing, if you've made it all the way through this article, you agree with that enthusiastic person in the back of the room.
As PC technology enthusiasts, our hobby will just get better and better. These complex new technologies will open up yet
more worlds for our discovery, and we'll be inspired to explore every new detail.
List of References

References and Suggestions for Further Reading:

   1. Computer Architecture: A Quantitative Approach, 2nd Edition. Morgan Kaufmann Publishers. Written by Hennessy
       & Patterson. This is a great book and is a collaboration between John Hennessy (the Stanford professor who
       helped create the MIPS architecture) and Dave Patterson (the Berkeley professor who helped create the SPARC
       architecture).

   2. Pentium Pro and Pentium II System Architecture, 2nd Edition. Mindshare, Inc. Written by Tom Shanley. This book
       is slightly out of date, but Tom does a great job of exposing extra details that aren't part of Intel's official

   3. The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal, First Quarter 2001.
       Written by Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, and Patrice
       Roussel of Intel Corporation. This is a surprisingly-detailed look at the Pentium 4 microarchitecture and design

   4. Other Intel links:
           o   ftp://download.intel.com/pentium4/download/netburstdetail.pdf
           o   ftp://download.intel.com/pentium4/download/nextgen.pdf
           o   ftp://download.intel.com/pentium4/download/netburst.pdf

   5. AMD Athlon Processor x86 Code Optimization.
       http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf Appendix A of this document has an excellent
       walk-through of the Athlon microarchitecture.

   6. Other AMD links:
           o   http://www.amd.com/products/cpg/athlon/techdocs/index.html

   7. Other Centaur Links:
           o   http://www.viatech.com
           o   http://www.centtech.com
