                        Fundamentals of Computer design
A) Introduction:
        Computer technology has made incredible progress in the roughly 55 years since the
first general-purpose electronic computer was created. This rapid rate of improvement has
come both from advances in the technology used to build computers and from innovation in
computer design. Although technological improvements have been fairly steady, progress
arising from better computer architectures has been much less consistent.
        In about 1970, computer designers became largely dependent upon integrated circuit
technology. During the 1970s, performance continued to improve at about 25% to 30% per
year for the mainframes and minicomputers that dominated the industry. The late 1970s saw
the emergence of the microprocessor, which improved faster still: roughly 35% growth per year in
performance. Two changes in the marketplace then made it easier to succeed commercially with
a new architecture. First, the virtual elimination
of assembly language programming reduced the need for object-code compatibility. Second,
the creation of standardized, vendor-independent operating systems, such as UNIX and its
clone, Linux, lowered the cost and risk of bringing out a new architecture.
        These changes made it possible to successfully develop a new set of architectures,
called RISC (Reduced Instruction Set Computer) architectures, in the early 1980s. The RISC-
based machines focused the attention of designers on two critical performance techniques, the
exploitation of instruction-level parallelism (initially through pipelining and later through
multiple instruction issue) and the use of caches (initially in simple forms and later using more
sophisticated organizations and optimizations).
        The effect of this dramatic growth rate has been twofold. First, it has significantly
enhanced the capability available to computer users. Second, this dramatic rate of
improvement has led to the dominance of microprocessor-based computers across the entire
range of computer design. In the last few years, the tremendous improvement in
integrated circuit capability has allowed older less-streamlined architectures, such as the x86
(or IA-32) architecture, to adopt many of the innovations first pioneered in the RISC designs.

B) The Changing Face of Computing and the Task of the Computer Designer:
         In the 1960s, the dominant form of computing was on large mainframes, machines
costing millions of dollars and stored in computer rooms with multiple operators overseeing
their support. Typical applications included business data processing and large-scale
scientific computing. The 1970s saw the birth of the minicomputer, a smaller sized machine
initially focused on applications in scientific laboratories, but rapidly branching out as the
technology of timesharing, multiple users sharing a computer interactively through
independent terminals, became widespread.
         The 1980s saw the rise of the desktop computer based on microprocessors, in the
form of both personal computers and workstations. The individually owned desktop computer
replaced timesharing and led to the rise of servers, computers that provided larger-scale
services such as: reliable, long-term file storage and access, larger memory, and more
computing power. The 1990s saw the emergence of the Internet and the world-wide web, the
first successful handheld computing devices (personal digital assistants or PDAs), and the
emergence of high-performance digital consumer electronics, varying from video games to
set-top boxes.
         These changes in computer use have led to three different computing markets each
characterized by different applications, requirements, and computing technologies.
1. Desktop Computing:
        The first, and still the largest market in dollar terms, is desktop computing. Desktop
computing spans from low-end systems that sell for under $1,000 to high end, heavily-
configured workstations that may sell for over $10,000. Throughout this range in price and
capability, the desktop market tends to be driven to optimize price-performance. As a result
desktop systems often are where the newest, highest performance microprocessors appear, as
well as where recently cost-reduced microprocessors and systems appear first.
        Desktop computing also tends to be reasonably well characterized in terms of
applications and benchmarking, though the increasing use of web-centric, interactive
applications poses new challenges in performance evaluation.

2. Servers:
         The emergence of the world-wide web accelerated the growth of the server market due to the tremendous
growth in demand for web servers and the growth in sophistication of web-based services.
Such servers have become the backbone of large-scale enterprise computing replacing the
traditional mainframe. For servers, different characteristics are important.
         First, availability is critical. Availability means that the system can reliably and
effectively provide a service.
         A second key feature of server systems is an emphasis on scalability. Server systems
often grow over their lifetime in response to a growing demand for the services they support
or an increase in functional requirements. Thus, the ability to scale up the computing
capacity, the memory, the storage, and the I/O bandwidth of a server is crucial.
         Lastly, servers are designed for efficient throughput. That is, the overall performance
of the server, in terms of transactions per minute or web pages served per second, is what is
crucial.

3. Embedded Computers:
        Embedded computers, the name given to computers lodged in other devices where the
presence of the computer is not immediately obvious, are the fastest growing portion of the
computer market. The range of application of these devices goes from simple embedded
microprocessors that might appear in everyday machines (most microwaves and washing
machines, most printers, most networking switches, and all cars contain such
microprocessors) to handheld digital devices (such as palmtops, cell phones, and smart cards)
to video games and digital set-top boxes. Although in some applications (such as palmtops)
the computers are programmable, in many embedded applications the only programming
occurs in connection with the initial loading of the application code or a later software
upgrade of that application.
        As in other computing applications, software costs are often a large factor in total cost
of an embedded system. Embedded computers have the widest range of processing power and
cost. Performance requirements do exist, of course, but the primary goal is often meeting the
performance need at a minimum price, rather than achieving higher performance at a higher
price. Two other key characteristics exist in many embedded applications: the need to
minimize memory and the need to minimize power.
        Larger memories also mean more power, and optimizing power is often critical in
embedded applications. Although the emphasis on low power is frequently driven by the use
of batteries, the need to use less expensive packaging (plastic versus ceramic) and the
absence of a fan for cooling also limit total power consumption. In practice, embedded
problems are usually solved by one of three approaches:
        1. Using a combined hardware/software solution that includes some custom hardware
          and typically a standard embedded processor,
       2. Using custom software running on an off-the-shelf embedded processor, or
       3. Using a digital signal processor and custom software.

The Task of a Computer Designer:
        The task the computer designer faces is a complex one: Determine what attributes are
important for a new machine, then design a machine to maximize performance while staying
within cost and power constraints. This task has many aspects, including instruction set
design, functional organization, logic design, and implementation. The implementation may
encompass integrated circuit design, packaging, power, and cooling.
        The implementation of a machine has two components: organization and hardware.
The term organization includes the high-level aspects of a computer’s design, such as the
memory system, the bus structure, and the design of the internal CPU (central processing
unit—where arithmetic, logic, branching, and data transfer are implemented). Hardware is
used to refer to the specifics of a machine, including the detailed logic design and the
packaging technology of the machine. Often a line of machines contains machines with
identical instruction set architectures and nearly identical organizations, but they differ in the
detailed hardware implementation.
        Computer architects must design a computer to meet functional requirements as well
as price, power, and performance goals. Once a set of functional requirements has been
established, the architect must try to optimize the design. In addition to performance and cost,
designers must be aware of important trends in both the implementation technology and the
use of computers.

[Figure omitted: summary of some of the most important functional requirements an architect faces.]
C) Technology Trends:
       If an instruction set architecture is to be successful, it must be designed to survive
rapid changes in computer technology. To plan for the evolution of a machine, the designer
must be especially aware of rapidly occurring changes in implementation technology. Four
implementation technologies, which change at a dramatic pace, are critical to modern
implementations:

Integrated circuit logic technology—Transistor density increases by about 35% per year,
quadrupling in somewhat over four years. Increases in die size are less predictable and
slower, ranging from 10% to 20% per year. The combined effect is a growth rate in transistor
count on a chip of about 55% per year.

Semiconductor DRAM (dynamic random-access memory)—Density increases by between
40% and 60% per year, quadrupling in three to four years. Cycle time has improved very
slowly, decreasing by about one-third in 10 years. Bandwidth per chip increases about twice
as fast as latency decreases.

Magnetic disk technology—Recently, disk density has been improving by more than 100%
per year, quadrupling in two years. Prior to 1990, density increased by about 30% per year,
doubling in three years. It appears that disk technology will continue the faster density
growth rate for some time to come. Access time has improved by one-third in 10 years.

Network technology—Network performance depends both on the performance of switches
and on the performance of the transmission system. Both latency and bandwidth can be
improved, though recently bandwidth has been the primary focus.
        These rapidly changing technologies impact the design of a microprocessor that may,
with speed and technology enhancements, have a lifetime of five or more years.
Traditionally, cost has decreased very closely to the rate at which density increases. Although
technology improves fairly continuously, the impact of these improvements is sometimes
seen in discrete leaps, as a threshold that allows a new capability is reached.
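
        The growth rates quoted above can be checked with compound-growth arithmetic. The
following is a minimal sketch, not part of the original notes; the function names are
illustrative assumptions. It computes how long a quantity growing at a fixed annual rate
takes to reach a given multiple, and how two independent growth rates combine.

```python
import math


def years_to_multiply(rate_per_year, factor):
    """Years needed to grow by `factor` at a compound annual rate (0.35 means 35%)."""
    return math.log(factor) / math.log(1 + rate_per_year)


def combined_rate(rate_a, rate_b):
    """Annual growth rate of a product of two quantities that each grow annually."""
    return (1 + rate_a) * (1 + rate_b) - 1


# Transistor density at 35% per year: quadrupling takes somewhat over four years.
print(f"density x4 at 35%/yr: {years_to_multiply(0.35, 4):.1f} years")

# Disk density at 100% per year quadruples in two years.
print(f"disk x4 at 100%/yr:   {years_to_multiply(1.00, 4):.1f} years")

# Density growth (35%) combined with die-size growth (10% to 20%) brackets
# the quoted 55% per year growth in transistor count.
low = combined_rate(0.35, 0.10)
high = combined_rate(0.35, 0.20)
print(f"transistor count growth: {low:.0%} to {high:.0%} per year")
```

The same two helpers apply to the DRAM figures: at 40% to 60% per year, quadrupling takes
roughly three to four years.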

Scaling of Transistor Performance, Wires, and Power in Integrated Circuits:
        Integrated circuit processes are characterized by the feature size, which is the
minimum size of a transistor or a wire in either the x or y dimension. Feature sizes have
decreased from 10 microns in 1971 to 0.18 microns in 2001. As feature sizes shrink, devices
shrink quadratically in the horizontal dimensions and also shrink in the vertical dimension.
Density improvements have supported the introduction of 64-bit microprocessors as well as
many of the innovations in pipelining and caches.
        Although transistors generally improve in performance with decreased feature size,
wires in an integrated circuit do not. In particular, the signal delay for a wire increases in
proportion to the product of its resistance and capacitance. Of course, as feature size shrinks,
wires get shorter, but the resistance and capacitance per unit length get worse.
        In the past few years, wire delay has become a major design limitation for large
integrated circuits and is often more critical than transistor switching delay. Larger and larger
fractions of the clock cycle have been consumed by the propagation delay of signals on wires.
Power also provides challenges as devices are scaled.
        The dynamic power required per transistor is proportional to the product of the load
capacitance of the transistor, the frequency of switching, and the square of the voltage. As we
move from one process to the next, the increase in the number of transistors switching and
the frequency with which they switch dominates the decrease in load capacitance and
voltage, leading to an overall growth in power consumption.
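
        The proportionality described above can be written compactly. The symbols below
(C for load capacitance, V for voltage, f for switching frequency, N for the number of
switching transistors) are notation assumed here for illustration, and the scaling factors
in the example are hypothetical values, not measurements.

```latex
P_{\text{transistor}} \;\propto\; C \cdot V^{2} \cdot f
\qquad\Longrightarrow\qquad
P_{\text{chip}} \;\propto\; N \cdot C \cdot V^{2} \cdot f

% Hypothetical process step: N doubles, f rises 1.5x, C falls to 0.7x, V falls to 0.9x
\frac{P_{\text{chip,new}}}{P_{\text{chip,old}}}
  = 2 \times 0.7 \times 0.9^{2} \times 1.5 \approx 1.7
```

Under these assumed factors, the gains in transistor count and frequency outweigh the
reductions in capacitance and voltage, so total power still grows.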

D) Cost, Price and their Trends:
        Although there are computer designs where costs tend to be less important—
specifically supercomputers—cost-sensitive designs are of growing importance. Indeed, in
the past 15 years, the use of technology improvements to achieve lower cost, as well as
increased performance, has been a major theme in the computer industry. Understanding of
cost and its factors is essential for designers to be able to make intelligent decisions about
whether or not a new feature should be included in designs where cost is an issue.

The Impact of Time, Volume, Commodification, and Packaging:
        The cost of a manufactured computer component decreases over time even without
major improvements in the basic implementation technology. The underlying principle that
drives costs down is the learning curve—manufacturing costs decrease over time. The
learning curve itself is best measured by change in yield— the percentage of manufactured
devices that survives the testing procedure. Whether it is a chip, a board, or a system, designs
that have twice the yield will have basically half the cost.
        Understanding how the learning curve will improve yield is key to projecting costs
over the life of the product. As an example of the learning curve in action, the price per
megabyte of DRAM drops over the long term by 40% per year. Since DRAMs tend to be
priced in close relationship to cost–with the exception of periods when there is a shortage–
price and cost of DRAM track closely. In fact, there are some periods in which it appears that
price is less than cost; of course, the manufacturers hope that such periods are both infrequent
and short.
        Between the start of a project and the shipping of a product, say two years, the cost of
a new DRAM drops by a factor of between five and ten in constant dollars. Microprocessor
prices also drop over time, but because they are less standardized than DRAMs, the
relationship between price and cost is more complex. In a period of significant competition,
price tends to track cost closely, although microprocessor vendors probably rarely sell at a
loss.
        Volume is a second key factor in determining cost. Increasing volumes affect cost in
several ways. First, they decrease the time needed to get down the learning curve, which is
partly proportional to the number of systems (or chips) manufactured. Second, volume
decreases cost, since it increases purchasing and manufacturing efficiency. As a rule of
thumb, some designers have estimated that cost decreases about 10% for each doubling of
volume.
        Commodities are products that are sold by multiple vendors in large volumes and are
essentially identical. There are a variety of vendors that ship virtually identical products and
are highly competitive. Of course, this competition decreases the gap between cost and
selling price, but it also decreases cost.
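
        The rule of thumb for volume quoted above (about 10% lower cost per doubling) can be
sketched as a small calculation. This is an illustration only: the function name, the base
cost, and the volumes are hypothetical, and real cost curves are not this regular.

```python
import math


def cost_at_volume(base_cost, base_volume, volume, drop_per_doubling=0.10):
    """Estimated unit cost after scaling volume, losing a fixed share per doubling."""
    doublings = math.log2(volume / base_volume)
    return base_cost * (1 - drop_per_doubling) ** doublings


# Hypothetical part: $100 per unit at 10,000 units.
for volume in (10_000, 20_000, 40_000, 80_000):
    print(f"{volume:>6} units: ${cost_at_volume(100.0, 10_000, volume):.2f}")
```

Three doublings (10,000 to 80,000 units) compound to 0.9 x 0.9 x 0.9, or about 27% lower
unit cost under this assumption.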

Cost of an Integrated Circuit:
        In an increasingly competitive computer marketplace where standard parts—disks,
DRAMs, and so on—are becoming a significant portion of any system’s cost, integrated
circuit costs are becoming a greater portion of the cost that varies between machines,
especially in the high-volume, cost-sensitive portion of the market.
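        Although the costs of integrated circuits have dropped exponentially, the basic
procedure of silicon manufacture is unchanged: A wafer is still tested and chopped into dies
that are packaged. Thus the cost of a packaged integrated circuit is:

$$\text{Cost of integrated circuit} = \frac{\text{Cost of die} + \text{Cost of testing die} + \text{Cost of packaging and final test}}{\text{Final test yield}}$$

       Predicting the number of good chips per wafer requires first learning
how many dies fit on a wafer and then learning how to predict the percentage of those that
will work. From there it is simple to predict the cost of a die:

$$\text{Cost of die} = \frac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}}$$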

        The most interesting feature of this first term of the chip cost equation is its sensitivity
to die size, shown below. The number of dies per wafer is basically the area of the wafer
divided by the area of the die. It can be more accurately estimated by:

$$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer diameter}/2)^2}{\text{Die area}} - \frac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}}$$

       The first term is the ratio of wafer area (πr²) to die area. The second compensates for
the “square peg in a round hole” problem: rectangular dies near the periphery of round
wafers. Dividing the circumference (πd) by the diagonal of a square die is approximately the
number of dies along the edge.

EXAMPLE: Find the number of dies per 30-cm wafer for a die that is 0.7 cm on a side.

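ANSWER: The total die area is 0.49 cm2. Thus

$$\text{Dies per wafer} = \frac{\pi \times (30/2)^2}{0.49} - \frac{\pi \times 30}{\sqrt{2 \times 0.49}} = \frac{706.9}{0.49} - \frac{94.2}{0.99} \approx 1347$$

       But this only gives the maximum number of dies per wafer. The critical question is,
What is the fraction or percentage of good dies on a wafer, or the die yield? A simple
empirical model of integrated circuit yield gives:

$$\text{Die yield} = \text{Wafer yield} \times \left(1 + \frac{\text{Defects per unit area} \times \text{Die area}}{\alpha}\right)^{-\alpha}$$

where wafer yield accounts for wafers that are completely bad and so need not be tested.
Defects per unit area is a measure of the random manufacturing defects that occur, and α is
a parameter that reflects manufacturing complexity. For
today’s multilevel metal CMOS processes, a good estimate is α = 4.0.

EXAMPLE: Find the die yield for dies that are 1 cm on a side and 0.7 cm on a side,
assuming a defect density of 0.6 per cm2.

ANSWER: The total die areas are 1 cm2 and 0.49 cm2. Taking wafer yield as 100%, for the
larger die the yield is

$$\text{Die yield} = \left(1 + \frac{0.6 \times 1}{4.0}\right)^{-4} = 0.57$$

For the smaller die, it is

$$\text{Die yield} = \left(1 + \frac{0.6 \times 0.49}{4.0}\right)^{-4} = 0.75$$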

        The bottom line is the number of good dies per wafer, which comes from multiplying
dies per wafer by die yield (which incorporates the effects of defects).
        The examples above predict about 366 good 1-cm2 dies from the 30-cm wafer and about 1014 good
0.49-cm2 dies. Given the tremendous price pressures on commodity products such as DRAM
and SRAM, designers have included redundancy as a way to raise yield. Obviously, the
presence of redundant entries can be used to significantly boost the yield.
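
        The die-count and yield estimates in this section can be reproduced with a short
script. This is a sketch under stated assumptions: it uses the commonly quoted
dies-per-wafer estimate (wafer area over die area, minus an edge-loss term) and the
empirical yield model with α = 4.0 and 100% wafer yield. The function names and the
wafer cost are hypothetical.

```python
import math


def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Wafer area / die area, minus dies lost along the round edge."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def die_yield(defects_per_cm2, die_area_cm2, alpha=4.0, wafer_yield=1.0):
    """Empirical yield model; alpha and wafer_yield defaults are assumptions."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


def good_dies_per_wafer(wafer_diameter_cm, die_area_cm2, defects_per_cm2):
    """Expected number of working dies on one wafer."""
    return dies_per_wafer(wafer_diameter_cm, die_area_cm2) * die_yield(
        defects_per_cm2, die_area_cm2
    )


def cost_per_good_die(wafer_cost, wafer_diameter_cm, die_area_cm2, defects_per_cm2):
    """Wafer cost spread over the working dies only."""
    return wafer_cost / good_dies_per_wafer(
        wafer_diameter_cm, die_area_cm2, defects_per_cm2
    )


HYPOTHETICAL_WAFER_COST = 5000.0  # dollars; illustrative value only

for side_cm in (1.0, 0.7):
    area = side_cm ** 2
    good = good_dies_per_wafer(30, area, 0.6)
    cost = cost_per_good_die(HYPOTHETICAL_WAFER_COST, 30, area, 0.6)
    print(f"{area:.2f} cm2 die: yield {die_yield(0.6, area):.2f}, "
          f"{math.floor(good)} good dies, ${cost:.2f} per good die")
```

Because both the die count and the yield fall as die area grows, the cost per good die
rises much faster than linearly with area.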

Cost Versus Price—Why They Differ and By How Much:
        The relationship between price and volume can increase the impact of changes in cost,
especially at the low end of the market. Furthermore, as volume decreases, costs rise, leading
to further increases in price. The categories that make up price can be shown either as a tax
on cost or as a percentage of the price.
        Direct costs refer to the costs directly related to making a product. These include
labor costs, purchasing components, scrap (the leftover from yield), and warranty, which
covers the costs of systems that fail at the customer’s site during the warranty period. Direct
cost typically adds 10% to 30% to component cost.
        The next addition is called the gross margin, the company’s overhead that cannot be
billed directly to one product. This can be thought of as indirect cost. It includes the
company’s research and development (R&D), marketing, sales, manufacturing equipment
maintenance, building rental, cost of financing, pretax profits, and taxes.
        When the component costs are added to the direct cost and gross margin, we reach the
average selling price—ASP in the language of MBAs—the money that comes directly to the
company for each product sold. The gross margin is typically 10% to 45% of the average
selling price, depending on the uniqueness of the product.
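
        The build-up from component cost to average selling price can be sketched
numerically. The function name and the dollar figures below are hypothetical. The only
relationships taken from the text are that direct cost is added as a percentage of
component cost and that gross margin is expressed as a share of the average selling price.

```python
def average_selling_price(component_cost, direct_cost_rate, gross_margin_share):
    """Return the price breakdown for one unit.

    direct_cost_rate:   direct cost as a fraction of component cost (e.g. 0.20)
    gross_margin_share: gross margin as a fraction of the selling price (e.g. 0.25)
    """
    direct_cost = component_cost * direct_cost_rate
    cost_before_margin = component_cost + direct_cost
    # Gross margin is a share of the price itself, so divide rather than multiply.
    asp = cost_before_margin / (1 - gross_margin_share)
    return {
        "component_cost": component_cost,
        "direct_cost": direct_cost,
        "gross_margin": asp - cost_before_margin,
        "average_selling_price": asp,
    }


# Hypothetical system: $500 of components, 20% direct cost, 25% gross margin.
for name, value in average_selling_price(500.0, 0.20, 0.25).items():
    print(f"{name:>22}: ${value:,.2f}")
```

Expressed as a "tax on cost", the same 25% margin corresponds to adding one third to the
$600 of component plus direct cost, which is why the two presentations give different
percentages for the same dollars.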
        Manufacturers of low-end PCs have lower gross margins for several reasons. First,
their R&D expenses are lower. Second, their cost of sales is lower, since they use indirect
distribution (by mail, the Internet, phone order, or retail store) rather than salespeople. Third,
because their products are less unique, competition is more intense, thus forcing lower prices
and often lower profits, which in turn lead to a lower gross margin.
        List price and average selling price are not the same. One reason for this is that
companies offer volume discounts, lowering the average selling price. The information above
suggests that a company uniformly applies fixed overhead percentages to turn cost into price,
and this is true for many companies.
        Large, expensive machines generally cost more to develop. Since large, expensive
machines generally do not sell as well as small ones, the gross margin must be greater on the
big machines for the company to maintain a profitable return on its investment. The issue of
cost and cost/performance is a complex one. There is no single target for computer designers.
At one extreme, high-performance design spares no cost in achieving its goal.
        At the other extreme is low-cost design, where performance is sacrificed to achieve
lowest cost. Between these extremes is cost/performance design, where the designer
balances cost versus performance.

[Figure omitted: the components of price.]
E) Measuring and Reporting Performance:
        The user of a desktop machine may say a computer is faster when a program runs in
less time, while the computer center manager running a large server system may say a
computer is faster when it completes more jobs in an hour. The computer user is interested in
reducing response time—the time between the start and the completion of an event—also
referred to as execution time. The manager of a large data processing center may be
interested in increasing throughput—the total amount of work done in a given time.
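        “X is n times faster than Y” will mean

$$\frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$

Since execution time is the reciprocal of performance, the following relationship holds:

$$n = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{1/\text{Performance}_Y}{1/\text{Performance}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y}$$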

       Whether we are interested in throughput or response time, the key measurement is
time: The computer that performs the same amount of work in the least time is the fastest.

Measuring Performance:
        Even execution time can be defined in different ways depending on what we count.
The most straightforward definition of time is called wall-clock time, response time, or
elapsed time, which is the latency to complete a task, including disk accesses, memory
accesses, input/output activities, operating system overhead—everything. With
multiprogramming the CPU works on another program while waiting for I/O and may not
necessarily minimize the elapsed time of one program. Hence we need a term to take this
activity into account. CPU time recognizes this distinction and means the time the CPU is
computing, not including the time waiting for I/O or running other programs. (Clearly the
response time seen by the user is the elapsed time of the program, not the CPU time.) CPU
time can be further divided into the CPU time spent in the program, called user CPU time,
and the CPU time spent in the operating system performing tasks requested by the program,
called system CPU time.
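
        The distinction between elapsed time, user CPU time, and system CPU time can be
observed directly. The sketch below uses Python's standard `time.perf_counter` and
`os.times`; the `measure` helper and the sample task are illustrative assumptions, with
`time.sleep` standing in for time spent waiting on I/O.

```python
import os
import time


def measure(task):
    """Run task() and return (elapsed, user_cpu, system_cpu) in seconds."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    task()
    cpu_end = os.times()
    wall_end = time.perf_counter()
    return (
        wall_end - wall_start,
        cpu_end.user - cpu_start.user,
        cpu_end.system - cpu_start.system,
    )


def busy_then_wait():
    sum(i * i for i in range(200_000))  # CPU-bound work, counted as user CPU time
    time.sleep(0.2)                     # waiting: adds elapsed time but no CPU time


elapsed, user_cpu, system_cpu = measure(busy_then_wait)
print(f"elapsed (wall-clock): {elapsed:.3f} s")
print(f"user CPU time:        {user_cpu:.3f} s")
print(f"system CPU time:      {system_cpu:.3f} s")
```

The elapsed time includes the waiting period, while the two CPU times do not, which is
the gap that multiprogramming fills by running another program.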
        The term system performance is used to refer to elapsed time on an unloaded
system, while CPU performance refers to user CPU time on an unloaded system. A user who
runs the same programs every day could evaluate a new system simply by comparing the
execution time of her workload, the mixture of programs and operating system commands
that users run on a machine. Most users
must rely on other methods to evaluate machines and often other evaluators, hoping that these
methods will predict performance for their usage of the new machine. There are five levels of
programs used in such circumstances, listed below in decreasing order of accuracy of
prediction.

1. Real applications—Examples are compilers for C, text-processing software like Word,
and other applications like Photoshop. Real applications have input, output, and options
that a user can select when running the program. There is one major downside to using real
applications as benchmarks: Real applications often encounter portability problems arising
from dependences on the operating system or compiler.

2. Modified (or scripted) applications—In many cases, real applications are used as the
building block for a benchmark either with modifications to the application or with a script
that acts as stimulus to the application. Applications are modified for two primary reasons: to
enhance portability or to focus on one particular aspect of system performance.

3. Kernels—Several attempts have been made to extract small, key pieces from real
programs and use them to evaluate performance. Livermore Loops and Linpack are the best
known examples. Unlike real programs, no user would run kernel programs, for they exist
solely to evaluate performance.

4. Toy benchmarks—Toy benchmarks are typically between 10 and 100 lines of code and
produce a result the user already knows before running the toy program. Programs like Sieve
of Eratosthenes, Puzzle, and Quicksort are popular.

5. Synthetic benchmarks—Similar in philosophy to kernels, synthetic benchmarks try to
match the average frequency of operations and operands of a large set of programs.
Whetstone and Dhrystone are the most popular synthetic benchmarks. Synthetic benchmarks
are, in fact, even further removed from reality than kernels because kernel code is extracted
from real programs, while synthetic code is created artificially to match an average execution
profile.
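
        To make the toy-benchmark category in the list above concrete, here is a minimal
sketch of a Sieve of Eratosthenes timed with a wall-clock timer. It is an illustration
written for these notes, not code from any benchmark suite. It fits the description
because the result (the count of primes) is known before the program runs.

```python
import time


def sieve(limit):
    """Return all primes <= limit using the Sieve of Eratosthenes."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    p = 2
    while p * p <= limit:
        if is_prime[p]:
            # Cross out every multiple of p, starting at p squared.
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
        p += 1
    return [n for n, flag in enumerate(is_prime) if flag]


start = time.perf_counter()
primes = sieve(100_000)
elapsed = time.perf_counter() - start
print(f"{len(primes)} primes found in {elapsed:.4f} s")
```

A program this small fits easily in a cache and exercises only a narrow slice of the
machine, which is why toy benchmarks predict real workload performance poorly.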

Benchmark Suites:
       Recently, it has become popular to put together collections of benchmarks to try to
measure the performance of processors with a variety of applications. A key
advantage of such suites is that the weakness of any one benchmark is lessened by the
presence of the other benchmarks. One of the most successful attempts to create standardized
benchmark application suites has been the SPEC (Standard Performance Evaluation
Corporation).

1. Desktop Benchmarks:
        Desktop benchmarks divide into two broad classes: CPU intensive benchmarks and
graphics intensive benchmarks. SPEC originally created a benchmark set focusing on CPU
performance (initially called SPEC89), which has evolved into its fourth generation: SPEC
CPU2000, which follows SPEC95 and SPEC92. SPEC CPU2000 consists of a set of
twelve integer benchmarks (CINT2000) and fourteen floating point benchmarks (CFP2000).
The SPEC benchmarks are real programs modified for portability and to minimize the role of
I/O in overall benchmark performance. The integer benchmarks vary from part of a C
compiler to a VLSI place and route tool to a graphics application. The floating point
benchmarks include code for quantum chromodynamics, finite element modeling, and fluid
dynamics. The SPEC CPU suite is useful for CPU benchmarking for both desktop systems
and single-processor servers.
        Although SPEC CPU2000 is aimed at CPU performance, two different types of
graphics benchmarks were created by SPEC: SPECviewperf is used for benchmarking
systems supporting the OpenGL graphics library, while SPECapc consists of applications
that make extensive use of graphics. SPECviewperf measures the 3D rendering performance
of systems running under OpenGL using a 3-D model and a series of OpenGL calls that
transform the model. SPECapc consists of runs of three large applications:
1. Pro/Engineer: a solid modeling application that does extensive 3-D rendering.
2. SolidWorks 99: a 3-D CAD/CAM design tool running a series of five tests varying from
I/O intensive to CPU intensive.
3. Unigraphics V15: The benchmark is based on an aircraft model and covers a wide
spectrum of Unigraphics functionality, including assembly, drafting, numeric control
machining, solid modeling, and optimization.

2. Server Benchmarks:
        Just as servers have multiple functions, so there are multiple types of benchmarks.
The simplest benchmark is perhaps a CPU throughput oriented benchmark. SPEC
CPU2000 uses the SPEC CPU benchmarks to construct a simple throughput benchmark
where the processing rate of a multiprocessor can be measured by running multiple copies
(usually as many as there are CPUs) of each SPEC CPU benchmark and converting the CPU
time into a rate. This leads to a measurement called the SPECRate.
        Other than SPECRate, most server applications and benchmarks have significant I/O
activity arising from either disk or network traffic, including benchmarks for file server
systems, for web servers, and for database and transaction processing systems. SPEC offers
both a file server benchmark (SPECSFS) and a web server benchmark (SPECWeb).
SPECSFS is a benchmark for measuring NFS (Network File System) performance using a
script of file server requests; it tests the performance of the I/O system (both disk and
network I/O) as well as the CPU. SPECSFS is a throughput oriented benchmark but with
important response time requirements. SPECWeb is a web-server benchmark that simulates
multiple clients requesting both static and dynamic pages from a server, as well as clients
posting data to the server.
        Transaction processing benchmarks measure the ability of a system to handle
transactions, which consist of database accesses and updates. The first TPC benchmark,
TPC-A, was published in 1985 and has since been replaced and enhanced by four different
benchmarks. TPC-C, initially created in 1992, simulates a complex query environment.
TPC-H models ad-hoc decision support meaning that the queries are unrelated and
knowledge of past queries cannot be used to optimize future queries; the result is that query
execution times can be very long. TPC-R simulates a business decision support system
where users run a standard set of queries. In TPC-R, pre-knowledge of the queries is taken for
granted and the DBMS can be optimized to run these queries. TPC-W is a web-based
transaction benchmark that simulates the activities of a business-oriented transactional web
server. All the TPC benchmarks measure performance in transactions per second.

3. Embedded Benchmarks:
        The enormous variety in embedded applications, as well as differences in
performance requirements (hard real-time, soft real-time, and overall cost-performance),
make the use of a single set of benchmarks unrealistic. For those embedded applications that
can be characterized well by kernel performance, the best standardized set of benchmarks
appears to be a new benchmark set: the EDN Embedded Microprocessor Benchmark
Consortium (or EEMBC–pronounced embassy). The EEMBC benchmarks fall into five
classes: automotive/industrial, consumer, networking, office automation, and
telecommunications.
[Figure omitted: the EEMBC benchmark suite.]

Reporting Performance Results:
        The guiding principle of reporting performance measurements should be
reproducibility: list everything another experimenter would need to duplicate the results. A
SPEC benchmark report requires a fairly complete description of the machine, the compiler
flags, as well as the publication of both the baseline and optimized results.
        A system’s software configuration can significantly affect the performance results for
a benchmark. For this reason, benchmarks are sometimes run in single-user mode to
reduce overhead. Additionally, operating system enhancements are sometimes made to
increase performance on the TPC benchmarks. Likewise, compiler technology can play a big
role in CPU performance.
        Another way to customize the software to improve the performance of a benchmark
has been through the use of benchmark-specific flags; these flags often caused
transformations that would be illegal on many programs or would slow down performance on
others. To restrict this process and increase the significance of the SPEC results, the SPEC
organization created a baseline performance measurement in addition to the optimized
performance measurement. In addition to the question of flags and optimization, another key
question is whether source code modifications or hand-generated assembly language are
allowed. There are four broad categories of approaches here:

1. No source code modifications are allowed. The SPEC benchmarks fall into this class, as
do most of the standard PC benchmarks.
2. Source code modifications are allowed, but are essentially difficult or impossible.
Benchmarks like TPC-C rely on standard databases, such as Oracle or Microsoft’s SQL
Server.
3. Source modifications are allowed. Several supercomputer benchmark suites allow
modification of the source code. EEMBC also allows source-level changes to its benchmarks
and reports these as “optimized” measurements, versus “out-of-the-box” measurements that
allow no changes.
4. Hand-coding is allowed. EEMBC allows assembly language coding of its benchmarks.
The small size of its kernels makes this approach attractive, although in practice with larger
embedded applications it is unlikely to be used, except for small loops.

Comparing and Summarizing Performance:
        Comparing performance of computers is rarely a dull event, especially when the
designers are involved. For example, two articles on summarizing performance in the same
journal took opposing points of view. Figure 1.15, taken from one of the articles, is an
example of the confusion that can arise.
Using our definition of faster than, the following statements hold:
        A is 10 times faster than B for program P1.
        B is 10 times faster than A for program P2.
        A is 20 times faster than C for program P1.
        C is 50 times faster than A for program P2.
        B is 2 times faster than C for program P1.
        C is 5 times faster than B for program P2.

               Execution times of two programs on three machines.

Total Execution Time: A Consistent Summary Measure
        The simplest approach to summarizing relative performance is to use total execution
time of the two programs. Thus
        B is 9.1 times faster than A for programs P1 and P2.
        C is 25 times faster than A for programs P1 and P2.
        C is 2.75 times faster than B for programs P1 and P2.
If the workload consisted of running programs P1 and P2 an equal number of times, the
statements above would predict the relative execution times for the workload on each
machine. An average of the execution times that tracks total execution time is the arithmetic
mean:

        $$\text{Arithmetic mean} = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$

where Time_i is the execution time for the ith program of a total of n in the workload.
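
        The comparison by total execution time can be checked with a short sketch. The figure with
the measured times is missing from this copy, so the values below are assumptions chosen to be
consistent with the ratios quoted in the text; the function names are illustrative only.

```python
# Execution times in seconds. ASSUMED values: the original figure is not
# available here, so these were chosen to match the ratios stated in the text.
times = {
    "A": {"P1": 1.0, "P2": 1000.0},
    "B": {"P1": 10.0, "P2": 100.0},
    "C": {"P1": 20.0, "P2": 20.0},
}


def total_time(machine):
    """Total execution time of the workload (each program run once)."""
    return sum(times[machine].values())


def arithmetic_mean(machine):
    """Arithmetic mean of the execution times; tracks total execution time."""
    values = list(times[machine].values())
    return sum(values) / len(values)


def times_faster(fast, slow):
    """How many times faster `fast` is than `slow` on total execution time."""
    return total_time(slow) / total_time(fast)


for fast, slow in [("B", "A"), ("C", "A"), ("C", "B")]:
    print(f"{fast} is {times_faster(fast, slow):.2f} times faster than {slow}")
```

Because every machine runs the same number of programs, comparing arithmetic means gives the same
ratios as comparing totals.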

Weighted Execution Time:
        Are programs P1 and P2 in fact run equally in the workload as assumed by the
arithmetic mean? If not, then there are two approaches that have been tried for summarizing
performance. The first approach when given an unequal mix of programs in the workload is
to assign a weighting factor w_i to each program to indicate the relative frequency of the
program in that workload. If, for example, 20% of the tasks in the workload were program P1
and 80% of the tasks in the workload were program P2, then the weighting factors would be
0.2 and 0.8. (Weighting factors add up to 1.) By summing the products of weighting factors
and execution times, a clear picture of performance of the workload is obtained. This is called
the weighted arithmetic mean:

        $$\text{Weighted arithmetic mean} = \sum_{i=1}^{n} w_i \times \text{Time}_i$$
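
        As a rough illustration, the weighted arithmetic mean for the 20%/80% mix mentioned above
can be computed as follows. The execution times are assumed values (the original figure is not
available in this copy), and the function name is hypothetical.

```python
def weighted_arithmetic_mean(weights, times):
    """Sum of weight_i * time_i; the weights must add up to 1."""
    if len(weights) != len(times):
        raise ValueError("need one weight per program")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weighting factors must add up to 1")
    return sum(w * t for w, t in zip(weights, times))


weights = [0.2, 0.8]  # 20% of tasks are P1, 80% are P2

# ASSUMED execution times in seconds for (P1, P2) on each machine.
machine_times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}

for name, program_times in machine_times.items():
    mean = weighted_arithmetic_mean(weights, program_times)
    print(f"machine {name}: weighted mean = {mean:.1f} s")
```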

Normalized Execution Time and the Pros and Cons of Geometric Means:
       A second approach to an unequal mixture of programs in the workload is to normalize
execution times to a reference machine and then take the average of the normalized execution
times. Average normalized execution time can be expressed as either an arithmetic or
geometric mean. The formula for the geometric mean is

        $$\text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$

where Execution time ratio_i is the execution time, normalized to the reference machine, for
the ith program of a total of n in the workload. Because the weightings in weighted arithmetic
means are set proportionate to execution times on a given machine, they are influenced not
only by frequency of use in the workload, but also by the peculiarities of a particular machine
and the size of program input. The geometric mean of normalized execution times, on the
other hand, is independent of the running times of the individual programs, and it doesn’t
matter which machine is used to normalize.
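
        The independence from the reference machine can be checked numerically. The sketch below
uses assumed execution times (the original figure is missing from this copy) and hypothetical
function names; it normalizes to each machine in turn and compares the resulting geometric means.

```python
import math

# ASSUMED execution times in seconds; not taken from the original figure.
times = {
    "A": {"P1": 1.0, "P2": 1000.0},
    "B": {"P1": 10.0, "P2": 100.0},
    "C": {"P1": 20.0, "P2": 20.0},
}


def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized_geomean(machine, reference):
    """Geometric mean of execution times normalized to `reference`."""
    ratios = [times[machine][p] / times[reference][p] for p in ("P1", "P2")]
    return geometric_mean(ratios)


for reference in times:
    b_vs_a = normalized_geomean("B", reference) / normalized_geomean("A", reference)
    c_vs_a = normalized_geomean("C", reference) / normalized_geomean("A", reference)
    print(f"normalized to {reference}: B/A = {b_vs_a:.3f}, C/A = {c_vs_a:.3f}")
```

The ratios come out the same for every reference machine. With these assumed times, A and B also
get equal geometric means even though their total execution times differ by about 9 times.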
        The strong drawback to geometric means of normalized execution times is that they
violate our fundamental principle of performance measurement—they do not predict
execution time. An additional drawback of using geometric mean as a method for
summarizing performance for a benchmark suite (as SPEC CPU2000 does) is that it
encourages hardware and software designers to focus their attention on the benchmarks
where performance is easiest to improve rather than on the benchmarks that are slowest. The
ideal solution is to measure a real workload and weight the programs according to their
frequency of execution.

F) Quantitative Principles of Computer Design:

 Make the Common Case Fast:
        Improving the frequent event, rather than the rare event, will obviously help
performance. In addition, the frequent case is often simpler and can be done faster than
the infrequent case. For example, when adding two numbers in the CPU, we can expect
overflow to be a rare circumstance and can therefore improve performance by optimizing the
more common case of no overflow.

Amdahl’s Law:
       Amdahl’s Law states that the performance improvement to be gained from using
some faster mode of execution is limited by the fraction of the time the faster mode can be
used. Amdahl’s Law defines the speedup that can be gained by using a particular feature.
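Speedup is the ratio

        $$\text{Speedup} = \frac{\text{Execution time for entire task without using the enhancement}}{\text{Execution time for entire task using the enhancement when possible}}$$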


       Speedup tells us how much faster a task will run using the machine with the
enhancement as opposed to the original machine. Amdahl’s Law gives us a quick way to find
the speedup from some enhancement, which depends on two factors:

1. The fraction of the computation time in the original machine that can be converted to
take advantage of the enhancement—For example, if 20 seconds of the execution time of a
program that takes 60 seconds in total can use an enhancement, the fraction is 20/60. This
value, which we will call Fraction_enhanced, is always less than or equal to 1.

2. The improvement gained by the enhanced execution mode; that is, how much
faster the task would run if the enhanced mode were used for the entire program—
This value is the time of the original mode over the time of the enhanced mode: If the
enhanced mode takes 2 seconds for some portion of the program that can completely use the
mode, while the original mode took 5 seconds for the same portion, the improvement is 5/2.
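We will call this value, which is always greater than 1, Speedup_enhanced.
        The execution time using the original machine with the enhanced mode will be the
time spent using the unenhanced portion of the machine plus the time spent using the
enhancement:

        $$\text{Execution time}_{new} = \text{Execution time}_{old} \times \left( (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right)$$

The overall speedup is the ratio of the old execution time to the new one:

        $$\text{Speedup}_{overall} = \frac{\text{Execution time}_{old}}{\text{Execution time}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$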

EXAMPLE: Suppose that we are considering an enhancement to the processor of a server
system used for web serving. The new CPU is 10 times faster on computation in the web
serving application than the original processor. Assuming that the original CPU is busy with
computation 40% of the time and is waiting for I/O 60% of the time, what is the overall
speedup gained by incorporating the enhancement?
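
        The worked answer is missing from this copy. A minimal sketch of the calculation, using the
standard form of Amdahl's Law, Speedup_overall = 1 / ((1 - Fraction_enhanced) +
Fraction_enhanced / Speedup_enhanced), with Fraction_enhanced = 0.4 and Speedup_enhanced = 10.
The function name is illustrative.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of the original execution
    time is accelerated by a factor of `speedup_enhanced`."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    unenhanced = 1.0 - fraction_enhanced
    return 1.0 / (unenhanced + fraction_enhanced / speedup_enhanced)


# Web-serving example: computation is 40% of the time and becomes 10x faster.
overall = amdahl_speedup(0.4, 10.0)
print(f"overall speedup = {overall:.2f}")  # prints "overall speedup = 1.56"

# Diminishing returns: even an enormous computation speedup cannot push the
# overall speedup past 1 / (1 - 0.4), because the I/O wait is unchanged.
for factor in (10.0, 100.0, 1000.0):
    print(f"{factor:>6.0f}x computation -> {amdahl_speedup(0.4, factor):.3f}x overall")
```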

Amdahl’s Law expresses the law of diminishing returns: The incremental improvement in
speedup gained by an additional improvement in the performance of just a portion of the
computation diminishes as improvements are added. A common mistake in applying
Amdahl’s Law is to confuse “fraction of time converted to use an enhancement” and
“fraction of time after enhancement is in use.”
        Amdahl’s Law can serve as a guide to how much an enhancement will improve
performance and how to distribute resources to improve cost/performance. The goal, clearly,
is to spend resources proportional to where time is spent. Amdahl’s Law is particularly useful
for comparing the overall system performance of two alternatives, but it can also be applied
to compare two CPU design alternatives.

The CPU Performance Equation:
         Essentially all computers are constructed using a clock running at a constant rate.
These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles, or clock
cycles. Computer designers refer to the time of a clock period by its duration (e.g., 1 ns) or by
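its rate (e.g., 1 GHz). CPU time for a program can then be expressed two ways:

        $$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$$

or

        $$\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$$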
         In addition to the number of clock cycles needed to execute a program, we can also
count the number of instructions executed—the instruction path length or instruction count
(IC). If we know the number of clock cycles and the instruction count we can calculate the
average number of clock cycles per instruction (CPI). Designers sometimes also use
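Instructions per Clock or IPC, which is the inverse of CPI. CPI is computed as:

        $$\text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{Instruction count}}$$

Rearranging gives the clock cycles as IC × CPI, so CPU time can be written as

        $$\text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}$$

Expanding this formula into the units of measurement and inverting the clock rate shows
how the pieces fit together:

        $$\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Seconds}}{\text{Program}} = \text{CPU time}$$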

         As this formula demonstrates, CPU performance is dependent upon three
characteristics: clock cycle (or rate), clock cycles per instruction, and instruction count.
Furthermore, CPU time is equally dependent on these three characteristics: A 10%
improvement in any one of them leads to a 10% improvement in CPU time. Unfortunately, it
is difficult to change one parameter in complete isolation from others because the basic
technologies involved in changing each characteristic are interdependent. Sometimes it is
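useful in designing the CPU to calculate the number of total CPU clock cycles as

        $$\text{CPU clock cycles} = \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i$$

where IC_i represents the number of times instruction i is executed in a program and CPI_i
represents the average number of clock cycles per instruction for instruction i. This form can be
used to express CPU time as

        $$\text{CPU time} = \left( \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i \right) \times \text{Clock cycle time}$$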

EXAMPLE: Suppose we have made the following measurements:
         Frequency of FP operations (other than FPSQR) = 25%
         Average CPI of FP operations = 4.0
         Average CPI of other instructions = 1.33
         Frequency of FPSQR = 2%
         CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease
the average CPI of all FP operations to 2.5. Compare these two design alternatives using the
CPU performance equation.

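ANSWER: First, observe that only the CPI changes; the clock rate and instruction count
remain identical. We start by finding the original CPI with neither enhancement:

        $$\text{CPI}_{original} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{IC}_i}{\text{Instruction count}} = (4.0 \times 25\%) + (1.33 \times 75\%) \approx 2.0$$

We can compute the CPI for the enhanced FPSQR by subtracting the
cycles saved from the original CPI:

        $$\text{CPI}_{new FPSQR} = \text{CPI}_{original} - 2\% \times (\text{CPI}_{old FPSQR} - \text{CPI}_{new FPSQR only}) = 2.0 - 2\% \times (20 - 2) = 1.64$$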
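
        The rest of the worked answer is missing from this copy. The sketch below carries the
comparison through with the measurements given in the example, assuming the overall CPI is the
frequency-weighted sum of the per-class CPIs. Variable and function names are illustrative, and
the values are left unrounded, so they differ slightly from hand-rounded figures.

```python
def average_cpi(mix):
    """Overall CPI from (frequency, cpi) pairs whose frequencies sum to 1."""
    return sum(frequency * cpi for frequency, cpi in mix)


freq_fp, cpi_fp = 0.25, 4.0          # FP operations
freq_other, cpi_other = 0.75, 1.33   # all other instructions
freq_fpsqr, cpi_fpsqr = 0.02, 20.0   # FPSQR

cpi_original = average_cpi([(freq_fp, cpi_fp), (freq_other, cpi_other)])

# Alternative 1: FPSQR CPI drops from 20 to 2; subtract the cycles saved.
cpi_new_fpsqr = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)

# Alternative 2: average CPI of all FP operations drops from 4.0 to 2.5.
cpi_new_fp = average_cpi([(freq_fp, 2.5), (freq_other, cpi_other)])

# Clock rate and instruction count are unchanged, so speedup is a CPI ratio.
print(f"original CPI      = {cpi_original:.4f}")
print(f"CPI with new FPSQR = {cpi_new_fpsqr:.4f} "
      f"(speedup {cpi_original / cpi_new_fpsqr:.2f})")
print(f"CPI with new FP    = {cpi_new_fp:.4f} "
      f"(speedup {cpi_original / cpi_new_fp:.2f})")
```

Under these numbers the lower CPI, and hence the better performance, comes from improving all FP
operations, though the margin over the FPSQR enhancement is small.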

Measuring and Modeling the Components of the CPU Performance Equation:
        To use the CPU performance equation as a design tool, we need to be able to measure
the various factors. For an existing processor, it is easy to obtain the execution time by
measurement, and the clock speed is known. The challenge lies in discovering the instruction
count or the CPI. Most newer processors include counters for both instructions executed and
for clock cycles.
        There are three general classes of simulation techniques that are used. In general, the
more sophisticated techniques yield more accuracy, particularly for more recent architectures,
at the cost of longer execution time. The first and simplest technique, and hence the least
costly, is profile-based, static modeling. In this technique a dynamic execution profile of the
program, which indicates how often each instruction is executed, is obtained by one of three
methods:
1. By using hardware counters on the processor, which are periodically saved. This technique
often gives an approximate profile, but one that is within a few percent of exact.
2. By using instrumented execution, in which instrumentation code is compiled into the
program. This code is used to increment counters, yielding an exact
profile. (This technique can also be used to create a trace of memory addresses that are
accessed, which is useful for other simulation techniques.)
3. By interpreting the program at the instruction set level, compiling instruction counts in the
process.
        Once the profile is obtained, it is used to analyze the program in a static fashion by
looking at the code. Obviously with the profile, the total instruction count is easy to obtain. It
is also easy to get a detailed dynamic instruction mix telling what types of instructions were
executed with what frequency. Finally, for simple processors, it is possible to compute an
approximation to the CPI. It is a reasonable and very fast technique for modeling the
performance of short, integer pipelines, ignoring the memory system behavior.
        Trace-driven simulation is a more sophisticated technique for modeling performance
and is particularly useful for modeling memory system performance. In trace-driven
simulation, a trace of the memory references executed is created, usually either by simulation
or by instrumented execution. The trace includes what instructions were executed (given by
the instruction address), as well as the data addresses accessed.
        The third technique, which is the most accurate and most costly, is execution-driven
simulation. In execution-driven simulation a detailed simulation of the memory system and
the processor pipeline is done simultaneously.

Locality of Reference:
        The most important program property that we regularly exploit is locality of
reference: Programs tend to reuse data and instructions they have used recently. A widely
held rule of thumb is that a program spends 90% of its execution time in only 10% of the
code. An implication of locality is that we can predict with reasonable accuracy what
instructions and data a program will use in the near future based on its accesses in the recent
past.
        Locality of reference also applies to data accesses, though not as strongly as to code
accesses. Two different types of locality have been observed. Temporal locality states that
recently accessed items are likely to be accessed in the near future. Spatial locality says that
items whose addresses are near one another tend to be referenced close together in time.

Take Advantage of Parallelism:
        Taking advantage of parallelism is one of the most important methods for improving
performance. Our first example is the use of parallelism at the system level. To improve the
throughput performance on a typical server benchmark, such as SPECWeb or TPC, multiple
processors and multiple disks can be used. The workload of handling requests can then be
spread among the CPUs or disks resulting in improved throughput. This is the reason that
scalability is viewed as a valuable asset for server applications.
        At the level of an individual processor, taking advantage of parallelism among
instructions is critical to achieving high performance. One of the simplest ways to do this is
through pipelining.
        Parallelism can also be exploited at the level of detailed digital design. For example,
set associative caches use multiple banks of memory that are typically searched in parallel to
find a desired item.
                    Instruction Set Principles and Examples

A) Classifying Instruction Set Architectures:
       The type of internal storage in a processor is the most basic differentiation in
determining the instruction set architecture. The major choices are a stack, an accumulator,
or a set of registers. Operands may be named explicitly or implicitly: The operands in a
stack architecture are implicitly on the top of the stack, and in an accumulator
architecture one operand is implicitly the accumulator. The general-purpose register
architectures have only explicit operands—either registers or memory locations. Figure 2.1
shows a block diagram of such architectures and Figure 2.2 shows how the code sequence C
= A + B would typically appear in these three classes of instruction sets. The explicit
operands may be accessed directly from memory or may need to be first loaded into
temporary storage, depending on the class of architecture and choice of specific instruction.
                                          Figure 2.1.

        The arrows indicate whether the operand is an input or the result of the ALU
operation, or both an input and result. Lighter shades indicate inputs and the dark shade
indicates the result. In (a), a Top Of Stack register (TOS), points to the top input operand,
which is combined with the operand below. The first operand is removed from the stack, the
result takes the place of the second operand, and TOS is updated to point to the result. All
operands are implicit. In (b), the Accumulator is both an implicit input operand and a result.
In (c) one input operand is a register, one is in memory, and the result goes to a register. All
operands are registers in (d), and, like the stack architecture, can be transferred to memory
only via separate instructions: push or pop for (a) and load or store for (d).
        There are really two classes of register computers. One class can access memory as
part of any instruction, called register-memory architecture, and the other can access
memory only with load and store instructions, called load-store or register-register
architecture. A third class, not found in computers shipping today, keeps all operands in
memory and is called memory-memory architecture. Some instruction set architectures
have more registers than a single accumulator, but place restrictions on uses of these special
registers. Such an architecture is sometimes called an extended accumulator or special
purpose register computer.
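
        To make the classes concrete, the sketch below runs the code sequence C = A + B on three
toy interpreters: a stack machine, an accumulator machine, and a load-store machine. The
instruction names, tuple encoding, and operand values are assumptions made for illustration and
do not correspond to any real instruction set.

```python
def run_stack(program, mem):
    """Stack machine: ALU operands are implicitly on top of the stack."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(mem[args[0]])
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "pop":
            mem[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown stack op: {op}")
    return mem


def run_accumulator(program, mem):
    """Accumulator machine: one operand is implicitly the accumulator."""
    acc = 0
    for op, addr in program:
        if op == "load":
            acc = mem[addr]
        elif op == "add":
            acc += mem[addr]
        elif op == "store":
            mem[addr] = acc
        else:
            raise ValueError(f"unknown accumulator op: {op}")
    return mem


def run_load_store(program, mem):
    """Load-store machine: ALU operands are explicit registers only."""
    regs = {}
    for op, *args in program:
        if op == "load":        # load Rd, addr
            regs[args[0]] = mem[args[1]]
        elif op == "add":       # add Rd, Rs1, Rs2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "store":     # store Rs, addr
            mem[args[1]] = regs[args[0]]
        else:
            raise ValueError(f"unknown load-store op: {op}")
    return mem


stack_code = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
accumulator_code = [("load", "A"), ("add", "B"), ("store", "C")]
load_store_code = [("load", "R1", "A"), ("load", "R2", "B"),
                   ("add", "R3", "R1", "R2"), ("store", "R3", "C")]

for name, run, code in [("stack", run_stack, stack_code),
                        ("accumulator", run_accumulator, accumulator_code),
                        ("load-store", run_load_store, load_store_code)]:
    result = run(code, {"A": 3, "B": 4, "C": 0})
    print(f"{name}: {len(code)} instructions, C = {result['C']}")
```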


        Although most early computers used stack or accumulator-style architectures,
virtually every new architecture designed after 1980 uses load-store register architecture. The
major reasons for the emergence of general-purpose register (GPR) computers are twofold.
First, registers, like other forms of storage internal to the processor, are faster than
memory. Second, registers are more efficient for a compiler to use than other forms of
internal storage. For example, on a register computer the expression (A*B)–(B*C)–(A*D)
may be evaluated by doing the multiplications in any order, which may be more efficient
because of the location of the operands or because of pipelining concerns. Nevertheless, on a
stack computer the hardware must evaluate the expression in only one order, since operands
are hidden on the stack, and it may have to load an operand multiple times.
        More importantly, registers can be used to hold variables. When variables are
allocated to registers, the memory traffic reduces, the program speeds up (since registers are
faster than memory), and the code density improves (since a register can be named with
fewer bits than can a memory location).
        How many registers are sufficient? The answer, of course, depends on the
effectiveness of the compiler. Most compilers reserve some registers for expression
evaluation, use some for parameter passing, and allow the remainder to be allocated to hold
variables.
        Two major instruction set characteristics divide GPR architectures. Both
characteristics concern the nature of operands for a typical arithmetic or logical instruction
(ALU instruction). The first concerns whether an ALU instruction has two or three
operands. In the three-operand format, the instruction contains one result operand and two
source operands. In the two-operand format, one of the operands is both a source and a result
for the operation. The second distinction among GPR architectures concerns how many of
the operands may be memory addresses in ALU instructions. The number of memory
operands supported by a typical ALU instruction may vary from none to three.

Typical combinations of memory operands and total operands per typical ALU instruction with examples
                                          of computers.
 Advantages and disadvantages of the three most common types of general-purpose register computers.

B) Memory Addressing:
       Independent of whether the architecture is register-register or allows any operand to
be a memory reference, it must define how memory addresses are interpreted and how they
are specified.

Interpreting Memory Addresses:
        How is a memory address interpreted? That is, what object is accessed as a
function of the address and the length? All the instruction sets discussed in this book––except
some DSPs––are byte addressed and provide access for bytes (8 bits), half words (16 bits),
and words (32 bits). Most of the computers also provide access for double words (64 bits).
        There are two different conventions for ordering the bytes within a larger object.
Little Endian byte order puts the byte whose address is “x...x000” at the least-significant
position in the double word (the little end). The bytes are numbered:

        7    6    5    4    3    2    1    0

        Big Endian byte order puts the byte whose address is “x...x000” at the most-significant
position in the double word (the big end). The bytes are numbered:

        0    1    2    3    4    5    6    7

        When operating within one computer, the byte order is often unnoticeable—only
programs that access the same locations as both, say, words and bytes can notice the
difference. Byte order is a problem when exchanging data among computers with different
orderings, however. Little Endian ordering also fails to match normal ordering of words when
strings are compared. Strings appear “SDRAWKCAB” (backwards) in the registers. A
second memory issue is that in many computers, accesses to objects larger than a byte must
be aligned. Even if data are aligned, supporting byte, half-word, and word accesses requires
an alignment network to align bytes, half words, and words in 64-bit registers.
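
        Byte order is easy to observe with Python's built-in integer conversions. The sketch below
is only an illustration; the sample value and the 8-byte string are arbitrary choices.

```python
value = 0x0102030405060708  # a 64-bit double word

# Bytes listed in increasing address order for each convention.
little = value.to_bytes(8, byteorder="little")
big = value.to_bytes(8, byteorder="big")
print("little endian:", little.hex())  # 0807060504030201
print("big endian:   ", big.hex())     # 0102030405060708

# Exchanging data between machines with different orderings: bytes written
# by a little-endian machine and read as big-endian give a different number.
misread = int.from_bytes(little, byteorder="big")
print(hex(misread))  # 0x807060504030201

# A string loaded into a register on a little-endian machine appears
# backwards when the register is read from most- to least-significant byte.
text = b"BACKWARD"
register = int.from_bytes(text, byteorder="little")
print(register.to_bytes(8, byteorder="big"))  # b'DRAWKCAB'
```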

Addressing Modes:
        Given an address, we now know what bytes to access in memory. When a memory
location is used, the actual memory address specified by the addressing mode is called the
effective address. We have kept addressing modes that depend on the program counter,
called PC-relative addressing, separate. PC-relative addressing is used primarily for
specifying code addresses in control transfer instructions.
        Figure 2.6 shows the most common names for the addressing modes, though the
names differ among architectures. The left arrow (←) is used for assignment. We also use the
array Mem as the name for main memory and the array Regs for registers. Thus,
Mem[Regs[R1]] refers to the contents of the memory location whose address is given by the
contents of register 1 (R1).
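
        The Mem/Regs notation maps directly onto arrays. The following sketch models three common
modes (immediate, register indirect, and displacement); the memory size, register numbers, and
stored values are assumptions picked for illustration.

```python
Mem = [0] * 256   # main memory, indexed by address
Regs = [0] * 8    # register file, indexed by register number
R1, R4 = 1, 4

Regs[R1] = 20     # R1 holds an address
Regs[R4] = 1
Mem[20] = 5
Mem[120] = 7

# Immediate:          Add R4, #3      Regs[R4] <- Regs[R4] + 3
Regs[R4] = Regs[R4] + 3

# Register indirect:  Add R4, (R1)    Regs[R4] <- Regs[R4] + Mem[Regs[R1]]
effective_address = Regs[R1]
Regs[R4] = Regs[R4] + Mem[effective_address]

# Displacement:       Add R4, 100(R1) Regs[R4] <- Regs[R4] + Mem[100 + Regs[R1]]
effective_address = 100 + Regs[R1]
Regs[R4] = Regs[R4] + Mem[effective_address]

print(Regs[R4])  # 1 + 3 + 5 + 7 = 16
```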
        Addressing modes have the ability to significantly reduce instruction counts; they also
add to the complexity of building a computer and may increase the average CPI (clock cycles
per instruction) of computers that implement those modes. Immediate and displacement
addressing dominate addressing mode usage.
                                         Figure 2.6

Displacement Addressing Mode:
       The major question that arises for a displacement-style addressing mode is that of the
range of displacements used. Based on the use of various displacement sizes, a decision of
what sizes to support can be made. Choosing the displacement field sizes is important
because they directly affect the instruction length.
Immediate or Literal Addressing Mode:
       Immediates can be used in arithmetic operations, in comparisons (primarily for
branches), and in moves where a constant is wanted in a register. The last case occurs for
constants written in the code, which tend to be small, and for address constants, which tend
to be large. Another important instruction set measurement is the range of values for
immediates. Like displacement values, the size of immediate values affects instruction length.
Small immediate values are most heavily used. Large immediates are sometimes used,
however, most likely in addressing calculations.

C) Addressing Modes for Signal Processing:
        Since DSPs deal with infinite, continuous streams of data, they routinely rely on
circular buffers. Hence, as data is added to the buffer, a pointer is checked to see if it is
pointing at the end of the buffer. If not, it increments the pointer to the next address; if it is,
the pointer is set instead to the start of the buffer. Similar issues arise when emptying a
buffer.
        Every recent DSP has a modulo or circular addressing mode to handle this case
automatically, our first novel DSP addressing mode. It keeps a start register and an end
register with every address register, allowing the autoincrement and autodecrement
addressing modes to reset when they reach the end of the buffer.
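
        In software terms, circular addressing behaves roughly like the class below. This is a
sketch only: the class name, the inclusive end bound, and the post-increment behavior are
assumptions for illustration, not a description of any particular DSP.

```python
class CircularAddressRegister:
    """Address register paired with start and end registers (both inclusive)."""

    def __init__(self, start, end):
        if end < start:
            raise ValueError("end must not be below start")
        self.start = start
        self.end = end
        self.address = start

    def autoincrement(self):
        """Return the current address, then advance, wrapping at the end."""
        current = self.address
        if self.address == self.end:
            self.address = self.start   # wrap to the start of the buffer
        else:
            self.address += 1
        return current

    def autodecrement(self):
        """Return the current address, then step back, wrapping at the start."""
        current = self.address
        if self.address == self.start:
            self.address = self.end     # wrap to the end of the buffer
        else:
            self.address -= 1
        return current


pointer = CircularAddressRegister(start=100, end=103)
print([pointer.autoincrement() for _ in range(6)])
# [100, 101, 102, 103, 100, 101]
```

The hardware performs the end-of-buffer check as part of the address update, so the program needs
no explicit compare and branch.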
        Even though DSPs are tightly targeted to a small number of algorithms, it is surprising
that this next addressing mode is included for just one application: Fast Fourier Transform (FFT).
FFTs start or end their processing with data shuffled in a particular order. For eight data items
in a radix-2 FFT, the transformation is listed below, with addresses in parentheses shown in
binary:

        0 (000) => 0 (000)
        1 (001) => 4 (100)
        2 (010) => 2 (010)
        3 (011) => 6 (110)
        4 (100) => 1 (001)
        5 (101) => 5 (101)
        6 (110) => 3 (011)
        7 (111) => 7 (111)

        Without special support such address transformation would take an extra memory
access to get the new address, or involve a fair amount of logical instructions to transform the
address. The DSP solution is based on the observation that the resulting binary address is
simply the reverse of the initial address! For example, address 100_2 (4) becomes 001_2 (1).
Hence, many DSPs have this second novel addressing mode–– bit reverse addressing––
whereby the hardware reverses the lower bits of the address, with the number of bits reversed
depending on the step of the FFT algorithm.
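
        The address transformation is simple to state in code. The function below is an
illustrative software version of what the hardware does in the address path; the name and the
`bits` parameter are assumptions.

```python
def bit_reverse(address, bits):
    """Reverse the lowest `bits` bits of `address`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (address & 1)  # move the lowest bit across
        address >>= 1
    return result


# Eight data items in a radix-2 FFT use 3 address bits.
for index in range(8):
    shuffled = bit_reverse(index, 3)
    print(f"{index} ({index:03b}) => {shuffled} ({shuffled:03b})")
```

In software this costs a loop (or a table lookup) per address, which is the overhead the
addressing mode removes.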
        As DSP programmers migrate towards larger programs and hence become more
attracted to compilers, they have been trying to use the compiler technology developed for
the desktop and embedded computers. Such compilers have no hope of taking high-level
language code and producing these two addressing modes, so the modes remain limited to assembly
language programmers.
        First, because of their popularity, we would expect a new architecture to support at
least the following addressing modes: displacement, immediate, and register indirect.
Desktop and server processors rely on compilers and so addressing modes must match the
ability of the compilers to use them, while historically DSPs rely on hand-coded libraries to
exercise novel addressing modes.

D) Type and Size of Operands:
        How is the type of an operand designated? Normally, encoding in the opcode
designates the type of an operand—this is the method used most often. Alternatively, the data
can be annotated with tags that are interpreted by the hardware. These tags specify the type of
the operand, and the operation is chosen accordingly.
        Usually the type of an operand— integer, single-precision floating point, character,
and so on—effectively gives its size. Common operand types include character (8 bits), half
word (16 bits), word (32 bits), single-precision floating point (also 1 word), and double
precision floating point (2 words). Integers are almost universally represented as two’s
complement binary numbers. Characters are usually in ASCII, but the 16-bit Unicode (used
in Java) is gaining popularity with the internationalization of computers. Until the early
1980s, most computer manufacturers chose their own floating-point representation. Almost
all computers since that time follow the same standard for floating point, the IEEE standard
754.
        Some architectures provide operations on character strings, although such operations
are usually quite limited and treat each byte in the string as a single character. Typical
operations supported on character strings are comparisons and moves.
        For business applications, some architectures support a decimal format, usually called
packed decimal or binary-coded decimal: 4 bits are used to encode the values 0–9, and 2
decimal digits are packed into each byte. Numeric character strings are sometimes called
unpacked decimal, and operations—called packing and unpacking—are usually provided
for converting back and forth between them.
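
        Packing and unpacking can be sketched in a few lines. This is a simplified illustration:
the function names are hypothetical, and the sign handling that real packed-decimal formats
usually include is left out.

```python
def pack_decimal(digits):
    """Pack a string of decimal digits two per byte, 4 bits per digit."""
    if not digits or any(ch not in "0123456789" for ch in digits):
        raise ValueError("expected a non-empty string of decimal digits")
    if len(digits) % 2:
        digits = "0" + digits  # pad to an even number of digits
    return bytes((int(digits[i]) << 4) | int(digits[i + 1])
                 for i in range(0, len(digits), 2))


def unpack_decimal(packed):
    """Expand packed decimal back into one character per digit."""
    return "".join(f"{byte >> 4}{byte & 0x0F}" for byte in packed)


packed = pack_decimal("1234")
print(packed.hex())            # 1234 (two bytes: 0x12, 0x34)
print(unpack_decimal(packed))  # 1234 (four characters)
```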
        One reason to use decimal operands is to get results that exactly match decimal
numbers, as some decimal fractions do not have an exact representation in binary. Our SPEC
benchmarks use byte or character, half word (short integer), word (integer), double word
(long integer) and floating-point data types.

E) Operations in the Instruction Set:
        The operators supported by most instruction set architectures can be categorized as in
Figure 2.15. One rule of thumb across all architectures is that the most widely executed
instructions are the simple operations of an instruction set. For example Figure 2.16 shows
10 simple instructions that account for 96% of instructions executed for a collection of
integer programs running on the popular Intel 80x86.
      Figure-2.15: Categories of instruction operators and examples of each.
        All computers generally provide a full set of operations for the first three categories.
The support for system functions in the instruction set varies widely among architectures, but
all computers must have some instruction support for basic system functions. The amount of
support in the instruction set for the last four categories may vary from none to an extensive
set of special instructions. Floating-point instructions will be provided in any computer that is
intended for use in an application that makes much use of floating point. These instructions
are sometimes part of an optional instruction set. Decimal and string instructions are
sometimes primitives, as in the VAX or the IBM 360, or may be synthesized by the compiler
from simpler instructions. Graphics instructions typically operate on many smaller data items
in parallel; for example, performing eight 8-bit additions on two 64-bit operands.
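
        The idea of eight 8-bit additions on two 64-bit operands can be imitated in software. The
sketch below compares a lane-by-lane reference with a packed version that uses masks to keep
carries from crossing lane boundaries; it assumes wraparound (non-saturating) arithmetic, and the
function names are illustrative.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
LOW7 = 0x7F7F7F7F7F7F7F7F    # low 7 bits of every 8-bit lane
HIGH1 = 0x8080808080808080   # top bit of every 8-bit lane


def add_8x8_reference(a, b):
    """Add the eight 8-bit lanes one at a time, each wrapping modulo 256."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift
    return result


def add_8x8_packed(a, b):
    """Add all eight lanes with one wide addition.

    The low 7 bits of each lane are added together; their carry lands in the
    lane's own top bit. The top bits are then combined with XOR, so no carry
    ever propagates into the neighboring lane.
    """
    low_sum = (a & LOW7) + (b & LOW7)
    return (low_sum ^ ((a ^ b) & HIGH1)) & MASK64


a = 0x01FF7F80102030F0
b = 0x0101018010203020
print(f"{add_8x8_packed(a, b):016x}")  # 0200800020406010
```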

                 FIGURE 2.16: The top 10 instructions for the 80x86.

F) Instructions for Control Flow:
       There is no consistent terminology for instructions that change the flow of control. In
the 1950s they were typically called transfers. Beginning in 1960 the name branch began to
be used. Later, computers introduced additional names. Throughout this book we will use
jump when the change in control is unconditional and branch when the change is
conditional.
We can distinguish four different types of control-flow change:
       1. Conditional branches
       2. Jumps
       3. Procedure calls
       4. Procedure returns

Addressing Modes for Control Flow Instructions:
        The destination address of a control flow instruction must always be specified. This
destination is specified explicitly in the instruction in the vast majority of cases—procedure
return being the major exception—since for return the target is not known at compile time.
The most common way to specify the destination is to supply a displacement that is added to
the program counter, or PC. Control flow instructions of this sort are called PC-relative.
PC-relative branches or jumps are advantageous because the target is often near the current
instruction, and specifying the position relative to the current PC requires fewer bits. Using
PC-relative addressing also permits the code to run independently of where it is loaded. This
property, called position independence, can eliminate some work when the program is
linked and is also useful in programs linked dynamically during execution.
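
        Position independence follows from the arithmetic. In the sketch below the displacement is
taken relative to the address of the branch itself, which is a simplifying assumption (many
architectures measure it from the following instruction); the addresses are arbitrary.

```python
def encode_displacement(branch_address, target_address):
    """Displacement stored in the instruction: target relative to the PC."""
    return target_address - branch_address


def resolve_target(pc, displacement):
    """Target computed at run time by adding the displacement to the PC."""
    return pc + displacement


# Offsets of a branch and its target within the same piece of code.
branch_offset, target_offset = 0x40, 0x10

for load_base in (0x1000, 0x8000):
    pc = load_base + branch_offset
    displacement = encode_displacement(pc, load_base + target_offset)
    target = resolve_target(pc, displacement)
    print(f"loaded at {load_base:#x}: displacement {displacement}, target {target:#x}")
```

The encoded displacement is the same for both load addresses, so the instruction bytes need no
change when the code is moved.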
        To implement returns and indirect jumps when the target is not known at compile
time, a method other than PC-relative addressing is required. Here, there must be a way to
specify the target dynamically, so that it can change at runtime. These register indirect jumps
are also useful for four other important features:
1. Case or switch statements found in most programming languages (which select among
one of several alternatives);
2. Virtual functions or methods in object-oriented languages like C++ or Java (which allow
different routines to be called depending on the type of the argument);
3. Higher-order functions or function pointers in languages like C or C++ (which allow
functions to be passed as arguments, giving some of the flavor of object-oriented
programming), and
4. Dynamically shared libraries (which allow a library to be loaded and linked at runtime
only when it is actually invoked by the program rather than loaded and linked statically
before the program is run).
        In all four cases the target address is not known at compile time, and hence is usually
loaded from memory into a register before the register indirect jump.
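        A case statement compiled to a jump table gives the flavor of this. In the sketch
below a Python list stands in for the table of target addresses held in memory, and
calling through the fetched entry stands in for the register indirect jump. All names
are hypothetical and the analogy is loose, since Python hides the actual machine jump.

```python
def case_zero():
    return "zero"


def case_one():
    return "one"


def case_default():
    return "other"


# The table plays the role of memory holding the possible target addresses.
JUMP_TABLE = [case_zero, case_one]


def dispatch(selector):
    """Fetch the target chosen at runtime, then transfer control through it."""
    if 0 <= selector < len(JUMP_TABLE):
        target = JUMP_TABLE[selector]   # analogous to loading the target into a register
    else:
        target = case_default
    return target()                     # analogous to the register indirect jump


print(dispatch(1), dispatch(7))
```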

Conditional Branch Options:
       Since most changes in control flow are branches, deciding how to specify the branch
condition is important. One of the most noticeable properties of branches is that a large
number of the comparisons are simple tests, and a large number are comparisons with zero.
Thus, some architectures choose to treat these comparisons as special cases, especially if a
compare and branch instruction is being used. DSPs add another looping structure, usually
called a repeat instruction. It allows a single instruction or a block of instructions to be
repeated up to, say, 256 times. Figure 2.21 shows the three primary techniques in use today
and their advantages and disadvantages.

Procedure Invocation Options:
        Procedure calls and returns include control transfer and possibly some state saving; at
a minimum the return address must be saved somewhere, sometimes in a special link register
or just a GPR. Some older architectures provide a mechanism to save many registers, while
newer architectures require the compiler to generate stores and loads for each register saved
and restored.
        There are two basic conventions in use to save registers: either at the call site or inside
the procedure being called. Caller saving means that the calling procedure must save the
registers that it wants preserved for access after the call, and thus the called procedure need
not worry about registers. Callee saving is the opposite: the called procedure must save the
registers it wants to use, leaving the caller unrestrained.
        There are times when caller save must be used because of access patterns to globally
visible variables in two different procedures. For example, suppose we have a procedure P1
that calls procedure P2, and both procedures manipulate the global variable x. If P1 had
allocated x to a register, it must be sure to save x to a location known by P2 before the call to
P2.

G) Encoding an Instruction Set:
         In general the instructions are encoded into a binary representation for execution by
the processor. This representation affects not only the size of the compiled program; it affects
the implementation of the processor, which must decode this representation to quickly find
the operation and its operands. The operation is typically specified in one field, called the
opcode. The important decision is how to encode the addressing modes with the operations.
         This decision depends on the range of addressing modes and the degree of
independence between opcodes and modes. Some older computers have one to five operands
with 10 addressing modes for each operand. For such a large number of combinations,
typically a separate address specifier is needed for each operand: the address specifier tells
what addressing mode is used to access the operand. At the other extreme are load-store
computers with only one memory operand and only one or two addressing modes; obviously,
in this case, the addressing mode can be encoded as part of the opcode.
         When encoding the instructions, the number of registers and the number of addressing
modes both have a significant impact on the size of instructions, as the register field and
addressing mode field may appear many times in a single instruction. The architect must
balance several competing forces when encoding the instruction set:
1. The desire to have as many registers and addressing modes as possible.
2. The impact of the size of the register and addressing mode fields on the average instruction
size and hence on the average program size.
3. A desire to have instructions encoded into lengths that will be easy to handle in a pipelined
implementation.
         Figure 2.23 shows three popular choices for encoding the instruction set. The first we
call variable, since it allows virtually all addressing modes to be used with all operations. This
style is best when there are many addressing modes and operations. The second choice we
call fixed, since it combines the operation and the addressing mode into the opcode. Often
fixed encoding will have only a single size for all instructions; it works best when there are
few addressing modes and operations. Given these two poles of instruction set design of
variable and fixed, the third alternative immediately springs to mind: Reduce the variability
in size and work of the variable architecture but provide multiple instruction lengths to reduce
code size. This hybrid approach is the third encoding alternative.
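        To make the cost of register fields concrete, the following sketch packs a
three-register instruction into a fixed 32-bit word. The layout (a 6-bit opcode followed
by three 5-bit register fields, left-aligned in the word) is an assumption made for
illustration and is not taken from Figure 2.23 or from any real instruction set.

```python
OPCODE_BITS = 6
REG_BITS = 5                              # 5 bits select one of 32 registers
USED_BITS = OPCODE_BITS + 3 * REG_BITS    # 21 of the 32 bits carry these fields


def encode(opcode, rd, rs1, rs2):
    """Pack the fields into one 32-bit word, opcode in the most significant bits."""
    for value, bits in ((opcode, OPCODE_BITS), (rd, REG_BITS),
                        (rs1, REG_BITS), (rs2, REG_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field value out of range")
    word = opcode
    for reg in (rd, rs1, rs2):
        word = (word << REG_BITS) | reg
    return word << (32 - USED_BITS)


def decode(word):
    """Recover (opcode, rd, rs1, rs2) from a word produced by encode."""
    word >>= 32 - USED_BITS
    mask = (1 << REG_BITS) - 1
    rs2 = word & mask
    rs1 = (word >> REG_BITS) & mask
    rd = (word >> (2 * REG_BITS)) & mask
    opcode = word >> (3 * REG_BITS)
    return opcode, rd, rs1, rs2


print(decode(encode(5, 1, 2, 3)))
```

Because the register field appears three times, doubling the register count to 64 would
add one bit to each field and so cost three bits in every such instruction.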

Reduced Code Size in RISCs:
        As RISC computers started being used in embedded applications, the 32-bit fixed
format became a liability since cost, and hence small code size, is important. In response,
several manufacturers offered a new hybrid version of their RISC instruction sets, with both
16-bit and 32-bit instructions. The narrow instructions support fewer operations, smaller
address and immediate fields, fewer registers, and two-address format rather than the classic
three-address format of RISC computers.
        In contrast to these instruction set extensions, IBM simply compresses its standard
instruction set, and then adds hardware to decompress instructions as they are fetched from
memory on an instruction cache miss. Thus, the instruction cache contains full 32-bit
instructions, but compressed code is kept in main memory, ROMs, and the disk.
        IBM's scheme, CodePack, starts with run-length encoding compression on any PowerPC program,
and then loads the resulting compression tables in a 2KB table on chip. Hence, every program
has its own unique encoding. IBM claims an overall performance cost of 10% in exchange for a
code size reduction of 35% to 40%. Hitachi simply invented a RISC instruction set with a
fixed, 16-bit format, called SuperH, for embedded applications. It has 16 rather than 32
registers to make it fit the narrower format and fewer instructions, but otherwise looks like a
classic RISC architecture.

FIGURE 2.23 Three basic variations in instruction encoding: variable length, fixed
                         length, and hybrid.

H) The Role of Compilers:
        Today almost all programming is done in high-level languages for desktop and server
applications. Because the compiler will significantly affect the performance of a computer,
understanding compiler technology today is critical to designing and efficiently implementing
an instruction set. Once it was popular to try to isolate the compiler technology and its effect
on hardware performance from the architecture and its performance, just as it was popular to
try to separate architecture from its implementation. This separation is essentially impossible
with today’s desktop compilers and computers.

The Structure of Recent Compilers:
        Compilers typically consist of two to four passes, with more highly optimizing
compilers having more passes. A compiler writer’s first goal is correctness—all valid
programs must be compiled correctly. The second goal is usually speed of the compiled code.
Typically, a whole set of other goals follows these two, including fast compilation, debugging
support, and interoperability among languages. The complexity of writing a correct compiler
is a major limitation on the amount of optimization that can be done.
        Although the multiple-pass structure helps reduce compiler complexity, it also means
that the compiler must order and perform some transformations before others. Compiler
writers call this problem the phase-ordering problem.
        How does this ordering of transformations interact with the instruction set
architecture? A good example occurs with the optimization called global common
subexpression elimination. This optimization finds two instances of an expression that compute
the same value and saves the value of the first computation in a temporary. It then uses the
temporary value, eliminating the second computation of the common expression.
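        The idea can be sketched on a single straight-line block. Note that this is a
simplified local version rather than the global optimization described above, and that
the tuple format (dest, op, a, b) and the function name are assumptions made for
illustration. Commutativity (a+b versus b+a) is deliberately not handled.

```python
def eliminate_common_subexpressions(block):
    """Replace a repeated expression with a copy from the name that first held it."""
    available = {}   # (op, a, b) -> name currently holding that value
    result = []
    for dest, op, a, b in block:
        key = (op, a, b)
        if key in available:
            result.append((dest, "copy", available[key], None))
        else:
            result.append((dest, op, a, b))
        # Writing dest invalidates expressions that read it or were held in it.
        available = {k: v for k, v in available.items()
                     if dest not in (k[1], k[2]) and v != dest}
        # The expression stays available only if dest is not one of its operands.
        if dest not in (a, b):
            available.setdefault(key, dest)
    return result


BLOCK = [
    ("t1", "+", "a", "b"),
    ("t2", "*", "t1", "c"),
    ("t3", "+", "a", "b"),   # same value as t1, so it becomes a copy
    ("a", "+", "a", "b"),    # also a copy, but it changes a
    ("t4", "+", "a", "b"),   # must be recomputed because a changed
]

for instruction in eliminate_common_subexpressions(BLOCK):
    print(instruction)
```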
        For this optimization to be significant, the temporary must be allocated to a register.
Otherwise, the cost of storing the temporary in memory and later reloading it may negate the
savings gained by not recomputing the expression. Optimizations performed by modern
compilers can be classified by the style of the transformation, as follows:
1. High-level optimizations are often done on the source with output fed to later
optimization passes.
2. Local optimizations optimize code only within a straight-line code fragment (called a
basic block by compiler people).
3. Global optimizations extend the local optimizations across branches and introduce a set of
transformations aimed at optimizing loops.
4. Register allocation.
5. Processor-dependent optimizations attempt to take advantage of specific architectural
knowledge.

Register Allocation:
        Because of the central role that register allocation plays, both in speeding up the code
and in making other optimizations useful, it is one of the most important, if not the most
important, optimizations. Register allocation algorithms today are based on a technique
called graph coloring. The basic idea behind graph coloring is to construct a graph
representing the possible candidates for allocation to a register and then to use the graph to
allocate registers. Roughly speaking, the problem is how to use a limited set of colors so that
no two adjacent nodes in a dependency graph have the same color. The emphasis in the
approach is to achieve 100% register allocation of active variables. The problem of coloring
a graph in general can take exponential time as a function of the size of the graph
(NP-complete).
        Graph coloring works best when there are at least 16 (and preferably more) general-
purpose registers available for global allocation for integer variables and additional registers
for floating point. Unfortunately, graph coloring does not work very well when the number of
registers is small because the heuristic algorithms for coloring the graph are likely to fail.
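        A minimal sketch of the coloring idea follows. It uses a simple greedy heuristic,
which is only one of many possible heuristics and is not claimed to be the algorithm
used by any particular compiler. The graph, the variable names, and the "most neighbors
first" ordering are all assumptions for illustration. An edge joins two variables that
are live at the same time and so cannot share a register.

```python
def color_graph(graph, num_registers):
    """Greedily assign register numbers; return (assignment, spilled variables)."""
    assignment = {}
    spilled = []
    # Visit the most constrained variables first (a heuristic, not an optimal order).
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        taken = {assignment[m] for m in graph[node] if m in assignment}
        free = [r for r in range(num_registers) if r not in taken]
        if free:
            assignment[node] = free[0]
        else:
            spilled.append(node)   # no color left: keep this variable in memory
    return assignment, spilled


# Hypothetical graph: a, b, and c are all live together; d overlaps only c.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(color_graph(GRAPH, 3))
print(color_graph(GRAPH, 2))
```

With three registers every variable receives one; with only two, the heuristic has to
leave a variable in memory, which mirrors the remark above about small register counts.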

The Impact of Compiler Technology on the Architect’s Decisions:
        The interaction of compilers and high-level languages significantly affects how
programs use an instruction set architecture. There are two important questions:
How are variables allocated and addressed? How many registers are needed to allocate
variables appropriately? To address these questions, we must look at the three separate
areas in which current high-level languages allocate their data:
        The stack is used to allocate local variables. The stack is grown and shrunk on
procedure call or return, respectively. Objects on the stack are addressed relative to the stack
pointer and are primarily scalars (single variables) rather than arrays.
        The global data area is used to allocate statically declared objects, such as global
variables and constants.
        The heap is used to allocate dynamic objects that do not adhere to a stack discipline.
Objects in the heap are accessed with pointers and are typically not scalars.
        Register allocation is much more effective for stack-allocated objects than for global
variables, and register allocation is essentially impossible for heap-allocated objects because
they are accessed with pointers. Global variables and some stack variables are impossible to
allocate because they are aliased, which means that there are multiple ways to refer to the
address of a variable, making it illegal to put it into a register.

How the Architect Can Help the Compiler Writer:
         Compiler writers often are working under their own corollary of a basic principle in
architecture: Make the frequent cases fast and the rare case correct. That is, if we know
which cases are frequent and which are rare, and if generating code for both is
straightforward, then the quality of the code for the rare case may not be very important—but
it must be correct! Some instruction set properties help the compiler writer.
1. Regularity: Whenever it makes sense, the three primary components of an instruction
set (the operations, the data types, and the addressing modes) should be orthogonal. Two
aspects of architecture are said to be orthogonal if they are independent.
2. Provide primitives, not solutions—Special features that “match” a language construct or
a kernel function are often unusable. Attempts to support high level languages may work only
with one language, or do more or less than is required for a correct and efficient
implementation of the language.
3. Simplify trade-offs among alternatives—One of the toughest jobs a compiler writer has
is figuring out what instruction sequence will be best for every segment of code that arises. In
earlier days, instruction counts or total code size might have been good metrics, but—as we
saw in the last chapter—this is no longer true. With caches and pipelining, the trade-offs have
become very complex. Anything the designer can do to help the compiler writer understand
the costs of alternative code sequences would help improve the code.
4. Provide instructions that bind the quantities known at compile time as constants—
A compiler writer hates the thought of the processor interpreting at runtime a value that was
known at compile time.
                           Instruction-Level Parallelism:

A) Instruction-Level Parallelism: Concepts and Challenges:
        All processors since about 1985, including those in the embedded space, use
pipelining to overlap the execution of instructions and improve performance. This potential
overlap among instructions is called instruction-level parallelism (ILP) since the
instructions can be evaluated in parallel. There are two largely separable approaches to
exploiting ILP: techniques that are largely dynamic and depend on the hardware to locate
the parallelism, and techniques that are static and rely much more on software.
        The dynamic, hardware-intensive approaches dominate the desktop and server
markets. The static, compiler-intensive approaches have seen broader adoption in the
embedded market than the desktop or server markets. The value of the CPI (Cycles per
Instruction) for a pipelined processor is the sum of the base CPI and all contributions from
stalls:
    Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
        The ideal pipeline CPI is a measure of the maximum performance attainable by the
implementation. By reducing each of the terms of the right-hand side, we minimize the
overall pipeline CPI and thus increase the IPC (Instructions per Clock).
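        As a small worked example of this sum, the sketch below evaluates the pipeline
CPI for two sets of stall contributions. The stall values are hypothetical numbers
chosen only to show the arithmetic; they are not measurements.

```python
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Sum the ideal CPI and the average stall cycles per instruction."""
    return ideal_cpi + structural + data_hazard + control


# Hypothetical stall contributions, in cycles per instruction.
before = pipeline_cpi(1.0, structural=0.05, data_hazard=0.30, control=0.15)
# Suppose better scheduling halves the data hazard stalls.
after = pipeline_cpi(1.0, structural=0.05, data_hazard=0.15, control=0.15)

print(f"CPI before: {before:.2f}, IPC: {1 / before:.2f}")
print(f"CPI after:  {after:.2f}, IPC: {1 / after:.2f}")
```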

Instruction-Level Parallelism:
        The amount of parallelism available within a basic block (a straight-line code
sequence with no branches in except to the entry and no branches out except at the exit) is
quite small. For typical MIPS programs the average dynamic branch frequency is often between
15% and 25%, meaning that between four and seven instructions execute between a pair of
branches. The simplest and most common way to increase the amount of parallelism
available among instructions is to exploit parallelism among iterations of a loop. This type of
parallelism is often called loop-level parallelism. Here is a simple example of a loop, which
adds two 1000-element arrays, that is completely parallel:
                for (i=1; i<=1000; i=i+1)
                x[i] = x[i] + y[i];
Every iteration of the loop can overlap with any other iteration, although within each loop
iteration there is little or no opportunity for overlap. There are a number of techniques we
will examine for converting such loop-level parallelism into instruction-level parallelism.
Basically, such techniques work by unrolling the loop either statically by the compiler or
dynamically by the hardware. An important alternative method for exploiting loop-level
parallelism is the use of vector instructions. Essentially, a vector instruction operates on a
sequence of data items. For example, the above code sequence could execute in four
instructions on some vector processors: two instructions to load the vectors x and y from
memory, one instruction to add the two vectors, and an instruction to store back the result
vector.
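        The unrolling transformation can be sketched by hand on the array-add loop.
The unroll factor of four and the function names are assumptions for illustration, and
the arrays are indexed from 0 here rather than from 1 as in the loop above. A compiler
or the hardware would do this at the instruction level; the sketch shows only the shape
of the transformed loop.

```python
def add_arrays_rolled(x, y):
    """Original loop: one element per iteration."""
    for i in range(len(x)):
        x[i] = x[i] + y[i]


def add_arrays_unrolled(x, y):
    """Loop unrolled by four: the four updates in the body are independent."""
    n = len(x)
    i = 0
    while i + 4 <= n:
        x[i] = x[i] + y[i]
        x[i + 1] = x[i + 1] + y[i + 1]
        x[i + 2] = x[i + 2] + y[i + 2]
        x[i + 3] = x[i + 3] + y[i + 3]
        i += 4
    # Clean-up loop for lengths that are not a multiple of four.
    while i < n:
        x[i] = x[i] + y[i]
        i += 1


a = list(range(1000))
b = [2 * v for v in range(1000)]
add_arrays_unrolled(a, b)
print(a[:4])
```

No statement in the unrolled body reads a value that another statement in the body
writes, so all four could be overlapped or executed at once.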

Data Dependence and Hazards:
        Determining how one instruction depends on another is critical to determining how
much parallelism exists in a program and how that parallelism can be exploited. In particular,
to exploit instruction-level parallelism we must determine which instructions can be executed
in parallel. If two instructions are parallel, they can execute simultaneously in a pipeline
without causing any stalls, assuming the pipeline has sufficient resources (and hence no
structural hazards exist). If two instructions are dependent they are not parallel and must be
executed in order, though they may often be partially overlapped. The key in both cases is to
determine whether an instruction is dependent on another instruction.

Data Dependences:
        There are three different types of dependences: data dependences (also called true
data dependences), name dependences, and control dependences. An instruction j is data
dependent on instruction i if either of the following holds:
1) Instruction i produces a result that may be used by instruction j, or
2) Instruction j is data dependent on instruction k, and instruction k is data dependent on
instruction i.
        For example, consider the following code sequence that increments a vector of values
in memory (starting at 0(R1) and with the last element at 8(R2)) by a scalar in register F2:

        If two instructions are data dependent they cannot execute simultaneously or be
completely overlapped. The dependence implies that there would be a chain of one or more
data hazards between the two instructions. Executing the instructions simultaneously will
cause a processor with pipeline interlocks to detect a hazard and stall, thereby reducing or
eliminating the overlap.
        Dependences are a property of programs. Whether a given dependence results in an
actual hazard being detected and whether that hazard actually causes a stall are properties of
the pipeline organization. A dependence can be overcome in two different ways:
maintaining the dependence but avoiding a hazard, and eliminating a dependence by
transforming the code. Scheduling the code is the primary method used to avoid a hazard
without altering a dependence.

Name Dependences:
        A name dependence occurs when two instructions use the same register or memory
location, called a name, but there is no flow of data between the instructions associated with
that name. There are two types of name dependences between an instruction i that precedes
instruction j in program order:
1. An antidependence between instruction i and instruction j occurs when instruction j writes
a register or memory location that instruction i reads. The original ordering must be preserved
to ensure that i reads the correct value.
2. An output dependence occurs when instruction i and instruction j write the same register
or memory location. The ordering between the instructions must be preserved to ensure that
the value finally written corresponds to instruction j.
        Since a name dependence is not a true dependence, instructions involved in a name
dependence can execute simultaneously or be reordered, if the name (register number or
memory location) used in the instructions is changed so the instructions do not conflict. This
renaming can be more easily done for register operands, where it is called register
renaming.

Data Hazards:
       Because of the dependence, we must preserve what is called program order, that is
the order that the instructions would execute in, if executed sequentially one at a time as
determined by the original source program. The goal of both our software and hardware
techniques is to exploit parallelism by preserving program order only where it affects the
outcome of the program. Data hazards may be classified as one of three types, depending on
the order of read and write accesses in the instructions.
1) RAW (read after write) — j tries to read a source before i writes it, so j incorrectly gets
the old value. This hazard is the most common type and corresponds to a true data
dependence.
2) WAW (write after write) — j tries to write an operand before it is written by i. The writes
end up being performed in the wrong order, leaving the value written by i rather than the
value written by j in the destination. This hazard corresponds to an output dependence.
3) WAR (write after read) — j tries to write a destination before it is read by i, so i
incorrectly gets the new value. This hazard arises from an antidependence.

Control Dependences:
       One of the simplest examples of a control dependence is the dependence of the
statements in the “then” part of an if statement on the branch. For example, in the code
                if p1 { S1; };
                if p2 { S2; }
        S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
In general, there are two constraints imposed by control dependences:
1. An instruction that is control dependent on a branch cannot be moved before the branch so
that its execution is no longer controlled by the branch. For example, we cannot take an
instruction from the then-portion of an if-statement and move it before the if-statement.
2. An instruction that is not control dependent on a branch cannot be moved after the branch
so that its execution is controlled by the branch. For example, we cannot take a statement
before the if-statement and move it into the then-portion.
        Control dependence is preserved by two properties in a simple pipeline. First,
instructions execute in program order. This ordering ensures that an instruction that occurs
before a branch is executed before the branch. Second, the detection of control or branch
hazards ensures that an instruction that is control dependent on a branch is not
executed until the branch direction is known. Control dependence itself, however, is not the
property that must be preserved. Instead, the two properties critical to
program correctness, and normally preserved by maintaining both data and control
dependence, are the exception behavior and the data flow. Preserving the exception
behavior means that any changes in the ordering of instruction execution must not change
how exceptions are raised in the program. The second property preserved by maintenance of
data dependences and control dependences is the data flow. The data flow is the actual flow
of data values among instructions that produce results and those that consume them. Branches
make the data flow dynamic, since they allow the source of data for a given instruction to
come from many points.

B) Overcoming Data Hazards:
        Dynamic scheduling is a technique in which the hardware rearranges the instruction
execution to reduce the stalls while maintaining data flow and exception behavior. Dynamic
scheduling offers several advantages: It enables handling some cases when dependences are
unknown at compile time (e.g., because they may involve a memory reference), and it
simplifies the compiler. Perhaps most importantly, it also allows code that was compiled with
one pipeline in mind to run efficiently on a different pipeline. The advantages of dynamic
scheduling are gained at a cost of a significant increase in hardware complexity.

Dynamic Scheduling: The Idea:
         A major limitation of the simple pipelining techniques is that they all use in-order
instruction issue and execution: Instructions are issued in program order and if an instruction
is stalled in the pipeline, no later instructions can proceed. Thus, if there is a dependence
between two closely spaced instructions in the pipeline, this will lead to a hazard and a stall
will result. If there are multiple functional units, these units could lie idle. If instruction j
depends on a long-running instruction i, currently in execution in the pipeline, then all
instructions after j must be stalled until i is finished and j can execute. For example, consider
this code:
                        DIV.D F0,F2,F4
                        ADD.D F10,F0,F8
                        SUB.D F12,F8,F14
         The SUB.D instruction cannot execute because the dependence of ADD.D on DIV.D
causes the pipeline to stall; yet SUB.D is not data dependent on anything in the pipeline. This
hazard creates a performance limitation that can be eliminated by not requiring instructions to
execute in program order.
         In the classic five-stage pipeline developed in the first chapter, both structural and
data hazards could be checked during instruction decode (ID). To allow us to begin executing
the SUB.D in the above example, we must separate the issue process into two parts: checking
for any structural hazards and waiting for the absence of a data hazard. Such a pipeline
does out-of-order execution, which implies out-of-order completion.
         Out-of-order execution introduces the possibility of WAR and WAW hazards, which
do not exist in the five-stage integer pipeline and its logical extension to an in-order floating-
point pipeline. Consider the following MIPS floating-point code sequence:
                DIV.D F0,F2,F4
                ADD.D F6,F0,F8
                SUB.D F8,F10,F14
                MULT.D F6,F10,F8
         There is an antidependence between the ADD.D and the SUB.D, and if the pipeline
executes the SUB.D before the ADD.D (which is waiting for the DIV.D), it will violate the
antidependence, yielding a WAR hazard. Likewise, to avoid violating output dependences,
such as the write of F6 by MULT.D, WAW hazards must be handled.
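         The dependences in the four-instruction sequence above can be found mechanically
by comparing the registers each pair of instructions reads and writes. The sketch below
is a simplified pairwise check under stated assumptions: the tuple format (name,
destination, sources) and the function name are hypothetical, only registers are
considered, and an intervening write to the same register is not taken into account.

```python
def find_dependences(instructions):
    """Return (kind, earlier index, later index, register) for each pair found."""
    found = []
    for j, (_, dest_j, srcs_j) in enumerate(instructions):
        for i in range(j):
            _, dest_i, srcs_i = instructions[i]
            if dest_i is not None and dest_i in srcs_j:
                found.append(("RAW", i, j, dest_i))   # j reads what i writes
            if dest_j is not None and dest_j in srcs_i:
                found.append(("WAR", i, j, dest_j))   # j writes what i reads
            if dest_i is not None and dest_i == dest_j:
                found.append(("WAW", i, j, dest_i))   # both write the same register
    return found


SEQUENCE = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MULT.D", "F6", ("F10", "F8")),
]

for kind, i, j, reg in find_dependences(SEQUENCE):
    print(kind, SEQUENCE[i][0], "->", SEQUENCE[j][0], "on", reg)
```

The WAR entry on F8 and the WAW entry on F6 are the two name dependences discussed in
the text; the RAW entries are the true data dependences.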
         Out-of-order completion also creates major complications in handling exceptions.
Dynamically scheduled processors preserve exception behavior by ensuring that no
instruction can generate an exception until the processor knows that the instruction raising the
exception will be executed. Although exception behavior must be preserved, dynamically
scheduled processors may generate imprecise exceptions. An exception is imprecise if the
processor state when an exception is raised does not look exactly as if the instructions were
executed sequentially in strict program order. Imprecise exceptions can occur because of two
possibilities:
1. the pipeline may have already completed instructions that are later in program order
than the instruction causing the exception, and
2. the pipeline may have not yet completed some instructions that are earlier in program
order than the instruction causing the exception.
        To allow out-of-order execution, we essentially split the ID pipe stage of our simple
five-stage pipeline into two stages:
1. Issue—Decode instructions, check for structural hazards.
2. Read operands—Wait until no data hazards, then read operands.
        An instruction fetch stage precedes the issue stage and may fetch either into an
instruction register or into a queue of pending instructions; instructions are then issued from
the register or queue. The EX stage follows the read operands stage, just as in the five-stage
pipeline. Execution may take multiple cycles, depending on the operation. Having multiple
instructions in execution at once requires multiple functional units, pipelined functional units,
or both. Scoreboarding is a technique for allowing instructions to execute out-of-order
when there are sufficient resources and no data dependences.

Dynamic Scheduling Using Tomasulo’s Approach:
        A key approach to allow execution to proceed in the presence of dependences was
used by the IBM 360/91 floating-point unit. Invented by Robert Tomasulo, this scheme tracks
when operands for instructions are available, to minimize RAW hazards, and introduces
register renaming, to minimize WAW and WAR hazards. The 360 architecture had only four
double-precision floating-point registers, which limits the effectiveness of compiler
scheduling; this fact was another motivation for the Tomasulo approach. In addition, the IBM
360/91 had long memory accesses and long floating-point delays, which Tomasulo’s
algorithm was designed to overcome.
        We explain the algorithm, which focuses on the floating-point unit and load/store
unit, in the context of the MIPS instruction set. RAW hazards are avoided by executing an
instruction only when its operands are available. WAR and WAW hazards, which arise from
name dependences, are eliminated by register renaming. To better understand how register
renaming eliminates WAR and WAW hazards consider the following example code sequence
that includes both a potential WAR and WAW hazard:
                DIV.D F0,F2,F4
                ADD.D F6,F0,F8
                S.D F6,0(R1)
                SUB.D F8,F10,F14
                MULT.D F6,F10,F8
        There are antidependences between the ADD.D and the SUB.D (on F8) and between the
S.D and the MULT.D (on F6), and an output dependence between the ADD.D and the MULT.D,
leading to three possible hazards: a WAR hazard on the use of F8 by the ADD.D, a WAR
hazard on the use of F6 by the S.D, and a WAW hazard since the ADD.D may finish later than
the MULT.D. These name dependences can all be eliminated by register renaming. For
simplicity, assume the existence of two temporary registers, S and T. Using S and T, the
sequence can be rewritten without any name dependences as:
                DIV.D F0,F2,F4
                ADD.D S,F0,F8
                S.D S,0(R1)
                SUB.D T,F10,F14
                MULT.D F6,F10,T
        In addition, any subsequent uses of F8 must be replaced by the register T. In this code
segment, the renaming process can be done statically by the compiler. In Tomasulo’s scheme,
register renaming is provided by the reservation stations, which buffer the operands of
instructions waiting to issue, and by the issue logic. The basic idea is that a reservation station
fetches and buffers an operand as soon as it is available, eliminating the need to get the
operand from a register. In addition, pending instructions designate the reservation station
that will provide their input. Finally, when successive writes to a register overlap in
execution, only the last one is actually used to update the register.
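        A software analogue of renaming can be sketched as follows. Unlike the S and T
example above, this sketch gives every destination a fresh name, which is a
simplification chosen for brevity. The names P0, P1, and so on, the tuple format (op,
destination, sources), and the function name are assumptions for illustration. A store
has no register destination, so its destination is None.

```python
import itertools


def rename(instructions):
    """Give each written register a fresh name and redirect later readers to it."""
    counter = itertools.count()
    current = {}   # architectural register -> most recent fresh name
    renamed = []
    for op, dest, sources in instructions:
        # Sources are looked up before the destination is renamed.
        new_sources = tuple(current.get(s, s) for s in sources)
        new_dest = None
        if dest is not None:
            new_dest = f"P{next(counter)}"
            current[dest] = new_dest
        renamed.append((op, new_dest, new_sources))
    return renamed


SEQUENCE = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MULT.D", "F6", ("F10", "F8")),
]

for instruction in rename(SEQUENCE):
    print(instruction)
```

After renaming, no two instructions write the same name and no instruction writes a
name that an earlier one reads, so only the true data dependences remain.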
        The use of reservation stations, rather than a centralized register file, leads to two
other important properties. First, hazard detection and execution control are distributed: The
information held in the reservation stations at each functional unit determines when an
instruction can begin execution at that unit. Second, results are passed directly to functional
units from the reservation stations where they are buffered, rather than going through the
registers.
        Figure 3.2 shows the basic structure of a Tomasulo-based MIPS processor, including
both the floating-point unit and the load/store unit; none of the execution control tables are
shown. Each reservation station holds an instruction that has been issued and is awaiting
execution at a functional unit, and either the operand values for that instruction, if they have
already been computed, or else the names of the functional units that will provide the
operand values.
   FIGURE 3.2 The basic structure of a MIPS floating point unit using Tomasulo’s algorithm.

     The load buffers and store buffers hold data or addresses coming from and going to
memory and behave almost exactly like reservation stations, so we distinguish them only
when necessary. The floating-point registers are connected by a pair of buses to the
functional units and by a single bus to the store buffers. All results from the functional units
and from memory are sent on the common data bus, which goes everywhere except to the
load buffer. All reservation stations have tag fields, employed by the pipeline control. An
instruction goes through only three steps:
1. Issue—Get the next instruction from the head of the instruction queue, which is
maintained in FIFO order to ensure the maintenance of correct data flow. If there is a
matching reservation station that is empty, issue the instruction to the station with the
operand values, if they are currently in the registers. If there is not an empty reservation
station, then there is a structural hazard and the instruction stalls until a station or buffer is
freed. If the operands are not in the registers, enter the functional units that will produce the
operands into the Qj and Qk fields. This step renames registers, eliminating WAR and WAW
hazards.

2. Execute—If one or more of the operands is not yet available, monitor the common data
bus (CDB) while waiting for it to be computed. When an operand becomes available, it is
placed into the corresponding reservation station. When all the operands are available, the
operation can be executed at the corresponding functional unit. By delaying instruction
execution until the operands are available, RAW hazards are avoided.
        Loads and stores require a two-step execution process. The first step computes the
effective address when the base register is available, and the effective address is then placed
in the load or store buffer. Loads in the load buffer execute as soon as the memory unit is
available. Stores in the store buffer wait for the value to be stored before being sent to the
memory unit. Loads and stores are maintained in program order through the effective address
calculation, which will help to prevent hazards through memory.

3. Write result—When the result is available, write it on the CDB and from there into the
registers and into any reservation stations (including store buffers) waiting for this result.
        The data structures used to detect and eliminate hazards are attached to the reservation
stations, to the register file, and to the load and store buffers with slightly different
information attached to different objects. Each reservation station has seven fields:

Op—The operation to perform on source operands S1 and S2.
Qj, Qk—The reservation stations that will produce the corresponding source operand; a
value of zero indicates that the source operand is already available in Vj or Vk, or is
unnecessary.
Vj, Vk—The value of the source operands. Note that only one of the V field or the Q field is
valid for each operand. For loads, the Vk field is used to hold the offset from the instruction.
A–used to hold information for the memory address calculation for a load or store. Initially,
the immediate field of the instruction is stored here; after the address calculation, the
effective address is stored here.
Busy—Indicates that this reservation station and its accompanying functional unit are
occupied.
The register file has a field, Qi:
Qi—The number of the reservation station that contains the operation whose result should be
stored into this register. If the value of Qi is blank (or 0), no currently active instruction is
computing a result destined for this register, meaning that the value is simply the register
contents.
        The load and store buffers each have a field, A, which holds the result of the effective
address once the first step of execution has been completed.
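        The bookkeeping described above can be pictured with a small sketch. It is only an
illustration under stated assumptions: the names ReservationStation, issue, and write_result
are hypothetical, stations are numbered from 1 so that 0 can mean "value available", the
register status (the Qi fields) is kept in a dictionary, and execution itself is not modeled.

```python
from dataclasses import dataclass

@dataclass
class ReservationStation:
    op: str = ""
    qj: int = 0        # station that will produce operand j; 0 means vj is valid
    qk: int = 0        # station that will produce operand k; 0 means vk is valid
    vj: float = 0.0
    vk: float = 0.0
    busy: bool = False

def issue(stations, regs, reg_status, op, dest, src1, src2):
    """Issue step: claim a free station, or return None on a structural hazard."""
    free = [num for num, rs in stations.items() if not rs.busy]
    if not free:
        return None
    num = free[0]
    rs = stations[num]
    rs.op, rs.busy = op, True
    # Read each source now if no producer is pending, otherwise record the producer.
    rs.qj = reg_status.get(src1, 0)
    rs.vj = regs[src1] if rs.qj == 0 else 0.0
    rs.qk = reg_status.get(src2, 0)
    rs.vk = regs[src2] if rs.qk == 0 else 0.0
    reg_status[dest] = num      # renaming: dest now refers to this station's result
    return num

def write_result(stations, regs, reg_status, num, value):
    """Write-result step: broadcast (station number, value) on the CDB."""
    for rs in stations.values():
        if rs.busy and rs.qj == num:
            rs.vj, rs.qj = value, 0
        if rs.busy and rs.qk == num:
            rs.vk, rs.qk = value, 0
    for reg in reg_status:
        if reg_status[reg] == num:   # only the latest writer updates the register
            regs[reg] = value
            reg_status[reg] = 0
    stations[num].busy = False

stations = {1: ReservationStation(), 2: ReservationStation()}
regs = {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 0.0}
reg_status = {}                  # Qi field of each register; a missing key means 0
add = issue(stations, regs, reg_status, "ADD", "F0", "F2", "F4")
mul = issue(stations, regs, reg_status, "MUL", "F6", "F0", "F2")    # waits on station 1
stalled = issue(stations, regs, reg_status, "SUB", "F4", "F2", "F2")  # no free station
write_result(stations, regs, reg_status, add, 6.0)
```

In this run the multiply picks up F0 from the common data bus rather than from the register
file, and the third instruction stalls because both stations are occupied.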
C) Reducing Branch Costs:
        The goal of dynamic hardware prediction is to allow the processor to resolve the
outcome of a branch early, thus preventing control dependences from causing stalls. The
effectiveness of a branch prediction scheme depends not only on the accuracy, but also on the
cost of a branch when the prediction is correct and when the prediction is incorrect.

Basic Branch Prediction and Branch-Prediction Buffers:
        The simplest dynamic branch-prediction scheme is a branch-prediction buffer or
branch history table. A branch-prediction buffer is a small memory indexed by the lower
portion of the address of the branch instruction. The memory contains a bit that says whether
the branch was recently taken or not. This scheme is the simplest sort of buffer; it has no tags
and is useful only to reduce the branch delay when it is longer than the time to compute the
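possible target PCs. We don’t know, in fact, if the prediction is correct: because the buffer
has no tags, the bit may have been set by another branch with the same low-order address bits.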
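        A minimal sketch of such an untagged buffer follows. The class name OneBitBHT, the
table size of 16 entries, and the assumption of 4-byte instructions (so the two lowest address
bits are dropped before indexing) are illustrative choices, not taken from the text.

```python
class OneBitBHT:
    """Untagged branch-prediction buffer with one bit per entry."""

    def __init__(self, entries=16):
        self.entries = entries            # assumed to be a power of two
        self.bits = [0] * entries         # 0 = not taken, 1 = taken

    def index(self, pc):
        # Assumption: 4-byte instructions, so drop the byte offset first.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return bool(self.bits[self.index(pc)])

    def update(self, pc, taken):
        self.bits[self.index(pc)] = int(taken)

bht = OneBitBHT(entries=16)
bht.update(0x1000, True)          # a branch at 0x1000 was taken
aliased = bht.predict(0x1040)     # a different branch that maps to the same entry
```

Because there are no tags, the branch at 0x1040 inherits the prediction left by the branch at
0x1000.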
        The prediction is a hint that is assumed to be correct, and fetching begins in the
predicted direction. If the hint turns out to be wrong, the prediction bit is inverted and stored
back. This simple one-bit prediction scheme has a performance shortcoming: Even if a
branch is almost always taken, we will likely predict incorrectly twice, rather than once,
when it is not taken. To remedy this, two-bit prediction schemes are often used. In a two-bit
scheme, a prediction must miss twice before it is changed. Figure 3.7 shows the finite-state
machine for a two-bit prediction scheme.
                    FIGURE 3.7 The states in a two-bit prediction scheme.

        The two-bit scheme is actually a specialization of a more general scheme that has an
n-bit saturating counter for each entry in the prediction buffer. With an n-bit counter, the
counter can take on values between 0 and $2^n - 1$: when the counter is greater than or equal to
one half of its maximum value ($2^{n-1}$), the branch is predicted as taken; otherwise, it is
predicted untaken. As in the two-bit scheme, the counter is incremented on a taken branch
and decremented on an untaken branch.
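        The counter scheme can be sketched as follows. This is a plain saturating counter,
which may differ in detail from the state diagram of Figure 3.7; the class name, the initial
counter value (strongly taken), and the loop-branch test pattern (taken nine times, then not
taken, repeated three times) are assumptions made for illustration.

```python
class SaturatingCounterPredictor:
    def __init__(self, n_bits, initial=None):
        self.max = (1 << n_bits) - 1            # 2^n - 1
        self.threshold = 1 << (n_bits - 1)      # 2^(n-1)
        self.counter = self.max if initial is None else initial

    def predict(self):
        return self.counter >= self.threshold   # True means "predict taken"

    def update(self, taken):
        if taken:
            self.counter = min(self.counter + 1, self.max)
        else:
            self.counter = max(self.counter - 1, 0)

def count_mispredictions(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses

loop_branch = ([True] * 9 + [False]) * 3
one_bit_misses = count_mispredictions(SaturatingCounterPredictor(1), loop_branch)
two_bit_misses = count_mispredictions(SaturatingCounterPredictor(2), loop_branch)
```

On this pattern the one-bit predictor misses twice per loop execution after the first (once on
exit and once on re-entry), while the two-bit predictor misses only on each loop exit.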
        A branch-prediction buffer can be implemented as a small, special “cache” accessed
with the instruction address during the IF pipe stage. If the instruction is decoded as a branch
and if the branch is predicted as taken, fetching begins from the target as soon as the PC is
known. Otherwise, sequential fetching and executing continue. Although this scheme is
useful for most pipelines, the five-stage, classic pipeline finds out both whether the branch is
taken and what the target of the branch is at roughly the same time, assuming no hazard in
accessing the register specified in the conditional branch.
        What kind of accuracy can be expected from a branch-prediction buffer using two bits
per entry on real applications? For the SPEC89 benchmarks a branch prediction buffer with
4096 entries results in a prediction accuracy ranging from over 99% to 82%, or a
misprediction rate of 1% to 18%.

Correlating Branch Predictors:
        These two-bit predictor schemes use only the recent behavior of a single branch to
predict the future behavior of that branch. It may be possible to improve the prediction
accuracy if we also look at the recent behavior of other branches rather than just the branch
we are trying to predict. Consider a small code fragment from the SPEC92 benchmark eqntott
(the worst case for the two-bit predictor):

               if (aa==2)
                    aa=0;
               if (bb==2)
                    bb=0;
               if (aa!=bb) {
      Here is the MIPS code that we would typically generate for this code fragment
assuming that aa and bb are assigned to registers R1 and R2:

              DSUBUI R3,R1,#2
              BNEZ R3,L1 ;branch b1 (aa!=2)
              DADD R1,R0,R0 ;aa=0
         L1: DSUBUI R3,R2,#2
              BNEZ R3,L2 ;branch b2(bb!=2)
              DADD R2,R0,R0 ; bb=0
         L2: DSUBU R3,R1,R2 ;R3=aa-bb
              BEQZ R3,L3 ;branch b3 (aa==bb)

        Let’s label these branches b1, b2, and b3. The key observation is that the behavior of
branch b3 is correlated with the behavior of branches b1 and b2. Clearly, if branches b1 and
b2 are both not taken (i.e., the if conditions both evaluate to true and aa and bb are both
assigned 0), then b3 will be taken, since aa and bb are clearly equal. Branch predictors that
use the behavior of other branches to make a prediction are called correlating predictors or
two-level predictors. To see how such predictors work, let’s choose a simple hypothetical
case. Consider the following simplified code fragment (chosen for illustrative purposes):

              if (d==0)
                   d=1;
              if (d==1)
       Here is the typical code sequence generated for this fragment, assuming that d is
assigned to R1:
              BNEZ R1,L1 ;branch b1 (d!=0)
              DADDIU R1,R0,#1 ;d==0, so d=1
        L1: DADDIU R3,R1,#-1
              BNEZ R3,L2 ;branch b2 (d!=1)
       The branches corresponding to the two if statements are labeled b1 and b2. The
possible sequences for an execution of this fragment, assuming d has values 0, 1, and 2, are
shown in Figure 3.10.

               FIGURE 3.10 Possible execution sequences for a code fragment.

        A one-bit predictor initialized to not taken has the behavior shown in Figure 3.11 when d
alternates between 2 and 0. As the figure shows, all the branches are mispredicted!
              FIGURE 3.11 Behavior of a one-bit predictor initialized to not taken.
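        The behavior of the one-bit predictor on this fragment can be checked with a short
simulation. The sketch assumes d alternates between 2 and 0 and that each branch has its own
prediction bit, initialized to not taken; the function names are hypothetical.

```python
def run_fragment(d):
    """Return (b1_taken, b2_taken) for one execution of the fragment."""
    b1_taken = d != 0          # BNEZ R1,L1
    if not b1_taken:
        d = 1                  # d==0, so d=1
    b2_taken = d != 1          # BNEZ R3,L2 with R3 = d - 1
    return b1_taken, b2_taken

def simulate_one_bit(d_values):
    prediction = {"b1": False, "b2": False}    # False = predict not taken
    mispredictions = 0
    total = 0
    for d in d_values:
        for name, taken in zip(("b1", "b2"), run_fragment(d)):
            total += 1
            if prediction[name] != taken:
                mispredictions += 1
            prediction[name] = taken           # one-bit scheme: remember last outcome
    return mispredictions, total

result = simulate_one_bit([2, 0, 2, 0])
```

Every one of the eight branch executions is mispredicted, because each branch simply alternates
between taken and not taken.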

        Alternatively, consider a predictor that uses one bit of correlation. The easiest way to
think of this is that every branch has two separate prediction bits: one prediction assuming the
last branch executed was not taken and another prediction that is used if the last branch
executed was taken. We write the pair of prediction bits together, with the first bit being the
prediction if the last branch in the program is not taken and the second bit being the
prediction if the last branch in the program is taken.

        FIGURE 3.12 Combinations and meaning of the taken/not taken prediction bits.
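        A predictor with one bit of correlation can be simulated in the same way. The sketch
below again assumes d alternates between 2 and 0, all prediction bits start at not taken, and
the outcome of the "last branch" before the first execution is taken to be not taken; these
starting conditions and the function names are assumptions.

```python
def run_fragment(d):
    """Return (b1_taken, b2_taken) for one execution of the fragment."""
    b1_taken = d != 0
    if not b1_taken:
        d = 1
    b2_taken = d != 1
    return b1_taken, b2_taken

def simulate_correlating(d_values):
    # bits[name][h] is the prediction for branch `name` when the last branch
    # executed had outcome h (False = not taken, True = taken).
    bits = {"b1": {False: False, True: False},
            "b2": {False: False, True: False}}
    last_taken = False                 # assumed initial history
    mispredictions = 0
    total = 0
    for d in d_values:
        for name, taken in zip(("b1", "b2"), run_fragment(d)):
            total += 1
            if bits[name][last_taken] != taken:
                mispredictions += 1
            bits[name][last_taken] = taken   # update only the bit that was used
            last_taken = taken
    return mispredictions, total

result = simulate_correlating([2, 0, 2, 0])
```

Under these assumptions only the two branches of the first execution are mispredicted; after
that, the outcome of the preceding branch selects the correct prediction bit every time.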

        The simplicity of the hardware comes from a simple observation: The global history
of the most recent m branches can be recorded in an m-bit shift register, where each bit
records whether the branch was taken or not taken. The branch-prediction buffer can then be
indexed using a concatenation of the low-order bits from the branch address with the m-bit
global history.
     FIGURE 3.14 A (2,2) branch-prediction buffer uses a two-bit global history to choose
                   from among four predictors for each branch address.
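        The indexing just described can be sketched as follows. The function names, the
assumption of 4-byte instructions, and the choice of 4 low-order address bits are illustrative;
the history register keeps the most recent outcome in its lowest bit.

```python
def update_history(global_history, taken, m):
    """Shift the newest outcome into an m-bit global history register."""
    return ((global_history << 1) | int(taken)) & ((1 << m) - 1)

def predictor_index(branch_address, global_history, addr_bits, m):
    """Concatenate low-order branch-address bits with the m-bit global history."""
    low = (branch_address >> 2) & ((1 << addr_bits) - 1)   # assumes 4-byte instructions
    return (low << m) | (global_history & ((1 << m) - 1))

def total_bits(addr_bits, m, n):
    """Storage for an (m, n) predictor: entries x histories x bits per counter."""
    return (1 << addr_bits) * (1 << m) * n

history = 0
for outcome in (True, False, True):
    history = update_history(history, outcome, m=2)

index = predictor_index(0x4C, history, addr_bits=4, m=2)
size = total_bits(addr_bits=4, m=2, n=2)
```

With 4 address bits, a (2,2) predictor needs 16 x 4 x 2 = 128 bits of state.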

Tournament Predictors: Adaptively Combining Local and Global Predictors:
        The primary motivation for correlating branch predictors came from the observation
that the standard 2-bit predictor using only local information failed on some important
branches and that by adding global information, the performance could be improved.
Tournament predictors take this insight to the next level, by using multiple predictors, usually
one based on global information and one based on local information, and combining them
with a selector. Tournament predictors can both achieve better accuracy at medium sizes
(8K to 32K bits) and make effective use of very large numbers of prediction bits.
        Tournament predictors are the most popular form of multilevel branch predictors. A
multilevel branch predictor uses several levels of branch prediction tables together with an
algorithm for choosing among the multiple predictors. A tournament predictor typically uses a
two-bit saturating counter per branch as the selector; the four states of the counter dictate
whether to use predictor 1 or predictor 2. The state transition diagram is shown in Figure
3.16.
     FIGURE 3.16 The state transition diagram for a tournament predictor has four states
                         corresponding to which predictor to use.
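        A selector of this kind can be sketched with a two-bit saturating counter. The
encoding used here (states 0 and 1 select predictor 1, states 2 and 3 select predictor 2) and
the class name are assumptions; the counter moves only when exactly one of the two predictors
was correct.

```python
class TournamentSelector:
    def __init__(self):
        self.counter = 0     # assumed encoding: 0,1 -> predictor 1; 2,3 -> predictor 2

    def choose(self, pred1, pred2):
        return pred2 if self.counter >= 2 else pred1

    def update(self, pred1, pred2, taken):
        correct1 = pred1 == taken
        correct2 = pred2 == taken
        if correct2 and not correct1:
            self.counter = min(self.counter + 1, 3)
        elif correct1 and not correct2:
            self.counter = max(self.counter - 1, 0)
        # If both were right or both were wrong, the state is unchanged.

selector = TournamentSelector()
choices = []
for taken in (True, True, True):
    p1, p2 = False, True             # predictor 1 is always wrong here, predictor 2 right
    choices.append(selector.choose(p1, p2))
    selector.update(p1, p2, taken)
```

After two cases in which only predictor 2 is correct, the selector switches to it.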
D) High Performance Instruction Delivery:
        In a high performance pipeline, especially one with multiple issue, predicting
branches well is not enough: we actually have to be able to deliver a high bandwidth
instruction stream. In recent multiple issue processors, this has meant delivering 4-8
instructions every clock cycle. We consider three concepts: a branch target buffer, an
integrated instruction fetch unit, and dealing with indirect branches, by predicting
return addresses.

Branch Target Buffers:
         To reduce the branch penalty for our five-stage pipeline, we need to know from what
address to fetch by the end of IF. This requirement means we must know whether the as-yet-
undecoded instruction is a branch and, if so, what the next PC should be. If the instruction is
a branch and we know what the next PC should be, we can have a branch penalty of zero. A
branch-prediction cache that stores the predicted address for the next instruction after a
branch is called a branch-target buffer or branch-target cache.
         For a branch-target buffer, we access the buffer during the IF stage using the
instruction address of the fetched instruction, a possible branch, to index the buffer. If we get
a hit, then we know the predicted instruction address at the end of the IF cycle, which is one
cycle earlier than for a branch-prediction buffer. Because we are predicting the next
instruction address and will send it out before decoding the instruction, we must know
whether the fetched instruction is predicted as a taken branch. Figure 3.19 shows what the
branch-target buffer looks like. If the PC of the fetched instruction matches a PC in the
buffer, then the corresponding predicted PC is used as the next PC.
                              FIGURE 3.19 A branch-target buffer.

        If a matching entry is found in the branch-target buffer, fetching begins immediately
at the predicted PC. If we did not check whether the entry matched this PC, then the wrong
PC would be sent out for instructions that were not branches, resulting in a slower processor.
We only need to store the predicted taken branches in the branch-target buffer, since an
untaken branch follows the same strategy (fetch the next sequential instruction) as a
nonbranch.
        Complications arise when we are using a two-bit predictor, since this requires that we
store information for both taken and untaken branches. One way to resolve this is to use both
a target buffer and a prediction buffer, which is the solution used by several PowerPC
processors. Figure 3.20 shows the steps followed when using a branch-target buffer and
where these steps occur in the pipeline. From this we can see that there will be no branch
delay if a branch-prediction entry is found in the buffer and is correct. Otherwise, there will
be a penalty of at least two clock cycles. In practice, this penalty could be larger, since the
branch-target buffer must be updated.
    FIGURE 3.20 The steps involved in handling an instruction with a branch-target buffer.
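        The lookup and update steps can be sketched with a small model. A dictionary keyed by
the full branch PC stands in for the tag match; the class name, the method names, and the
assumption of 4-byte instructions are illustrative, and only predicted-taken branches are kept
in the buffer.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}    # branch PC -> predicted target PC

    def next_pc(self, pc):
        """IF stage: use the predicted PC on a hit, else the sequential PC."""
        return self.entries.get(pc, pc + 4)     # assumes 4-byte instructions

    def resolve(self, pc, taken, target):
        """Once the branch outcome is known, insert taken branches, drop untaken ones."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)

btb = BranchTargetBuffer()
first = btb.next_pc(0x100)           # miss: fetch falls through to 0x104
btb.resolve(0x100, True, 0x200)      # branch turned out to be taken
second = btb.next_pc(0x100)          # hit: fetch continues at the target
```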

        To evaluate how well a branch-target buffer works, we first must determine the
penalties in all possible cases. Figure 3.21 contains this information.
     FIGURE 3.21 Penalties for all possible combinations of whether the branch is in the
    buffer and what it actually does, assuming we store only taken branches in the buffer.
One variation on the branch-target buffer is to store one or more target instructions instead
of, or in addition to, the predicted target address. This variation has two potential
advantages. First, it allows the branch-target buffer access to take longer than the time
between successive instruction fetches, possibly allowing a larger branch-target buffer.
Second, buffering the actual target instructions allows us to perform an optimization called
branch folding. Branch folding can be used to obtain zero-cycle unconditional branches, and
sometimes zero-cycle conditional branches.

Integrated Instruction Fetch Units:
        To meet the demands of multiple-issue processors, many recent designers have chosen
to implement an integrated instruction fetch unit, as a separate autonomous unit that feeds
instructions to the rest of the pipeline. Recent designs have used an integrated instruction
fetch unit that integrates several functions:

1. Integrated branch prediction: The branch predictor becomes part of the instruction fetch
unit and is constantly predicting branches, so as to drive the fetch pipeline.

2. Instruction prefetch: To deliver multiple instructions per clock, the instruction fetch unit
will likely need to fetch ahead. The unit autonomously manages the prefetching of
instructions, integrating it with branch prediction.

3. Instruction memory access and buffering: When fetching multiple instructions per cycle,
a variety of complexities are encountered, including the difficulty that fetching multiple
instructions may require accessing multiple cache lines. The instruction fetch unit
encapsulates this complexity, using prefetch to try to hide the cost of crossing cache blocks.

Return Address Predictors:
         Another method that designers have studied and included in many recent processors is
a technique for predicting indirect jumps, that is, jumps whose destination address varies at
runtime. Though procedure returns can be predicted with a branch-target buffer, the accuracy
of such a prediction technique can be low if the procedure is called from multiple sites and
the calls from one site are not clustered in time. To overcome this problem, the concept of a
small buffer of return addresses operating as a stack has been proposed. This structure caches
the most recent return addresses: pushing a return address on the stack at a call and popping
one off at a return. If the cache is sufficiently large (i.e., as large as the maximum call depth),
it will predict the returns perfectly.
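        A return-address stack of this kind can be sketched in a few lines. The class name,
the default depth of 8, and the policy of discarding the oldest entry on overflow are
assumptions made for illustration.

```python
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # overflow: discard the oldest entry (assumed policy)
        self.stack.append(return_address)

    def predict_return(self):
        """Pop the most recent return address, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack(depth=2)
ras.on_call(0x104)                       # outer call
ras.on_call(0x204)                       # nested call
inner = ras.predict_return()
outer = ras.predict_return()
```

As long as the call depth does not exceed the stack depth, each return is predicted correctly
regardless of how many different sites call the procedure.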
         Branch prediction schemes are limited both by prediction accuracy and by the penalty
for misprediction. As we have seen, typical prediction schemes achieve prediction accuracy
in the range of 80–95% depending on the type of program and the size of the buffer. In
addition to trying to increase the accuracy of the predictor, we can try to reduce the penalty
for misprediction. The penalty can be reduced by fetching from both the predicted and
unpredicted directions. Fetching both paths requires that the memory system be dual-ported,
have an interleaved cache, or fetch from one path and then the other.

E) Hardware-Based Speculation:
        As we try to exploit more instruction level parallelism, maintaining control
dependences becomes an increasing burden. Branch prediction reduces the direct stalls
attributable to branches, but for a processor executing multiple instructions per clock, just
predicting branches accurately may not be sufficient to generate the desired amount of
instruction level parallelism. Hence, exploiting more parallelism requires that we overcome
the limitation of control dependence.
        Overcoming control dependence is done by speculating on the outcome of branches
and executing the program as if our guesses were correct. This mechanism represents a
subtle, but important, extension over branch prediction with dynamic scheduling. In
particular, with speculation, we fetch, issue, and execute instructions, as if our branch
predictions were always correct; dynamic scheduling only fetches and issues such
instructions. Of course, we need mechanisms to handle the situation where the speculation is
incorrect.
        Hardware-based speculation combines three key ideas: dynamic branch prediction to
choose which instructions to execute, speculation to allow the execution of instructions
before the control dependences are resolved, and dynamic scheduling to deal with the
scheduling of different combinations of basic blocks. Hardware-based speculation follows the
predicted flow of data values to choose when to execute instructions. This method of
executing programs is essentially a data-flow execution: operations execute as soon as their
operands are available.
        The hardware that implements Tomasulo’s algorithm can be extended to support
speculation. To do so, we must separate the bypassing of results among instructions, which is
needed to execute an instruction speculatively, from the actual completion of an instruction.
By making this separation, we can allow an instruction to execute and to bypass its results to
other instructions, without allowing the instruction to perform any updates that cannot be
undone, until we know that the instruction is no longer speculative. When an instruction is no
longer speculative, we allow it to update the register file or memory; we call this additional
step in the instruction execution sequence instruction commit.
        The key idea behind implementing speculation is to allow instructions to execute out
of order but to force them to commit in order and to prevent any irrevocable action until an
instruction commits. Adding this commit phase to the instruction execution sequence requires
some changes to the sequence as well as an additional set of hardware buffers that hold the
results of instructions that have finished execution but have not committed. This hardware
buffer, which we call the reorder buffer, is also used to pass results among instructions that
may be speculated.
        The reorder buffer (ROB, for short) provides additional registers in the same way as
the reservation stations in Tomasulo’s algorithm extend the register set. The ROB holds the
result of an instruction between the time the operation associated with the instruction
completes and the time the instruction commits. Hence, the ROB is a source of operands for
instructions, just as the reservation stations provide operands in Tomasulo’s algorithm. The
key difference is that in Tomasulo’s algorithm, once an instruction writes its result, any
subsequently issued instructions will find the result in the register file. With speculation, the
register file is not updated until the instruction commits; thus, the ROB supplies operands in the
interval between completion of instruction execution and instruction commit.
        Each entry in the ROB contains three fields: the instruction type, the destination
field, and the value field. The instruction-type field indicates whether the instruction is a
branch (and has no destination result), a store (which has a memory address destination), or a
register operation. The destination field supplies the register number or the memory address
(for stores), where the instruction results should be written. The value field is used to hold the
value of the instruction result until the instruction commits. Here are the four steps involved
in instruction execution:

1. Issue—Get an instruction from the instruction queue. Issue the instruction if there is an
empty reservation station and an empty slot in the ROB; send the operands to the reservation
station if they are available in either the registers or the ROB. Update the control entries to
indicate the buffers are in use. The number of the ROB entry allocated for the result is also sent to
the reservation station, so that the number can be used to tag the result when it is placed on
the CDB. If either all reservation stations are full or the ROB is full, then instruction issue is stalled
until both have available entries. This stage is sometimes called dispatch in a dynamically
scheduled processor.

2. Execute—If one or more of the operands is not yet available, monitor the CDB (common
data bus) while waiting for the register to be computed. This step checks for RAW hazards.
When both operands are available at a reservation station, execute the operation.

3. Write result—When the result is available, write it on the CDB (with the ROB tag sent
when the instruction issued) and from the CDB into the ROB, as well as to any reservation
stations waiting for this result. Mark the reservation station as available.

4. Commit—There are three different sequences of actions at commit depending on
whether the committing instruction is: a branch with an incorrect prediction, a store, or any
other instruction (normal commit). The normal commit case occurs when an instruction
reaches the head of the ROB and its result is present in the buffer; at this point, the processor
updates the register with the result and removes the instruction from the ROB. Committing a
store is similar except that memory is updated rather than a result register. When a branch
with incorrect prediction reaches the head of the ROB, it indicates that the speculation was
wrong. The ROB is flushed and execution is restarted at the correct successor of the branch.
If the branch was correctly predicted, the branch is finished. Some machines call this commit
phase completion or graduation.
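        The commit step can be sketched as a queue that is drained from the head. The entry
layout follows the three fields named above; the extra ready and mispredicted flags, the names
ROBEntry and commit, and the use of dictionaries for registers and memory are assumptions made
for illustration.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class ROBEntry:
    kind: str                   # "branch", "store", or "reg"
    dest: object = None         # register name, or memory address for a store
    value: object = None
    ready: bool = False         # assumed flag: the result has been written to the ROB
    mispredicted: bool = False  # assumed flag: only meaningful for branches

def commit(rob, regs, memory):
    """Commit finished instructions from the head of the ROB, in program order.
    Returns True if a mispredicted branch caused the ROB to be flushed."""
    while rob and rob[0].ready:
        entry = rob.popleft()
        if entry.kind == "branch":
            if entry.mispredicted:
                rob.clear()             # discard all speculative work after the branch
                return True
        elif entry.kind == "store":
            memory[entry.dest] = entry.value
        else:
            regs[entry.dest] = entry.value
    return False

regs, memory = {}, {}
rob = deque([
    ROBEntry("reg", "F0", 1.5, ready=True),
    ROBEntry("reg", "F2", 2.5, ready=False),    # still executing
    ROBEntry("store", 0x100, 9.0, ready=True),  # finished, but must wait its turn
])
flushed_first = commit(rob, regs, memory)
state_after_first = (dict(regs), dict(memory), len(rob))
rob[0].ready = True                              # the second instruction completes
flushed_second = commit(rob, regs, memory)
```

The store has finished executing in the first call, yet memory is not updated until the
instruction ahead of it has committed.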
        Figure 3.29 shows the hardware structure of the processor including the ROB.
     FIGURE 3.29 The basic structure of a MIPS FP unit using Tomasulo’s algorithm and
                             extended to handle speculation.
         Once an instruction commits, its entry in the ROB is reclaimed and the register or
memory destination is updated, eliminating the need for the ROB entry. If the ROB fills, we
simply stop issuing instructions until an entry is made free. In practice, machines that
speculate try to recover as early as possible after a branch is mispredicted. This recovery can
be done by clearing the ROB for all entries that appear after the mispredicted branch,
allowing those that are before the branch in the ROB to continue, and restarting the fetch at
the correct branch successor. In speculative processors, however, performance is more
sensitive to the branch prediction mechanisms, since the impact of a misprediction will be
higher. Thus, all the aspects of handling branches—prediction accuracy, misprediction
detection, and misprediction recovery—increase in importance.
         In Tomasulo’s algorithm, a store can update memory when it reaches Write Result
and the data value to store is available. In a speculative processor, a store updates memory
only when it reaches the head of the ROB. This difference ensures that memory is not
updated until an instruction is no longer speculative.
         As in Tomasulo’s algorithm, we must avoid hazards through memory. WAW and
WAR hazards through memory are eliminated with speculation, because the actual updating
of memory occurs in order, when a store is at the head of the ROB, and hence, no earlier
loads or stores can still be pending. RAW hazards through memory are maintained by two
restrictions:
1. Not allowing a load to initiate the second step of its execution if any active ROB entry
occupied by a store has a Destination field that matches the value of the A field of the load, and
2. Maintaining the program order for the computation of an effective address of a load with
respect to all earlier stores.
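        The first restriction amounts to an address comparison against earlier stores in the
ROB. A minimal sketch, assuming each active ROB entry ahead of the load is summarized as a
(kind, destination) pair and using the hypothetical name load_may_proceed:

```python
def load_may_proceed(load_address, earlier_entries):
    """Return False if an earlier, uncommitted store targets the load's address.

    earlier_entries holds (kind, destination) pairs for the active ROB entries
    that precede the load in program order.
    """
    return not any(kind == "store" and dest == load_address
                   for kind, dest in earlier_entries)

pending = [("reg", "F0"), ("store", 0x100), ("branch", None)]
can_load_0x100 = load_may_proceed(0x100, pending)   # conflicts with the pending store
can_load_0x104 = load_may_proceed(0x104, pending)   # no matching store
```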

Multiple Issue with Speculation:
        The two challenges of multiple issue with Tomasulo’s algorithm, instruction issue
and monitoring the CDBs for instruction completion, become the major challenges for
multiple issue with speculation. In addition, to maintain throughput of greater than one
instruction per cycle, a speculative processor must be able to handle multiple instruction
commits per clock cycle.

Design Considerations for Speculative Machines:

1) Register renaming versus Reorder Buffers:
        One alternative to the use of a ROB is the explicit use of a larger physical set of
registers combined with register renaming. This approach builds on the concept of renaming
used in Tomasulo’s algorithm, but extends it. In Tomasulo’s algorithm, the values of the
architecturally visible registers (R0,..., R31 and F0,...,F31) are contained, at any point in
execution, in some combination of the register set and the reservation stations. With the
addition of speculation, register values may also temporarily reside in the ROB. In either
case, if the processor does not issue new instructions for a period of time, all existing
instructions will commit, and the register values will appear in the register file, which directly
corresponds to the architecturally visible registers.
        In the register renaming approach, an extended set of physical registers is used to hold
both the architecturally visible registers as well as temporary values. Thus, the extended
registers replace the function both of the ROB and the reservation stations. An advantage of
the renaming approach versus the ROB approach is that instruction commit is simplified,
since it requires only two simple actions: record that the mapping between an architectural
register number and physical register number is no longer speculative, and free up any
physical registers being used to hold the “older” value of the architectural register.
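        The renaming approach can be sketched with a map table and a free list. The class
name RenameTable and the policy of freeing the older physical register when the overwriting
instruction commits are assumptions; with in-order commit this policy is a common
simplification, since every earlier reader of the old value has already committed by then.

```python
class RenameTable:
    def __init__(self, arch_regs, num_physical):
        # Initially architectural register i lives in physical register i.
        self.map = {r: i for i, r in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, sources):
        """Return (new physical dest, physical sources, previous physical dest)."""
        phys_sources = [self.map[s] for s in sources]   # read mappings before remapping dest
        if not self.free:
            raise RuntimeError("no free physical register: issue must stall")
        old = self.map[dest]
        new = self.free.pop(0)
        self.map[dest] = new
        return new, phys_sources, old

    def commit(self, old_physical):
        """At commit, the register holding the older value can be reused."""
        self.free.append(old_physical)

table = RenameTable(["R1", "R2", "R3"], num_physical=6)
i1 = table.rename("R1", ["R2", "R3"])    # R1 = R2 op R3
i2 = table.rename("R2", ["R1", "R3"])    # WAR on R2 with i1 disappears
i3 = table.rename("R1", ["R2", "R2"])    # WAW on R1 with i1 disappears
```

The two writes to R1 land in different physical registers, so they can complete in either
order without interfering.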
        With register renaming, deallocating registers is more complex, since before we free
up a physical register, we must know that it no longer corresponds to an architectural
register, and that no further uses of the physical register are outstanding. In addition to
simplifying instruction commit, a renaming approach means that instruction issue need not
examine both the ROB and the register file for an operand, since all results are in the register
file.

2) How much to speculate:
       One of the significant advantages of speculation is its ability to uncover events that
would otherwise stall the pipeline early, such as cache misses. This potential advantage,
however, comes with a significant potential disadvantage: the processor may speculate that
some costly exceptional event occurs and begin processing the event, when in fact, the
speculation was incorrect.

3) Speculating through multiple branches:
        Three different situations can benefit from speculating on multiple branches
simultaneously: a very high branch frequency, significant clustering of branches, and long
delays in functional units. Speculating on multiple branches slightly complicates the process
of speculation recovery.

F) Limitations of ILP:

1) The Hardware Model:
        An ideal processor is one where all artificial constraints on ILP are removed. The only
limits on ILP in such a processor are those imposed by the actual data flows either through
registers or memory. The assumptions made for an ideal or perfect processor are as follows:

1. Register renaming—There are an infinite number of virtual registers available and hence
all WAW and WAR hazards are avoided and an unbounded number of instructions can begin
execution simultaneously.

2. Branch prediction—Branch prediction is perfect. All conditional branches are predicted
exactly.

3. Jump prediction—All jumps (including jump register used for return and computed
jumps) are perfectly predicted.

4. Memory-address alias analysis—All memory addresses are known exactly and a load
can be moved before a store provided that the addresses are not identical.
       Assumptions 2 and 3 eliminate all control dependences. Likewise, assumptions 1 and
4 eliminate all but the true data dependences. Together, these four assumptions mean that
any instruction in the program’s execution can be scheduled on the cycle immediately
following the execution of the predecessor on which it depends.
       Initially, we examine a processor that can issue an unlimited number of instructions at
once, looking arbitrarily far ahead in the computation. For all the processor models we
examine, there are no restrictions on what types of instructions can execute in a cycle. For the
unlimited-issue case, this means there may be an unlimited number of loads or stores issuing
in one clock cycle. In addition, all functional unit latencies are assumed to be one cycle.
Latencies longer than one cycle would decrease the number of issues per
cycle, although not the number of instructions under execution at any point.
        Finally, we assume perfect caches, which is equivalent to saying that all loads and
stores always complete in one cycle.
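        Under these assumptions the schedule of an instruction depends only on its true data
dependences. A minimal sketch, with instructions written as hypothetical (destination, sources)
pairs, one-cycle latencies, and values that are live on entry treated as ready in cycle 0:

```python
def dataflow_schedule(instructions):
    """Return the cycle in which each instruction executes on the ideal processor.

    instructions is a list of (dest, sources) pairs in program order. Only true
    (RAW) dependences delay an instruction; WAR and WAW hazards are ignored,
    as they would be with unlimited register renaming.
    """
    produced_in = {}     # register -> cycle in which its most recent value was produced
    cycles = []
    for dest, sources in instructions:
        cycle = 1 + max((produced_in.get(s, 0) for s in sources), default=0)
        cycles.append(cycle)
        produced_in[dest] = cycle
    return cycles

program = [
    ("R1", ["R2", "R3"]),
    ("R4", ["R1", "R5"]),     # true dependence on the first instruction
    ("R6", ["R7", "R8"]),     # independent
    ("R1", ["R9", "R9"]),     # WAW with the first instruction: no delay
    ("R10", ["R1", "R4"]),    # depends on the second and fourth instructions
]
cycles = dataflow_schedule(program)
ilp = len(program) / max(cycles)
```

Five instructions complete in three cycles here, so the available ILP for this short sequence
is about 1.67 instructions per cycle.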

Limitations on the Window Size and Maximum Issue Count:
        To build a processor that even comes close to perfect branch prediction and perfect
alias analysis requires extensive dynamic analysis, since static compile-time schemes cannot
be perfect. Thus, a dynamic processor might be able to more closely match the amount of
parallelism uncovered by our ideal processor. How close could a real dynamically scheduled,
speculative processor come to the ideal processor? To gain insight into this question, consider
what the perfect processor must do:
1. Look arbitrarily far ahead to find a set of instructions to issue, predicting all branches
perfectly.
2. Rename all register uses to avoid WAR and WAW hazards.
3. Determine whether there are any data dependencies among the instructions in the issue
packet; if so, rename accordingly.
4. Determine if any memory dependences exist among the issuing instructions and handle
them appropriately.
5. Provide enough replicated functional units to allow all the ready instructions to issue.
        Obviously, this analysis is quite complicated. For example, to determine whether n
issuing instructions have any register dependences among them, assuming all instructions are
register-register and the total number of registers is unbounded, requires
(2n - 2) + (2n - 4) + ... + 2 = n^2 - n comparisons.
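
        To give a feel for how quickly this dependence check grows, the following sketch counts
the comparisons by brute force and compares the result with the closed form n^2 - n. It
assumes, for illustration only, that every instruction has two source registers and that each
source is compared against the destination of every earlier instruction in the packet. The
function names are ours, not part of the original material.

```c
#include <assert.h>

/* Count the register comparisons needed to find dependences among n
   issuing instructions. Assumption: each instruction has two source
   registers and one destination register. */
long count_comparisons(long n)
{
    assert(n >= 0);
    long total = 0;
    for (long i = 0; i < n; i++) {
        /* Instruction i compares its 2 sources with i earlier destinations. */
        total += 2 * i;
    }
    return total;
}

/* Closed form of the same sum: 2 * (0 + 1 + ... + (n - 1)) = n^2 - n. */
long closed_form(long n)
{
    return n * n - n;
}
```

For n = 50 both functions give 2450 comparisons, and for a window of 2000 instructions they
give 3,998,000, which is almost four million comparisons.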

        In a real processor, issue occurs in-order and dependent instructions are handled by a
renaming process that accommodates dependent renaming in one clock. Once instructions are
issued, the detection of dependences is handled in a distributed fashion by the reservation
stations or scoreboard. The set of instructions that are examined for simultaneous execution
is called the window.
        The window size directly limits the number of instructions that begin execution in a
given cycle. In practice, real processors will have a more limited number of functional units,
as well as limited numbers of buses and register access ports, which serve as limits on the
number of instructions initiated in the same clock. Thus, the maximum number of
instructions that may issue, begin execution, or commit in the same clock cycle is usually
much smaller than the window size.
        Obviously, the number of possible implementation constraints in a multiple issue
processor is large, including: issues per clock, functional units and unit latency, register file
ports, functional unit queues, issue limits for branches, and limitations on instruction commit.
Each of these acts as a constraint on the ILP. We know that large window sizes are impractical
and inefficient. Thus we will assume a base window size of 2K entries and a maximum issue
capability of 64 instructions per clock.

The Effects of Realistic Branch and Jump Prediction:
        Our ideal processor assumes that branches can be perfectly predicted: The outcome of
any branch in the program is known before the first instruction is executed! Of course, no real
processor can ever achieve this. We assume a separate predictor is used for jumps. The five
levels of branch prediction are:
1. Perfect—All branches and jumps are perfectly predicted at the start of execution.
2. Tournament-based branch predictor—The prediction scheme uses a correlating two-bit
predictor and a noncorrelating two-bit predictor together with a selector, which chooses the
best predictor for each branch. The prediction buffer contains 2^13 (8K) entries, each
consisting of three two-bit fields, two of which are predictors and the third is a selector. The
correlating predictor is indexed using the exclusive-or of the branch address and the global
branch history. The noncorrelating predictor is the standard two-bit predictor indexed by the
branch address. The selector table is also indexed by the branch address and specifies
whether the correlating or noncorrelating predictor should be used. The selector is
incremented or decremented just as we would for a standard two-bit predictor.
3. Standard two-bit predictor with 512 two-bit entries—In addition, we assume a 16-entry
buffer to predict returns.
4. Static—A static predictor uses the profile history of the program and predicts that the
branch is always taken or always not taken based on the profile.
5. None—No branch prediction is used, though jumps are still predicted. Parallelism is
largely limited to within a basic block.
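
        The tournament scheme in item 2 can be sketched in a few lines of C. This is only an
illustrative model under simplifying assumptions: the tables are indexed by the branch address
modulo the table size, the global history is kept in a single word, and all counters start at
"strongly not taken". The type and function names (Tournament, predict, update) are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES 8192u            /* 2^13 entries, as in the text */

typedef struct {
    uint8_t  local[ENTRIES];     /* noncorrelating two-bit counters */
    uint8_t  global[ENTRIES];    /* correlating two-bit counters */
    uint8_t  select[ENTRIES];    /* two-bit selector: >= 2 picks the correlating one */
    uint32_t history;            /* global branch history, newest outcome in bit 0 */
} Tournament;

/* Saturating two-bit counter update (range 0..3). */
static uint8_t bump(uint8_t c, int up)
{
    if (up)
        return (uint8_t)(c < 3 ? c + 1 : 3);
    return (uint8_t)(c > 0 ? c - 1 : 0);
}

/* Return 1 if the branch at address pc is predicted taken. */
int predict(const Tournament *t, uint32_t pc)
{
    uint32_t li = pc % ENTRIES;                  /* indexed by branch address */
    uint32_t gi = (pc ^ t->history) % ENTRIES;   /* address XOR global history */
    uint8_t c = (t->select[li] >= 2) ? t->global[gi] : t->local[li];
    return c >= 2;
}

/* Train all three tables with the actual outcome (taken is 0 or 1). */
void update(Tournament *t, uint32_t pc, int taken)
{
    assert(taken == 0 || taken == 1);
    uint32_t li = pc % ENTRIES;
    uint32_t gi = (pc ^ t->history) % ENTRIES;
    int local_ok  = (t->local[li]  >= 2) == taken;
    int global_ok = (t->global[gi] >= 2) == taken;

    /* Move the selector only when exactly one predictor was right. */
    if (local_ok != global_ok)
        t->select[li] = bump(t->select[li], global_ok);

    t->local[li]  = bump(t->local[li], taken);
    t->global[gi] = bump(t->global[gi], taken);
    t->history    = (t->history << 1) | (uint32_t)taken;
}
```

A zero-initialized Tournament predicts "not taken" for every branch; after a branch has been
taken a few times in a row, predict returns 1 for it.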

The Effects of Finite Registers:
        Our ideal processor eliminates all name dependences among register references using
an infinite set of physical registers. To date, the Alpha 21264 has provided the largest number
of extended registers: 41 integer and 41 FP registers, in addition to 32 integer and 32
floating point architectural registers. Note that the reduction in available
parallelism is significant even if 64 additional integer and 64 additional FP registers are
available for renaming. Although register renaming is clearly critical to performance, an
infinite number of registers is not practical.

The Effects of Imperfect Alias Analysis:
       Our optimal model assumes that it can perfectly analyze all memory dependences, as
well as eliminate all register name dependences. Of course, perfect alias analysis is not
possible in practice. The three models are:

1. Global/stack perfect—This model does perfect predictions for global and stack references
and assumes all heap references conflict. This model represents an idealized version of the
best compiler-based analysis schemes currently in production.

2. Inspection—This model examines the accesses to see if they can be determined not to
interfere at compile time. In addition, addresses based on registers that point to different
allocation areas (such as the global area and the stack area) are assumed never to alias.

3. None—All memory references are assumed to conflict.
        In practice, dynamically scheduled processors rely on dynamic memory
disambiguation and are limited by three factors:
1. To implement perfect dynamic disambiguation for a given load, we must know the
memory addresses of all earlier stores that have not yet committed, since a load may have a
dependence through memory on a store.
2. Only a small number of memory references can be disambiguated per clock cycle.
3. The number of the load/store buffers determines how much earlier or later in the
instruction stream a load or store may be moved.
        Both the number of simultaneous disambiguations and the number of the load/store
buffers will affect the clock cycle time.
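
        The first factor can be illustrated with a small check that decides whether a load may
run ahead of the earlier, uncommitted stores. The sketch below is a simplification: it compares
whole addresses only and ignores access sizes and partial overlap. The PendingStore type and
the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int      addr_known;   /* nonzero once the store address has been computed */
    uint64_t addr;         /* meaningful only when addr_known is nonzero */
} PendingStore;

/* Return 1 if the load may safely execute ahead of all n earlier,
   uncommitted stores, and 0 if it has to wait. */
int load_may_bypass(uint64_t load_addr, const PendingStore *stores, size_t n)
{
    assert(stores != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!stores[i].addr_known)
            return 0;   /* unknown address: a conflict cannot be ruled out */
        if (stores[i].addr == load_addr)
            return 0;   /* same address: dependence through memory */
    }
    return 1;
}
```

The loop over the store buffer also hints at the second and third factors: every extra entry
adds another comparison that the hardware must perform within the clock cycle.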
                                ILP software approach:

A) Compiler Techniques:
Basic Pipeline Scheduling and Loop Unrolling:
         To keep a pipeline full, parallelism among instructions must be exploited by finding
sequences of unrelated instructions that can be overlapped in the pipeline. To avoid a pipeline
stall, a dependent instruction must be separated from the source instruction by a distance in
clock cycles equal to the pipeline latency of that source instruction. A compiler’s ability to
perform this scheduling depends both on the amount of ILP available in the program and on
the latencies of the functional units in the pipeline. We look at how the compiler can increase
the amount of available ILP by unrolling loops. We will rely on an example similar to the one
we used in the last chapter, adding a scalar to a vector:
                for(i=1000; i>0; i=i-1)
                         x[i] = x[i] + s;
         The first step is to translate the above segment to MIPS assembly language. In the
following code segment, R1 is initially the address of the element in the array with the
highest address, and F2 contains the scalar value, s. Register R2 is precomputed, so that
8(R2) is the last element to operate on. The straightforward MIPS code, not scheduled for the
pipeline, looks like this:

        Let’s start by seeing how well this loop will run when it is scheduled on a simple
pipeline for MIPS with the latencies from Figure 4.1. Scheduling it well requires moving the
S.D after the DADDUI, which is legal only if the S.D offset is adjusted to compensate for the
decremented pointer. A smarter compiler, capable of limited symbolic optimization, could
figure out this relationship and perform the interchange. The chain of dependent instructions
from the L.D to the ADD.D and then to the S.D determines the clock cycle count for this loop.
This chain must take at least 6 cycles because of dependences and pipeline latencies.
        In the above example, we complete one loop iteration and store back one array
element every 6 clock cycles, but the actual work of operating on the array element takes just
3 (the load, add, and store) of those 6 clock cycles. The remaining 3 clock cycles consist of
loop overhead—the DADDUI and BNE—and a stall. To eliminate these 3 clock cycles we
need to get more operations within the loop relative to the number of overhead instructions.
        A simple scheme for increasing the number of instructions relative to the branch and
overhead instructions is loop unrolling. Unrolling simply replicates the loop body multiple
times, adjusting the loop termination code. Loop unrolling can also be used to improve
scheduling. Because it eliminates the branch, it allows instructions from different iterations to
be scheduled together.
        In real programs we do not usually know the upper bound on the loop. Suppose it is n,
and we would like to unroll the loop to make k copies of the body. Instead of a single
unrolled loop, we generate a pair of consecutive loops. The first executes (n mod k) times
and has a body that is the original loop. The second is the unrolled body surrounded by an
outer loop that iterates (n/k) times. For large values of n, most of the execution time will be
spent in the unrolled loop body.
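
        The pair of loops described above can be written at the source level as follows. This
is a sketch with k fixed at 4, and it assumes a zero-based array traversed upward rather than
the downward-counting loop of the running example. The function name is ours.

```c
#include <assert.h>
#include <stddef.h>

#define K 4   /* unrolling factor k, chosen here for illustration */

/* Add the scalar s to x[0..n-1] using a loop unrolled K times. */
void add_scalar_unrolled(double *x, size_t n, double s)
{
    assert(x != NULL || n == 0);
    size_t i = 0;

    /* First loop: the original body, executed (n mod K) times. */
    for (; i < n % K; i++)
        x[i] = x[i] + s;

    /* Second loop: the unrolled body, executed (n / K) times. */
    for (; i < n; i += K) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```

With n = 10 the first loop runs twice and the second loop runs twice, so the loop test and
index update are executed 4 times in total instead of 10.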
        In the above example, unrolling improves the performance of this loop by eliminating
overhead instructions, although it increases code size substantially. How will the unrolled
loop perform when it is scheduled for the pipeline described earlier?
        The gain from scheduling on the unrolled loop is even larger than on the original loop.
This increase arises because unrolling the loop exposes more computation that can be
scheduled to minimize the stalls; the code above has no stalls. Scheduling the loop in this
fashion necessitates realizing that the loads and stores are independent and can be
interchanged.

Summary of the Loop Unrolling and Scheduling Example:
        To obtain the final unrolled code we had to make the following decisions and
transformations:
1. Determine that it was legal to move the S.D after the DADDUI and BNE, and find the
amount to adjust the S.D offset.
2. Determine that unrolling the loop would be useful by finding that the loop iterations
were independent, except for the loop maintenance code.
3. Use different registers to avoid unnecessary constraints that would be forced by using the
same registers for different computations.
4. Eliminate the extra test and branch instructions and adjust the loop termination and
iteration code.
5. Determine that the loads and stores in the unrolled loop can be interchanged by
observing that the loads and stores from different iterations are independent.
6. Schedule the code, preserving any dependences needed to yield the same result as the
original code.
        The key requirement underlying all of these transformations is an understanding of
how an instruction depends on another and how the instructions can be changed or reordered
given the dependences. Before examining how these techniques work for higher issue rate
pipelines, let us examine how the loop unrolling and scheduling techniques affect data
dependences.
        Here is the unrolled but unoptimized code with the extra DADDUI instructions, but
without the branches. The arrows show the data dependences that are within the unrolled
body and involve the DADDUI instructions. The underlined registers are the dependent uses.
        As the arrows show, the DADDUI instructions form a dependent chain that involves
the DADDUI, L.D, and S.D instructions. This chain forces the body to execute in order, as
well as making the DADDUI instructions necessary, which increases the instruction count.
        There are three different types of limits to the gains that can be achieved by loop
unrolling: a decrease in the amount of overhead amortized with each unroll, code size
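limitations, and compiler limitations. First, when we unrolled the loop four times, it generated
sufficient parallelism among the instructions that the loop could be scheduled with no stall
cycles. Further unrolling can therefore only amortize the remaining loop overhead, and each
additional unroll saves less of it.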
        A second limit to unrolling is the growth in code size that results. For larger loops,
the code size growth may be a concern either in the embedded space where memory may be
at a premium or if the larger code size causes an increase in the instruction cache miss rate.
        Another factor often more important than code size is the potential shortfall in
registers that is created by aggressive unrolling and scheduling. This secondary effect that
results from instruction scheduling in large code segments is called register pressure. It
arises because scheduling code to increase ILP causes the number of live values to increase.
After aggressive instruction scheduling, it may not be possible to allocate all the live values to
registers. The transformed code, while theoretically faster, may lose some or all of its
advantage, because it generates a shortage of registers. Without unrolling, aggressive
scheduling is sufficiently limited by branches so that register pressure is rarely a problem.

B) Static Branch Prediction:
        Static branch predictors are sometimes used in processors where the expectation is
that branch behavior is highly predictable at compile-time. Delayed branches expose a
pipeline hazard so that the compiler can reduce the penalty associated with the hazard. As we
saw, the effectiveness of this technique partly depends on whether we correctly guess which
way a branch will go. Being able to accurately predict a branch at compile time is also helpful
for scheduling data hazards. Consider the following code segment:

        The dependence of the DSUBU and BEQZ on the LD instruction means that a stall
will be needed after the LD. Suppose we knew that this branch was almost always taken and
that the value of R7 was not needed on the fall-through path. Then we could increase the
speed of the program by moving the instruction DADD R7,R8,R9 to the position after the
LD. Correspondingly, if we knew the branch was rarely taken and that the value of R4 was
not needed on the taken path, then we could contemplate moving the OR instruction after the
LD.
        There are several different methods to statically predict branch behavior. The simplest
scheme is to predict a branch as taken. This scheme has an average misprediction rate that is
equal to the untaken branch frequency, which for the SPEC programs is 34%. Unfortunately,
the misprediction rate ranges from not very accurate (59%) to highly accurate (9%).
        A better alternative is to predict on the basis of branch direction, choosing backward-
going branches to be taken and forward-going branches to be not taken. For some programs
and compilation systems, the frequency of forward taken branches may be significantly less
than 50%, and this scheme will do better than just predicting all branches as taken. In the
SPEC programs, however, more than half of the forward-going branches are taken. Hence,
predicting all branches as taken is the better approach.
        A still more accurate technique is to predict branches on the basis of profile
information collected from earlier runs. The key observation that makes this worthwhile is
that the behavior of branches is often bimodally distributed; that is, an individual branch is
often highly biased toward taken or untaken.
        On average, the predict-taken strategy has 20 instructions per mispredicted branch and
the profile-based strategy has 110. Static branch behavior is useful for scheduling instructions
when the branch delays are exposed by the architecture (either delayed or canceling
branches), for assisting dynamic predictors, and for determining which code paths are more
frequent, which is a key step in code scheduling.
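
        The difference between predict-taken and profile-based prediction can be made concrete
with a small calculation over per-branch counts. The sketch below assumes a profile that
records, for each static branch, how often it was taken and not taken; the BranchProfile type
and the function names are hypothetical. Note that it evaluates the profile-based predictor on
the same run that produced the profile, which is the most favorable case.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned long taken;       /* times the branch was taken in the profile run */
    unsigned long not_taken;   /* times the branch fell through */
} BranchProfile;

/* Mispredictions when every branch is predicted taken. */
unsigned long mispredicts_always_taken(const BranchProfile *b, size_t n)
{
    assert(b != NULL || n == 0);
    unsigned long m = 0;
    for (size_t i = 0; i < n; i++)
        m += b[i].not_taken;
    return m;
}

/* Mispredictions when each branch is fixed to its more frequent direction. */
unsigned long mispredicts_profile(const BranchProfile *b, size_t n)
{
    assert(b != NULL || n == 0);
    unsigned long m = 0;
    for (size_t i = 0; i < n; i++)
        m += (b[i].taken < b[i].not_taken) ? b[i].taken : b[i].not_taken;
    return m;
}
```

For three made-up branches with (taken, not taken) counts of (90, 10), (5, 95), and (50, 50),
predict-taken mispredicts 155 times while the profile-based choice mispredicts 65 times. The
gain comes entirely from the strongly biased second branch, which matches the observation that
individual branches tend to be bimodally distributed.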

C) The VLIW Approach:
         Superscalar processors decide on the fly how many instructions to issue. A statically
scheduled superscalar must check for any dependence between instructions in the issue
packet as well as between any issue candidate and any instruction already in the pipeline. A
statically scheduled superscalar requires significant compiler assistance to achieve good
performance. In contrast, a dynamically-scheduled superscalar requires less compiler
assistance, but has significant hardware costs.
         An alternative to the superscalar approach is to rely on compiler technology not only
to minimize the potential data hazard stalls, but to actually format the instructions in a
potential issue packet so that the hardware need not check explicitly for dependences. Such
an approach offers the potential advantage of simpler hardware while still exhibiting good
performance through extensive compiler optimization.
         The first multiple-issue processors that required the instruction stream to be
explicitly organized to avoid dependences used wide instructions with multiple operations per
instruction. For this reason, this architectural approach was named VLIW, standing for Very
Long Instruction Word, and denoting that the instructions, since they contained several
operations, were very wide (64 to 128 bits, or more). Early VLIWs were quite rigid in their
instruction formats and effectively required recompilation of programs for different versions
of the hardware.
         To reduce this inflexibility and enhance performance of the approach, several
innovations have been incorporated into more recent architectures of this type. This second
generation of VLIW architectures is the approach being pursued for desktop and server
markets.

The Basic VLIW Approach:
        VLIWs use multiple, independent functional units. Rather than attempting to issue
multiple, independent instructions to the units, a VLIW packages the multiple operations into
one very long instruction. Since the burden for choosing the instructions to be issued
simultaneously falls on the compiler, the hardware in a superscalar to make these issue
decisions is unneeded.
        Because VLIW approaches make sense for wider processors, we choose to focus our
example on such an architecture. For example, a VLIW processor might have instructions that
contain five operations, including: one integer operation (which could also be a branch), two
floating-point operations, and two memory references. The instruction would have a set of
fields for each functional unit, perhaps 16 to 24 bits per unit, yielding an instruction length
of between 80 and 120 bits.
        To keep the functional units busy, there must be enough parallelism in a code
sequence to fill the available operation slots. This parallelism is uncovered by unrolling loops
and scheduling the code within the single larger loop body. If the unrolling generates
straight-line code, then local scheduling techniques, which operate on a single basic block, can
be used. If finding and exploiting the parallelism requires scheduling code across branches, a
substantially more complex global scheduling algorithm must be used.
        For the original VLIW model, there are both technical and logistical problems. The
technical problems are the increase in code size and the limitations of lock-step
operation. Two different elements combine to increase code size substantially for a VLIW.
First, generating enough operations in a straight-line code fragment requires ambitiously
unrolling loops thereby increasing code size. Second, whenever instructions are not full, the
unused functional units translate to wasted bits in the instruction encoding.
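
        The second source of code growth is easy to quantify. The sketch below computes the
fraction of operation slots left empty in a sequence of VLIW instructions, assuming the
five-operation format used as the example above and a fixed-width encoding with no compression.
The function name and the input format are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 5   /* operations per instruction, as in the example format */

/* Fraction of operation slots left empty across n VLIW instructions.
   used[i] is the number of useful operations in instruction i. */
double wasted_slot_fraction(const int *used, size_t n)
{
    assert(used != NULL && n > 0);
    long filled = 0;
    for (size_t i = 0; i < n; i++) {
        assert(used[i] >= 0 && used[i] <= SLOTS);
        filled += used[i];
    }
    long total = (long)n * SLOTS;
    return (double)(total - filled) / (double)total;
}
```

For four instructions holding 5, 3, 2, and 5 operations, 5 of the 20 slots are empty, so a
quarter of the encoded bits carry no useful work.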
        To combat this code size increase, clever encodings are sometimes used. Another
technique is to compress the instructions in main memory and expand them when they are
read into the cache or are decoded. Early VLIWs operated in lock-step; there was no hazard
detection hardware at all. This structure dictated that a stall in any functional unit pipeline
must cause the entire processor to stall, since all the functional units must be kept
synchronized.
        In more recent processors, the functional units operate more independently, and the
compiler is used to avoid hazards at issue time, while hardware checks allow for
unsynchronized execution once instructions are issued. Binary code compatibility has also
been a major logistical problem for VLIWs. In a strict VLIW approach, the code sequence
makes use of both the instruction set definition and the detailed pipeline structure, including
both functional units and their latencies. Thus, different numbers of functional units and unit
latencies require different versions of the code. This requirement makes migrating between
successive implementations, or between implementations with different issue widths, more
difficult than it is for a superscalar design.
        One possible solution to this migration problem, and the problem of binary code
compatibility in general, is object-code translation or emulation. The major challenge for all
multiple-issue processors is to try to exploit large amounts of ILP. The potential advantages
of a multiple-issue processor versus a vector processor are twofold. First, a multiple-issue
processor has the potential to extract some amount of parallelism from less regularly
structured code, and, second, it has the ability to use a more conventional, and typically less
expensive, cache-based memory system.

D) Hardware versus Software Solutions:
        The hardware-intensive approaches to speculation and the software approaches
provide alternative approaches to exploiting ILP. Some of the tradeoffs, and the limitations,
for these approaches are listed below:
1. To speculate extensively, we must be able to disambiguate memory references. This
capability is difficult to do at compile time for integer programs that contain pointers. In a
hardware-based scheme, dynamic runtime disambiguation of memory addresses is done using
the techniques we saw earlier for Tomasulo’s algorithm. This disambiguation allows us to
move loads past stores at runtime.
2. Hardware-based speculation works better when control flow is unpredictable, and when
hardware-based branch prediction is superior to software-based branch prediction done at
compile time.
3. Hardware-based speculation maintains a completely precise exception model even for
speculated instructions. Recent software-based approaches have added special support to
allow this as well.
4. Hardware-based speculation does not require compensation or bookkeeping code,
which is needed by ambitious software speculation mechanisms.
5. Compiler-based approaches may benefit from the ability to see further in the code
sequence, resulting in better code scheduling than a purely hardware-driven approach.
6. Hardware-based speculation with dynamic scheduling does not require different code
sequences to achieve good performance for different implementations of an architecture.
      Against these advantages stands a major disadvantage: supporting speculation in
hardware is complex and requires additional hardware resources.

E) Hardware Support for Exposing More Parallelism at Compile-Time:
        Techniques such as loop unrolling, software pipelining, and trace scheduling can be
used to increase the amount of parallelism available when the behavior of branches is fairly
predictable at compile time. When the behavior of branches is not well known, compiler
techniques alone may not be able to uncover much ILP. In such cases, the control
dependences may severely limit the amount of parallelism that can be exploited. Similarly,
potential dependences between memory reference instructions could prevent code movement
that would increase available ILP.
        Several hardware techniques can help overcome such limitations. The first is an
extension of the instruction set to include conditional or predicated instructions. Such
instructions can be used to eliminate branches, converting a control dependence into a data
dependence and potentially improving performance.
        Hardware speculation with in-order commit preserved exception behavior by
detecting and raising exceptions only at commit time when the instruction was no longer
speculative. To enhance the ability of the compiler to speculatively move code over
branches, while still preserving the exception behavior, we consider several different
methods, which include either explicit checks for exceptions or techniques to ensure that
only those exceptions that should arise are generated.
        Finally, the hardware speculation schemes of the last chapter provided support for
reordering loads and stores, by checking for potential address conflicts at runtime. To allow
the compiler to reorder loads and stores when it suspects they do not conflict, but cannot be
absolutely certain, a mechanism for checking for such conflicts can be added to the hardware.

Conditional or Predicated Instructions:
        The concept behind conditional instructions is quite simple: An instruction refers to a
condition, which is evaluated as part of the instruction execution. If the condition is true, the
instruction is executed normally; if the condition is false, the execution continues as if the
instruction was a no-op. Many newer architectures include some form of conditional
instructions. The most common example of such an instruction is conditional move, which
moves a value from one register to another if the condition is true. Such an instruction can be
used to completely eliminate a branch in simple code sequences.
        Conditional moves are used to change control dependence into data dependence. This
enables us to eliminate the branch and possibly improve the pipeline behavior. As issue rates
increase, designers are faced with one of two choices: execute multiple branches per clock
cycle or find a method to eliminate branches to avoid this requirement. Handling multiple
branches per clock is complex, since one branch must be control dependent on the other. The
difficulty of accurately predicting two branch outcomes, updating the prediction tables, and
executing the correct sequence, has so far caused most designers to avoid processors that
execute multiple branches per clock. Conditional moves and predicated instructions provide a
way of reducing the branch pressure. In addition, a conditional move can often eliminate a
branch that is hard to predict, increasing the potential gain.
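
        The effect of a conditional move can be modeled in C. The sketch below shows the
statement "if (a == 0) s = t;" first with a branch and then in a branch-free form where the
result is selected by a mask computed from the condition, which is how a control dependence
becomes a data dependence. This models the semantics only; whether a compiler emits an actual
conditional-move instruction for either version depends on the compiler and the target. The
function names are ours.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form of: if (a == 0) s = t; */
uint32_t select_with_branch(uint32_t a, uint32_t s, uint32_t t)
{
    if (a == 0u)
        s = t;
    return s;
}

/* Conditional-move form: both values are available, and the condition
   only selects which one is written. No branch is needed. */
uint32_t select_with_cmov(uint32_t a, uint32_t s, uint32_t t)
{
    uint32_t mask = 0u - (uint32_t)(a == 0u);   /* all ones if a == 0, else 0 */
    assert(mask == 0u || mask == UINT32_MAX);
    return (t & mask) | (s & ~mask);
}
```

Both functions return t when a is zero and s otherwise; in the second one the new value of s
depends on a only through data, so there is nothing for the branch predictor to get wrong.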
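        Conditional moves are the simplest form of predicated instruction and, although useful
for short sequences, have limitations. In particular, using conditional move to eliminate
branches that guard the execution of large blocks of code can be inefficient, since many
conditional moves may need to be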
introduced. To remedy the inefficiency of using conditional moves, some architectures
support full predication, whereby the execution of all instructions is controlled by a predicate.
When the predicate is false, the instruction becomes a no-op. Full predication allows us to
simply convert large blocks of code that are branch dependent.
        Correct code generation and the conditional execution of predicated instructions
ensure that we maintain the data flow enforced by the branch. To ensure that the exception
behavior is also maintained, a predicated instruction must not generate an exception if the
predicate is false.
        The major complication in implementing predicated instructions is deciding when to
annul an instruction. Predicated instructions may either be annulled during instruction issue
or later in the pipeline before they commit any results or raise an exception. If predicated
instructions are annulled early in the pipeline, the value of the controlling condition must be
known early to prevent a stall for a data hazard. Hence, all existing processors annul
instructions later in the pipeline, which means that annulled instructions will consume
functional unit resources and potentially have a negative impact on performance.
        The usefulness of conditional instructions is also limited by several factors:
1. Predicated instructions that are annulled (i.e., whose conditions are false) still take some
processor resources. An annulled predicated instruction requires fetch resources at a
minimum, and in most processors functional unit execution time. Therefore, moving an
instruction across a branch and making it conditional will slow the program down whenever
the moved instruction would not have been normally executed.
2. Predicated instructions are most useful when the predicate can be evaluated early. If the
condition evaluation and predicated instructions cannot be separated (because of data
dependences in determining the condition), then a conditional instruction may result in a stall
for a data hazard. With branch prediction and speculation, such stalls can be avoided, at least
when the branches are predicted accurately.
3. The use of conditional instructions can be limited when the control flow involves more
than a simple alternative sequence.
4. Conditional instructions may have some speed penalty compared with unconditional
instructions.

Compiler Speculation with Hardware Support:
       In many cases, we would like to move speculated instructions not only before the branch,
but before the condition evaluation, and predication cannot achieve this. To speculate
ambitiously requires three capabilities:
1. The ability of the compiler to find instructions that, with the possible use of register
renaming, can be speculatively moved and not affect the program data flow.
2. The ability to ignore exceptions in speculated instructions, until we know that such
exceptions should really occur, and
3. The ability to speculatively interchange loads and stores, or stores and stores, which
may have address conflicts.
       The first of these is a compiler capability, while the last two require hardware
support.

Hardware Support for Preserving Exception Behavior:
        To speculate ambitiously, we must be able to move any type of instruction and still
preserve its exception behavior. There are four methods that have been investigated for
supporting more ambitious speculation without introducing erroneous exception behavior:
1. The hardware and operating system cooperatively ignore exceptions for speculative
instructions.
2. Speculative instructions that never raise exceptions are used, and checks are introduced
to determine when an exception should occur.
3. A set of status bits, called poison bits, are attached to the result registers written by
speculated instructions when the instructions cause exceptions. The poison bits cause a fault
when a normal instruction attempts to use the register.
4. A mechanism is provided to indicate that an instruction is speculative and the
hardware buffers the instruction result until it is certain that the instruction is no longer
speculative.
        Exceptions that can be resumed can be accepted and processed for speculative
instructions just as if they were normal instructions. The cost of these exceptions may be
high, however, and some processors use hardware support to avoid taking such exceptions.
Exceptions that indicate a program error should not occur in correct programs, and the result
of a program that gets such an exception is not well defined, except perhaps when the
program is running in a debugging mode.
        In the simplest method for preserving exceptions, the hardware and the operating
system simply handle all resumable exceptions when the exception occurs and simply return
an undefined value for any exception that would cause termination. If the instruction
generating the terminating exception was not speculative, then the program is in error. If the
instruction generating the terminating exception is speculative, then the program may be
correct and the speculative result will simply be unused; thus, returning an undefined value
for the instruction cannot be harmful. This scheme can never cause a correct program to fail,
no matter how much speculation is done. An incorrect program, which formerly might have
received a terminating exception, will get an incorrect result. In such a scheme, it is not
necessary to know that an instruction is speculative.
        A second approach to preserving exception behavior when speculating introduces
speculative versions of instructions that do not generate terminating exceptions and
instructions to check for such exceptions. This combination preserves the exception behavior
exactly.
        A third approach for preserving exception behavior tracks exceptions as they occur
but postpones any terminating exception until a value is actually used, preserving the
occurrence of the exception, although not in a completely precise fashion. The scheme is
simple: A poison bit is added to every register and another bit is added to every instruction to
indicate whether the instruction is speculative. The poison bit of the destination register is set
whenever a speculative instruction results in a terminating exception; all other exceptions are
handled immediately. If a speculative instruction uses a register with a poison bit turned on,
the destination register of the instruction simply has its poison bit turned on. If a normal
instruction attempts to use a register source with its poison bit turned on, the instruction
causes a fault. Since poison bits exist only on register values and not memory values, stores
are never speculative and thus trap if either operand is “poison.”
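
The poison-bit rules above can be sketched as a small register-file model. This is only an illustration under stated assumptions: the `RegFile` type, the `exec_div` function, the use of division by zero as the terminating exception, and the choice to clear the poison bit on a successful write are all hypothetical and not taken from any real processor.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 8

/* Hypothetical register file: every register has a value and a poison bit. */
typedef struct {
    long value[NUM_REGS];
    bool poison[NUM_REGS];
} RegFile;

/* Outcome of executing one instruction in this toy model. */
typedef enum { EXEC_OK, EXEC_FAULT } ExecStatus;

/* Execute dst = src1 / src2. Division by zero stands in for a
 * terminating exception. The speculative flag models the extra
 * bit carried by every instruction. */
ExecStatus exec_div(RegFile *rf, int dst, int src1, int src2, bool speculative)
{
    assert(dst >= 0 && dst < NUM_REGS);
    assert(src1 >= 0 && src1 < NUM_REGS);
    assert(src2 >= 0 && src2 < NUM_REGS);

    bool src_poisoned = rf->poison[src1] || rf->poison[src2];
    bool would_trap = !src_poisoned && rf->value[src2] == 0;

    if (src_poisoned || would_trap) {
        if (speculative) {
            rf->poison[dst] = true;  /* postpone the exception */
            return EXEC_OK;
        }
        return EXEC_FAULT;           /* normal instruction faults now */
    }

    rf->value[dst] = rf->value[src1] / rf->value[src2];
    rf->poison[dst] = false;         /* assumption: a good write clears poison */
    return EXEC_OK;
}
```

A speculative divide by zero only marks its destination as poisoned, a later speculative use propagates the mark, and the fault is raised when a normal instruction finally reads the poisoned register.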
        The fourth and final approach listed above relies on a hardware mechanism that
operates like a reorder buffer. In such an approach, instructions are marked by the compiler
as speculative and include an indicator of how many branches the instruction was
speculatively moved across and what branch action (taken/not taken) the compiler assumed.
This last piece of information basically tells the hardware the location of the code block
where the speculated instruction originally was.
        All instructions are placed in a reorder buffer when issued and are forced to commit in
order, as in a hardware speculation approach. The reorder buffer tracks when instructions are
ready to commit and delays the “write back” portion of any speculative instruction.
Speculative instructions are not allowed to commit until the branches they have been
speculatively moved over are also ready to commit, or, alternatively, until the corresponding
sentinel is reached. At that point, we know whether the speculated instruction should have
been executed or not. If it should have been executed and it generated a terminating
exception, then we know that the program should be terminated. If the instruction should not
have been executed, then the exception can be ignored.

Hardware Support for Memory Reference Speculation:
        Moving loads across stores is usually done when the compiler is certain the addresses
do not conflict. To allow the compiler to undertake such code motion, when it cannot be
absolutely certain that such a movement is correct, a special instruction to check for address
conflicts can be included in the architecture. The special instruction is left at the original
location of the load instruction and the load is moved up across one or more stores.
        When a speculated load is executed, the hardware saves the address of the accessed
memory location. If a subsequent store changes the location before the check instruction, then
the speculation has failed. If the location has not been touched then the speculation is
successful. Speculation failure can be handled in two ways. If only the load instruction was
speculated, then it suffices to redo the load at the point of the check instruction (which could
supply the target register in addition to the memory address). If additional instructions that
depended on the load were also speculated, then a fix-up sequence that re-executes all the
speculated instructions starting with the load is needed. In this case, the check instruction
specifies the address where the fix-up code is located.
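
The check mechanism described above can be modeled in software as a small table of speculated load addresses. The sketch below is an assumption-laden illustration, not a description of real hardware: the `SpecTable` type, the `tag` argument that pairs a speculated load with its check, and all function names are hypothetical. It covers only the simple case in which the load alone was speculated, so a failed check just redoes the load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TABLE_SIZE 4

/* Hypothetical table of addresses read by speculated loads. */
typedef struct {
    const long *addr[TABLE_SIZE];
    bool valid[TABLE_SIZE];
} SpecTable;

/* Speculated load: read the value and remember the address in slot tag. */
long spec_load(SpecTable *t, int tag, const long *addr)
{
    assert(tag >= 0 && tag < TABLE_SIZE);
    t->addr[tag] = addr;
    t->valid[tag] = true;
    return *addr;
}

/* Store: write memory and invalidate any speculated load of that address. */
void store(SpecTable *t, long *addr, long value)
{
    *addr = value;
    for (int i = 0; i < TABLE_SIZE; i++)
        if (t->valid[i] && t->addr[i] == addr)
            t->valid[i] = false;
}

/* Check instruction, left at the load's original position. If an
 * intervening store touched the address, redo the load into *reg.
 * Returns true when the speculation succeeded. */
bool check_load(SpecTable *t, int tag, const long *addr, long *reg)
{
    assert(tag >= 0 && tag < TABLE_SIZE);
    bool ok = t->valid[tag] && t->addr[tag] == addr;
    if (!ok)
        *reg = *addr;        /* speculation failed: reload */
    t->valid[tag] = false;   /* the entry is consumed either way */
    return ok;
}
```

When dependent instructions were speculated along with the load, the reload inside `check_load` would instead become a branch to fix-up code, as the text describes.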

F) Advanced Compiler Support for Exposing and Exploiting ILP:

Detecting and Enhancing Loop-Level Parallelism:
        Loop-level parallelism is normally analyzed at the source level or close to it, while
most analysis of ILP is done once instructions have been generated by the compiler. Loop-
level analysis involves determining what dependences exist among the operands in a loop
across the iterations of that loop. The analysis of loop-level parallelism focuses on
determining whether data accesses in later iterations are dependent on data values produced
in earlier iterations; such a dependence is called a loop-carried dependence.
        Because finding loop-level parallelism involves recognizing structures such as loops,
array references, and induction variable computations, the compiler can do this analysis more
easily at or near the source level, as opposed to the machine code level.

EXAMPLE: Consider a loop like this one:
              for (i=1; i<=100; i=i+1) {
                  A[i+1] = A[i] + C[i]; /* S1 */
                  B[i+1] = B[i] + A[i+1]; /* S2 */
              }
Assume that A, B, and C are distinct, nonoverlapping arrays. What are the data dependences
among the statements S1 and S2 in the loop?

ANSWER: There are two different dependences:
1. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1],
which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
2. S2 uses the value, A[i+1], computed by S1 in the same iteration.
        Because the dependence of statement S1 is on an earlier iteration of S1, this dependence
is loop-carried and forces successive iterations of the loop to execute in series. The second
dependence above (S2 depending on S1) is within an iteration and is not loop-carried. Thus, if
this were the only dependence, multiple iterations of the loop could execute in parallel, as
long as each pair of statements in an iteration were kept in order.
        It is also possible to have a loop-carried dependence that does not prevent parallelism,
as the next example shows.
EXAMPLE: Consider a loop like this one:
               for (i=1; i<=100; i=i+1) {
                   A[i] = A[i] + B[i]; /* S1 */
                   B[i+1] = C[i] + D[i]; /* S2 */
               }
What are the dependences between S1 and S2? Is this loop parallel? If not, show how to
make it parallel.

ANSWER: Statement S1 uses the value assigned in the previous iteration by statement
S2, so there is a loop-carried dependence between S2 and S1. Despite this loop-carried
dependence, this loop can be made parallel. Unlike the earlier loop, this dependence is not
circular: neither statement depends on itself, and although S1 depends on S2, S2 does not
depend on S1. A loop is parallel if it can be written without a cycle in the dependences,
since the absence of a cycle means that the dependences give a partial ordering on the
statements. Although there are no circular dependences in the above loop, it must be
transformed to conform to the partial ordering and expose the parallelism. Two observations
are critical to this transformation:
1. There is no dependence from S1 to S2. If there were, then there would be a cycle in the
dependences and the loop would not be parallel. Since this other dependence is absent,
interchanging the two statements will not affect the execution of S2.
2. On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior
to initiating the loop.
         These two observations allow us to replace the loop above with the following code:
                 A[1] = A[1] + B[1];
                 for (i=1; i<=99; i=i+1) {
                     B[i+1] = C[i] + D[i];
                     A[i+1] = A[i+1] + B[i+1];
                 }
                 B[101] = C[100] + D[100];
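
One way to gain confidence in this transformation is to run both versions on the same data and compare the results. The sketch below does that; the function names and the test data are assumptions made for illustration, and the arrays are given 102 elements so that indices 1 through 101 are valid.

```c
#include <assert.h>
#include <string.h>

#define N 100

/* Original loop: S1 in iteration i+1 reads the B[i+1] written by S2
 * in iteration i, which is a loop-carried dependence. */
void loop_original(double A[N + 2], double B[N + 2],
                   const double C[N + 2], const double D[N + 2])
{
    for (int i = 1; i <= N; i = i + 1) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop: the dependence now stays inside one iteration. */
void loop_transformed(double A[N + 2], double B[N + 2],
                      const double C[N + 2], const double D[N + 2])
{
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i = i + 1) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];
}

/* Run both versions on identical inputs; return 1 if A and B match. */
int loops_agree(void)
{
    double A1[N + 2], B1[N + 2], A2[N + 2], B2[N + 2];
    double C[N + 2], D[N + 2];
    for (int i = 0; i < N + 2; i++) {
        A1[i] = A2[i] = i;
        B1[i] = B2[i] = 2 * i + 1;
        C[i] = 3 * i;
        D[i] = 100 - i;
    }
    loop_original(A1, B1, C, D);
    loop_transformed(A2, B2, C, D);
    return memcmp(A1, A2, sizeof A1) == 0 &&
           memcmp(B1, B2, sizeof B1) == 0;
}
```

A matching run on one data set is evidence rather than proof; the dependence argument in the text is what justifies the rewrite.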
         Our analysis needs to begin by finding all loop-carried dependences. Often loop-
carried dependences are in the form of a recurrence:
                 for (i=2; i<=100; i=i+1) {
                     Y[i] = Y[i-1] + Y[i];
                 }
         A recurrence is when a variable is defined based on the value of that variable in an
earlier iteration, often the one immediately preceding, as in the above fragment.

Finding Dependences:
         Finding the dependences in a program is an important part of three tasks: (1) good
scheduling of code, (2) determining which loops might contain parallelism, and (3)
eliminating name dependences. The complexity of dependence analysis arises because of the
presence of arrays and pointers in languages like C or C++ or pass-by-reference parameter
passing in FORTRAN.
         How does the compiler detect dependences in general? Nearly all dependence
analysis algorithms work on the assumption that array indices are affine. In simplest terms, a
one-dimensional array index is affine if it can be written in the form a* i + b, where a and b
are constants, and i is the loop index variable. The index of a multidimensional array is affine
if the index in each dimension is affine.
         Determining whether there is dependence between two references to the same array in
a loop is thus equivalent to determining whether two affine functions can have the same value
for different indices between the bounds of the loop.
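
To make the affine formulation concrete, the sketch below asks whether a write to X[a*i+b] and a read of X[c*j+d] can touch the same element. It shows two approaches: an exhaustive search over the loop bounds, which is only practical for illustration, and the classic GCD test, a conservative check that the text above does not describe and that ignores the loop bounds. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Greatest common divisor of |x| and |y|. */
static long gcd(long x, long y)
{
    x = labs(x);
    y = labs(y);
    while (y != 0) {
        long t = x % y;
        x = y;
        y = t;
    }
    return x;
}

/* GCD test: a*i + b == c*j + d has an integer solution only if
 * gcd(a, c) divides (d - b). Returns false when a dependence is
 * impossible; true only means a dependence may exist. */
bool gcd_test_may_depend(long a, long b, long c, long d)
{
    long g = gcd(a, c);
    if (g == 0)
        return b == d;   /* both indices are constants */
    return (d - b) % g == 0;
}

/* Exhaustive check: do two different iterations i != j within
 * lo..hi reference the same element? */
bool exact_depends(long a, long b, long c, long d, long lo, long hi)
{
    for (long i = lo; i <= hi; i++)
        for (long j = lo; j <= hi; j++)
            if (i != j && a * i + b == c * j + d)
                return true;
    return false;
}
```

For a write to X[2*i+3] and a read of X[2*i], the GCD test rules out any dependence because 2 does not divide 3. For a write to X[i] and a read of X[i+200] with bounds 1 to 100, the GCD test answers "maybe" while the exhaustive check shows that the bounds exclude the dependence.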
        In addition to detecting the presence of dependence, a compiler wants to classify the
type of dependence. This classification allows a compiler to recognize name dependences and
eliminate them at compile time by renaming and copying.

EXAMPLE: The following loop has multiple types of dependences. Find all the true
dependences, output dependences, and antidependences, and eliminate the output
dependences and antidependences by renaming.
                for (i=1; i<=100; i=i+1) {
                    Y[i] = X[i] / c; /*S1*/
                    X[i] = X[i] + c; /*S2*/
                    Z[i] = Y[i] + c; /*S3*/
                    Y[i] = c - Y[i]; /*S4*/
                }
ANSWER:          The following dependences exist among the four statements:
1. There are true dependences from S1 to S3 and from S1 to S4 because of Y[i]. These are
not loop carried, so they do not prevent the loop from being considered parallel. These
dependences will force S3 and S4 to wait for S1 to complete.
2. There is an antidependence from S1 to S2, based on X[i].
3. There is an antidependence from S3 to S4 for Y[i].
4. There is an output dependence from S1 to S4, based on Y[i].
The following version of the loop eliminates these false (or pseudo) dependences.
                for (i=1; i<=100; i=i+1) {
                    /* Y renamed to T to remove output dependence*/
                    T[i] = X[i] / c;
                    /* X renamed to X1 to remove antidependence*/
                    X1[i] = X[i] + c;
                    /* Y renamed to T to remove antidependence */
                    Z[i] = T[i] + c;
                    Y[i] = c - T[i];
                }
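
The renaming can be checked by running both loops on the same input. In the sketch below, the function names and test values are assumptions chosen for illustration. Note that after renaming, the updated X values live in X1, so the comparison reads X from the original run against X1 from the renamed run.

```c
#include <assert.h>

#define N 100

/* Original loop, with true, anti-, and output dependences. */
void loop_with_name_deps(double X[N + 1], double Y[N + 1],
                         double Z[N + 1], double c)
{
    for (int i = 1; i <= N; i = i + 1) {
        Y[i] = X[i] / c;    /* S1 */
        X[i] = X[i] + c;    /* S2 */
        Z[i] = Y[i] + c;    /* S3 */
        Y[i] = c - Y[i];    /* S4 */
    }
}

/* Renamed loop: only the true dependences on T remain. */
void loop_renamed(const double X[N + 1], double X1[N + 1],
                  double Y[N + 1], double Z[N + 1],
                  double T[N + 1], double c)
{
    for (int i = 1; i <= N; i = i + 1) {
        T[i] = X[i] / c;
        X1[i] = X[i] + c;
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }
}

/* Return 1 if both loops produce the same X (as X1), Y, and Z. */
int renaming_preserves_results(void)
{
    double X[N + 1], Y[N + 1], Z[N + 1];
    double Xr[N + 1], X1[N + 1], Yr[N + 1], Zr[N + 1], T[N + 1];
    const double c = 4.0;
    for (int i = 0; i <= N; i++) {
        X[i] = Xr[i] = i;
        Y[i] = Yr[i] = 0.0;
        Z[i] = Zr[i] = 0.0;
        X1[i] = T[i] = 0.0;
    }
    loop_with_name_deps(X, Y, Z, c);
    loop_renamed(Xr, X1, Yr, Zr, T, c);
    for (int i = 1; i <= N; i++)
        if (X[i] != X1[i] || Y[i] != Yr[i] || Z[i] != Zr[i])
            return 0;
    return 1;
}
```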
        Dependence analysis is a critical technology for exploiting parallelism. For detecting
loop-level parallelism, dependence analysis is the basic tool. The major drawback of
dependence analysis is that it applies only under a limited set of circumstances, namely
among references within a single loop nest and using affine index functions. Thus, there is a
wide variety of situations in which array-oriented dependence analysis cannot tell us what
we might want to know, including:
1. when objects are referenced via pointers rather than array indices;
2. when array indexing is indirect through another array, which happens with many
representations of sparse arrays;
3. when a dependence may exist for some value of the inputs, but does not exist in actuality
when the code is run since the inputs never take on those values;
4. when an optimization depends on knowing more than just the possibility of a dependence,
but needs to know on which write of a variable a read of that variable depends.
        To deal with the issue of analyzing programs with pointers, another type of analysis,
often called points-to analysis, is required. The basic approach used in points-to analysis
relies on information from three major sources:
1. Type information, which restricts what a pointer can point to.
2. Information derived when an object is allocated or when the address of an object is taken,
which can be used to restrict what a pointer can point to.
3. Information derived from pointer assignments.
       There are two different types of limitations that affect our ability to do accurate
dependence analysis for large programs. The first type of limitation arises from restrictions
in the analysis algorithms. The second limitation is the need to analyze behavior across
procedure boundaries to get accurate information. This type of analysis, called
interprocedural analysis, is much more difficult and complex than analysis within a single
procedure.

Eliminating Dependent Computations:
        Compilers can reduce the impact of dependent computations so as to achieve more
ILP. The key technique is to eliminate or reduce a dependent computation by back
substitution, which increases the amount of parallelism and sometimes increases the amount
of computation required. These techniques can be applied both within a basic block and
within loops, and we describe them differently.
        Within a basic block, algebraic simplifications of expressions and an optimization
called copy propagation, which eliminates operations that copy values, can be used to simplify
instruction sequences. In some examples, computations are actually eliminated, but it is also
possible that we may want to increase the parallelism of the code, possibly even increasing
the number of operations. Such
optimizations are called tree height reduction, since they reduce the height of the tree
structure representing a computation, making it wider but shorter. Consider the following
code sequence:
               ADD R1,R2,R3
               ADD R4,R1,R6
               ADD R8,R4,R7
        Notice that this sequence requires at least three execution cycles, since all the
instructions depend on the immediate predecessor. By taking advantage of associativity, we
can transform the code and rewrite it as:
               ADD R1,R2,R3
               ADD R4,R6,R7
               ADD R8,R1,R4
        This sequence can be computed in two execution cycles.
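
The same rewrite can be expressed in C. In the sketch below the variable names mirror the registers in the sequences above, and the two height helpers (hypothetical names) count the levels in the dependence tree for n operands, assuming one level per add. The transformation is exact for integers; for floating-point values addition is not associative, so a compiler can only apply it when reassociation is permitted.

```c
#include <assert.h>

/* Serial chain: ((r2 + r3) + r6) + r7. Each add waits for the last. */
long sum_chain(long r2, long r3, long r6, long r7)
{
    long r1 = r2 + r3;
    long r4 = r1 + r6;
    long r8 = r4 + r7;
    return r8;
}

/* Balanced tree: (r2 + r3) + (r6 + r7). The first two adds are
 * independent and can issue in the same cycle. */
long sum_tree(long r2, long r3, long r6, long r7)
{
    long r1 = r2 + r3;
    long r4 = r6 + r7;
    long r8 = r1 + r4;
    return r8;
}

/* Levels of dependent adds needed to sum n operands as a chain. */
int chain_height(int n)
{
    return n - 1;
}

/* Levels needed when the adds form a balanced tree: ceil(log2 n). */
int tree_height(int n)
{
    int h = 0;
    for (int w = 1; w < n; w *= 2)
        h++;
    return h;
}
```

For four operands the chain has height 3 and the tree has height 2, matching the three-cycle and two-cycle sequences above.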

Software Pipelining: Symbolic Loop Unrolling:
        Loop unrolling uncovers parallelism among instructions by creating longer sequences
of straight-line code. Two other important techniques have been developed for this purpose:
software pipelining and trace scheduling.
        Software pipelining is a technique for reorganizing loops such that each iteration in
the software-pipelined code is made from instructions chosen from different iterations of the
original loop. By choosing instructions from different iterations, dependent computations are
separated from one another by an entire loop body, increasing the possibility that the unrolled
loop can be scheduled without stalls.
        A software-pipelined loop interleaves instructions from different iterations without
unrolling the loop, as illustrated in Figure 4.6. This technique is the software counterpart to
what Tomasulo’s algorithm does in hardware.
        Register management in software-pipelined loops can be tricky. We may need to
increase the number of iterations between when we issue an instruction and when the result is
used. This increase is required when there are a small number of instructions in the loop body
and the latencies are large. In such cases, a combination of software pipelining and loop
unrolling is needed.
        Software pipelining can be thought of as symbolic loop unrolling. The major
advantage of software pipelining over straight loop unrolling is that software pipelining
consumes less code space. Loop unrolling reduces the overhead of the loop—the branch and
counter-update code.

EXAMPLE: Show a software-pipelined version of this loop, which increments all the
elements of an array whose starting address is in R1 by the contents of F2:
               Loop: L.D F0,0(R1)
                      ADD.D F4,F0,F2
                      S.D F4,0(R1)
                      DADDUI R1,R1,#-8
                      BNE R1,R2,Loop
You may omit the start-up and clean-up code.

ANSWER: Software pipelining symbolically unrolls the loop and then selects instructions
from each iteration. Since the unrolling is symbolic, the loop overhead instructions (the
DADDUI and BNE) need not be replicated. Here’s the body of the unrolled loop without
overhead instructions, highlighting the instructions taken from each iteration:
                Iteration i:   L.D F0,0(R1)
                               ADD.D F4,F0,F2
                               S.D F4,0(R1)
                Iteration i+1: L.D F0,0(R1)
                               ADD.D F4,F0,F2
                               S.D F4,0(R1)
                Iteration i+2: L.D F0,0(R1)
                               ADD.D F4,F0,F2
                               S.D F4,0(R1)
The selected instructions from different iterations are then put together in the loop with the
loop control instructions:
                Loop: S.D F4,16(R1) ;stores into M[i]
                        ADD.D F4,F0,F2 ;adds to M[i-1]
                        L.D F0,0(R1) ;loads M[i-2]
                        DADDUI R1,R1,#-8
                        BNE R1,R2,Loop
This loop can be run at a rate of 5 cycles per result, ignoring the start-up and clean-up
portions, and assuming that DADDUI is scheduled before the ADD.D and that the L.D instruction,
with an adjusted offset, is placed in the branch delay slot. Software pipelining reduces the
time when the loop is not running at peak speed to once per loop at the beginning and end.
        Figure 4.7 shows this behavior graphically. Because software pipelining and loop
unrolling attack two different types of overhead, the best performance can come from doing
both. In practice, compilation using software pipelining is quite difficult for several
reasons: many loops require significant transformation before they can be software pipelined,
the tradeoffs in terms of overhead versus efficiency of the software-pipelined loop are
complex, and the issue of register management creates additional complexities.
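
The structure of a software-pipelined loop, including the start-up and clean-up code omitted in the example, can be sketched at the source level. The version below is an illustration only: the function names are hypothetical, the loop runs upward through the array rather than downward as in the MIPS code, and the variables f0 and f4 play the role of the registers F0 and F4.

```c
#include <assert.h>

/* Plain loop: load, add, and store within every iteration. */
void add_scalar(double *x, int n, double f2)
{
    for (int i = 0; i < n; i++)
        x[i] = x[i] + f2;
}

/* Software-pipelined version; this sketch requires n >= 2. */
void add_scalar_pipelined(double *x, int n, double f2)
{
    assert(n >= 2);

    /* Start-up code: fill the pipeline. */
    double f0 = x[0];       /* load for iteration 0 */
    double f4 = f0 + f2;    /* add for iteration 0 */
    f0 = x[1];              /* load for iteration 1 */

    /* Kernel: each pass mixes work from three original iterations. */
    int i;
    for (i = 0; i < n - 2; i++) {
        x[i] = f4;          /* store for iteration i */
        f4 = f0 + f2;       /* add for iteration i+1 */
        f0 = x[i + 2];      /* load for iteration i+2 */
    }

    /* Clean-up code: drain the pipeline. */
    x[i] = f4;              /* store for iteration n-2 */
    f4 = f0 + f2;           /* add for iteration n-1 */
    x[i + 1] = f4;          /* store for iteration n-1 */
}
```

Inside the kernel, the store, the add, and the load belong to different original iterations, so none of them waits on a result produced in the same pass.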

FIGURE 4.7 The execution pattern for (a) a software-pipelined loop and (b) an unrolled loop.

Global Code Scheduling:
        The techniques used for loop unrolling work well when the loop body is straight-line
code, since the resulting unrolled loop looks like a single basic block. Similarly, software
pipelining works well when the body is a single basic block, since it is easier to find the
repeatable schedule. When the body of an unrolled loop contains internal control flow,
however, scheduling the code is much more complex. In general, effective scheduling of a
loop body with internal control flow will require moving instructions across branches, which
is global code scheduling.
        Global code scheduling aims to compact a code fragment with internal control
structure into the shortest possible sequence that preserves the data and control dependences.
The data dependences force a partial order on operations, while the control dependences
dictate instructions across which code cannot be easily moved. Data dependences are
overcome by unrolling and, in the case of memory operations, using dependence analysis to
determine if two references refer to the same address. Finding the shortest possible sequence
for a piece of code means finding the shortest sequence for the critical path, which is the
longest sequence of dependent instructions.
        Control dependences arising from loop branches are reduced by unrolling. Global
code scheduling can reduce the effect of control dependences arising from conditional
nonloop branches by moving code. Since moving code across branches will often affect the
frequency of execution of such code, effectively using global code motion requires estimates
of the relative frequency of different paths. Although global code motion cannot guarantee
faster code, if the frequency information is accurate, the compiler can determine whether
such code movement is likely to lead to faster code.
       Global code motion is important since many inner loops contain conditional
statements. Figure 4.8 shows a typical code fragment, which may be thought of as an iteration
of an unrolled loop and highlights the more common control flow.
                FIGURE 4.8 A code fragment and the common path shaded with gray.

        Effectively scheduling this code could require that we move the assignments to B and
C to earlier in the execution sequence, before the test of A. Such global code motion must
satisfy a set of constraints to be legal. In addition, the movement of the code associated with
B, unlike that associated with C, is speculative: it will speed the computation up only when
the path containing the code would be taken.
        To perform the movement of B, we must ensure that neither the data flow nor the
exception behavior is changed. Compilers avoid changing the exception behavior by not
moving certain classes of instructions, such as memory references, that can cause exceptions.
        How can the compiler ensure that the assignments to B and C can be moved without
affecting the data flow? To see what’s involved, let’s look at a typical code generation
sequence for the flowchart in Figure 4.8. Assuming that the addresses for A, B, C are in R1,
R2, and R3, respectively, here is such a sequence:
        LD R4,0(R1) ; load A
        LD R5,0(R2) ; load B
        DADDU R4,R4,R5 ; add to A
        SD R4,0(R1) ; store A
        BNEZ R4,elsepart ; test A
        ... ; then part
        SD ...,0(R2) ; store to B
        J join ; jump over else
        elsepart: ... ; else part
        X ; code for X
        join: ... ; after if
        SD ...,0(R3) ; store C[i]
        Let’s first consider the problem of moving the assignment to B to before the BNEZ
instruction. Call the last instruction to assign to B before the if-statement i. If B is
referenced before it is assigned either in code segment X or after the if-statement, call the
referencing instruction j. If there is such an instruction j, then moving the assignment to B
will change the data flow of the program. In particular, moving the assignment to B will
cause j to become data-dependent on the moved version of the assignment to B rather than on
i, on which j originally depended.
        Moving the assignment to C up to before the first branch requires two steps. First,
the assignment is moved over the join point of the else part into the portion corresponding to
the then part. This movement makes the instructions for C control dependent on the branch
and means that they will not execute if the else path, which is the infrequent path, is chosen.
Hence, instructions that were data-dependent on the assignment to C, and which execute after
this code fragment, will be affected. To ensure the correct value is computed for such
instructions, a copy is made of the instructions that compute and assign to C on the else path.
Second, we can move C from the then part of the branch across the branch condition, if it
does not affect any data flow into the branch condition. If C is moved to before the if-test, the
copy of C in the else branch can usually be eliminated, since it will be redundant.
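
The two-step movement of C can be sketched in C. Everything concrete here is an assumption made for illustration: the `State` type, the fields d and x, and the particular expressions assigned to B and C are invented, since the source elides them. The sketch also assumes that neither the then part nor code X reads C or writes the inputs of C.

```c
#include <assert.h>

/* Hypothetical program state; a, b, c mirror A, B, C in the text. */
typedef struct {
    long a, b, c, d, x;
} State;

/* Original fragment: C is assigned after the join point. */
void fragment_original(State *s)
{
    s->a = s->a + s->b;
    if (s->a == 0) {
        s->b = s->d * 2;    /* then part: invented assignment to B */
    } else {
        s->x = s->x + 1;    /* else part: stands in for code X */
    }
    s->c = s->d + 7;        /* invented assignment to C */
}

/* Step 1: C moves above the join into the then part, and a
 * compensation copy is placed on the else path. */
void fragment_step1(State *s)
{
    s->a = s->a + s->b;
    if (s->a == 0) {
        s->b = s->d * 2;
        s->c = s->d + 7;
    } else {
        s->x = s->x + 1;
        s->c = s->d + 7;    /* compensation copy */
    }
}

/* Step 2: C moves above the branch; the else copy is now redundant. */
void fragment_step2(State *s)
{
    s->c = s->d + 7;
    s->a = s->a + s->b;
    if (s->a == 0) {
        s->b = s->d * 2;
    } else {
        s->x = s->x + 1;
    }
}

static int same_state(const State *p, const State *q)
{
    return p->a == q->a && p->b == q->b && p->c == q->c &&
           p->d == q->d && p->x == q->x;
}

/* Return 1 if all three versions leave the same final state. */
int motion_is_safe(State init)
{
    State s0 = init, s1 = init, s2 = init;
    fragment_original(&s0);
    fragment_step1(&s1);
    fragment_step2(&s2);
    return same_state(&s0, &s1) && same_state(&s0, &s2);
}
```

Checking one input that takes the then path and one that takes the else path exercises both the moved code and the compensation copy.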
        Consider the factors that the compiler would have to weigh in moving the
computation and assignment of B:
1. What are the relative execution frequencies of the then-case and the else-case in the
branch? If the then-case is much more frequent, the code motion may be beneficial. If not, it
is less likely, although not impossible, to consider moving the code.
2. What is the cost of executing the computation and assignment to B above the branch? It
may be that there are a number of empty instruction issue slots in the code above the branch
and that the instructions for B can be placed into these slots that would otherwise go empty.
3. How will the movement of B change the execution time for the then-case? If B is at the
start of the critical path for the then-case, moving it may be highly beneficial.
4. Is B the best code fragment that can be moved above the branch? How does it compare with
moving C or other statements within the then-case?
5. What is the cost of the compensation code that may be necessary for the else-case? How
effectively can this code be scheduled and what is its impact on execution time?

Trace Scheduling: Focusing on the Critical Path:
         Trace scheduling is useful for processors with a large number of issues per clock,
where conditional or predicated execution is inappropriate or unsupported, and where simple
loop unrolling may not be sufficient by itself to uncover enough ILP to keep the processor
busy. Trace scheduling is a way to organize the global code motion process, so as to simplify
the code scheduling by incurring the costs of possible code motion on the less frequent paths.
Because it can generate significant overheads on the designated infrequent path, it is best
used where profile information indicates significant differences in frequency between
different paths and where the profile information is highly indicative of program behavior.
         There are two steps to trace scheduling. The first step, called trace selection, tries
to find a likely sequence of basic blocks whose operations will be put together into a smaller
number of instructions; this sequence is called a trace. Loop unrolling is used to generate
long traces, since loop branches are taken with high probability. Additionally, by using static
branch prediction, other conditional branches are also chosen as taken or not taken, so that
the resultant trace is a straight-line sequence resulting from concatenating many basic blocks.
If, for example, the program fragment shown in Figure 4.8 corresponds to an inner loop with
the highlighted path being much more frequent, and the loop were unwound four times, the
primary trace would consist of four copies of the shaded portion of the program, as shown in
Figure 4.9.
         FIGURE 4.9 This trace is obtained by assuming that the program fragment in Figure
         4.8 is the inner loop and unwinding it four times, treating the shaded portion in
         Figure 4.8 as the likely path.
        Once a trace is selected, the second process, called trace compaction, tries to
squeeze the trace into a small number of wide instructions. Trace compaction is code
scheduling; hence, it attempts to move operations as early as it can in a sequence (trace),
packing the operations into as few wide instructions (or issue packets) as possible.
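
A minimal sketch of the compaction idea is a greedy scheduler that walks the trace in order and places each operation in the earliest wide instruction allowed by its data dependences. The model below is deliberately simplified and entirely hypothetical: the `Op` type and `compact_trace` function are invented, every operation has unit latency and at most two producers, and the branches and compensation code that a real trace scheduler must handle are ignored.

```c
#include <assert.h>

#define MAX_OPS 32

/* Hypothetical trace operation: indices of up to two earlier
 * operations whose results it needs (-1 means no producer). */
typedef struct {
    int dep[2];
} Op;

/* Greedy compaction. Each operation goes into the earliest wide
 * instruction that follows all of its producers and still has a
 * free slot. Stores the chosen instruction index of each operation
 * in cycle_of[] and returns the number of wide instructions used. */
int compact_trace(const Op *ops, int n, int width, int *cycle_of)
{
    assert(n >= 0 && n <= MAX_OPS && width >= 1);
    int used[MAX_OPS] = {0};   /* slots filled per wide instruction */
    int length = 0;

    for (int k = 0; k < n; k++) {
        int earliest = 0;
        for (int d = 0; d < 2; d++) {
            int p = ops[k].dep[d];
            if (p >= 0) {
                assert(p < k);   /* producers precede consumers */
                if (cycle_of[p] + 1 > earliest)
                    earliest = cycle_of[p] + 1;
            }
        }
        int c = earliest;
        while (used[c] >= width)
            c++;                 /* skip full wide instructions */
        used[c]++;
        cycle_of[k] = c;
        if (c + 1 > length)
            length = c + 1;
    }
    return length;
}
```

With an issue width of 2, a six-operation trace containing two independent dependence chains packs into three wide instructions, whereas an issue width of 1 needs six.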
        The advantage of the trace scheduling approach is that it simplifies the decisions
concerning global code motion. In particular, branches are viewed as jumps into or out of the
selected trace, which is assumed to be the most probable path. Although trace scheduling has
been successfully applied to scientific code with its intensive loops and accurate profile data,
it remains unclear whether this approach is suitable for programs that are less simply
characterized and less loop-intensive. In such programs, the significant overheads of
compensation code may make trace scheduling an unattractive approach, or, at best, its
effective use will be extremely complex for the compiler.

        One of the major drawbacks of trace scheduling is that the entries and exits into the
middle of the trace cause significant complications requiring the compiler to generate and
track the compensation code and often making it difficult to assess the cost of such code.
Superblocks are formed by a process similar to that used for traces, but are a form of
extended basic blocks, which are restricted to have a single entry point but allow multiple
exits.
        Because superblocks have only a single entry point, compacting a superblock is easier
than compacting a trace since only code motion across an exit need be considered. In our
earlier example, we would form a superblock that did not contain any entrances and hence,
moving C would be easier. Furthermore, in loops that have a single loop exit based on a
count (for example, a for-loop with no loop exit other than the loop termination condition),
the resulting superblocks have only one exit as well as one entrance. Such blocks can then be
scheduled more easily.
        How can a superblock with only one entrance be constructed? The answer is to use
tail duplication to create a separate block that corresponds to the portion of the trace after the
entry. In our example above, each unrolling of the loop would create an exit from the
superblock to a residual loop that handles the remaining iterations. Figure 4.10 shows the
superblock structure if the code fragment from Figure 4.8 is treated as the body of an inner
loop and unrolled four times. The residual loop handles any iteration that occurs if the
superblock is exited, which, in turn, occurs when the unpredicted path is selected. If the
expected frequency of the residual loop were still high, a superblock could be created for that
loop as well.
        The superblock approach reduces the complexity of bookkeeping and scheduling
versus the more general trace generation approach, but may enlarge code size more than a
trace-based approach.
         FIGURE 4.10 This superblock results from unrolling the code in Figure 4.8 four times
         and creating a superblock.
