CSC 462/562 Homework #1
Due: Monday, January 22
Graduate students answer all 6 questions. Undergraduates answer any 5 questions (answer all 6 for extra credit).
Word process all answers. Figures may be hand drawn. Show your work for partial credit.
1) A given benchmark consists of 35% loads, 10% stores, 16% branches, 27% integer ALU operations, 8% FP
+/-, 3% FP * and 1% FP /. We want to compare the benchmark as run on two processors, as described
below. Which processor is faster and by how much? IC will be the same for both machines.
Instruction type    Processor 1    Processor 2
ALU                 4 CPI          3 CPI
Load/store          5 CPI          4 CPI
Branch              3 CPI          2 CPI
FP +, -             8 CPI          4 CPI
FP *                10 CPI         6 CPI
FP /                30 CPI         15 CPI
Clock rate          2.5 GHz        2.0 GHz
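The comparison above rests on CPU time = IC * CPI * clock cycle time, where CPI is the average weighted by the instruction mix. A minimal sketch of that calculation follows. The function names and every number in it are illustrative assumptions, deliberately not the values from this problem.

```python
def average_cpi(mix, cpi):
    """Weighted-average CPI: sum of (fraction of IC) * (CPI of that class)."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "mix fractions must sum to 1"
    return sum(frac * cpi[name] for name, frac in mix.items())


def cpu_time(ic, avg_cpi, clock_hz):
    """CPU time in seconds: IC * CPI / clock rate."""
    return ic * avg_cpi / clock_hz


# Hypothetical instruction mix and CPI tables (assumed for illustration only).
mix = {"alu": 0.5, "mem": 0.3, "branch": 0.2}
cpi_a = {"alu": 1, "mem": 4, "branch": 2}
cpi_b = {"alu": 2, "mem": 3, "branch": 2}
ic = 1_000_000  # same IC on both machines, so it cancels in the ratio

time_a = cpu_time(ic, average_cpi(mix, cpi_a), 2.0e9)
time_b = cpu_time(ic, average_cpi(mix, cpi_b), 2.5e9)

# Values above 1 mean machine B is faster than machine A.
speedup_b_over_a = time_a / time_b
print(f"CPI A = {average_cpi(mix, cpi_a):.2f}, CPI B = {average_cpi(mix, cpi_b):.2f}")
print(f"B is {speedup_b_over_a:.3f} times as fast as A")
```

Because the IC is identical on both machines, only the ratio of (CPI / clock rate) matters; a lower CPI can outweigh a slower clock, or the reverse.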
2) Architects have added numerous registers to a processor and are deciding whether the registers should be
used as ordinary registers in order to reduce the number of loads and stores, or used for parameter passing
by making them register windows (see the figure below). In the former case, an optimizing compiler can
successfully remove 40% of the loads and 60% of the stores from a given benchmark. In the latter case,
procedure calls and returns no longer need to access memory (cache) because the values being passed and
returned are held in registers, so the CPI of procedure calls and returns drops from 15 to 4.
In both cases, the clock rate will be the same, so this value can be factored out of your comparison. Given
the following CPI values and the benchmark’s breakdown of instructions, how should the architects use
these new registers, as normal or as register windows? Demonstrate your answer by determining which
provides the greater speedup on the benchmark and by how much. NOTE: the first approach will require
that you determine a new breakdown of instructions because the IC will change by removing loads and
stores such that all percentages will need to be adjusted (we covered a problem like this in class).
CPI: Load/Store: 4, ALU and Unconditional Branch: 2, Conditional Branch: 3, Procedure Call and Return: 15
Benchmark breakdown: 40% Load, 13% Store, 31% ALU, 8% Conditional Branch, 2% Unconditional Branch,
3% Procedure Call, 3% Return
Figure: register windows.
Local variables at level j that will be passed as parameters to the next level, j+1, are stored in the
“temporary registers”, whereas local variables that are not passed are stored in “local registers”.
No data movement is necessary; instead, the CPU merely shifts its focus in the set of registers in the
window by moving to the next group counterclockwise (in the figure) for any function call, and clockwise
upon function return.
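The note in problem 2 about adjusting the percentages when the IC changes can be sketched as follows. This is a minimal sketch under stated assumptions: the helper name `remove_fraction` and the two-class mix are hypothetical and are not the benchmark given in the problem.

```python
def remove_fraction(mix, removed):
    """Remove a fraction of some instruction classes.

    Returns (new_mix, relative_ic): the renormalized mix and the new IC
    expressed as a fraction of the original IC.
    """
    remaining = {name: frac * (1.0 - removed.get(name, 0.0))
                 for name, frac in mix.items()}
    relative_ic = sum(remaining.values())
    new_mix = {name: frac / relative_ic for name, frac in remaining.items()}
    return new_mix, relative_ic


def average_cpi(mix, cpi):
    """Weighted-average CPI over an instruction mix."""
    return sum(frac * cpi[name] for name, frac in mix.items())


# Hypothetical two-class benchmark (assumed for illustration only).
mix = {"load": 0.4, "alu": 0.6}
cpi = {"load": 4, "alu": 2}

# Suppose a compiler removes half of the loads.
new_mix, relative_ic = remove_fraction(mix, {"load": 0.5})

# With the clock rate unchanged, speedup = (IC_old * CPI_old) / (IC_new * CPI_new).
old_cpi = average_cpi(mix, cpi)
new_cpi = average_cpi(new_mix, cpi)
speedup = (1.0 * old_cpi) / (relative_ic * new_cpi)
print(new_mix, relative_ic, speedup)
```

Note that the average CPI and the IC both change here, so comparing CPI alone would be misleading; the product IC * CPI is what determines the speedup.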
3) Including a dual processor in a computer gives the computer the potential for a 2 times speedup if the
second processor can be utilized full time. To take advantage of this when running a single program, the
program must be completely parallelized. This is not practical. However, the dual processor can still
provide a decent amount of speedup if a program can be parallelized enough. Provide a table that shows
how much speedup the computer will gain in using the dual processor if a program can be parallelized by
each of the following: 25%, 50%, 75%, 90%, 95% and 99%.
4) Architects are considering implementing one of three enhancements to a processor. The first offers a 3
times speedup in enhanced mode, which is available 25% of the time. The second offers a 20 times
speedup in enhanced mode, which is available 10% of the time. The third offers a 1.5 times speedup in
enhanced mode, which is available 60% of the time. Which enhancement should be selected? Show why
by computing the overall speedup for each enhancement.
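Problems 3 and 4 both apply Amdahl's law: overall speedup = 1 / ((1 - F) + F / S), where F is the fraction of time the enhancement can be used and S is the speedup while it is in use. A minimal sketch follows; the function name and the sample inputs are illustrative assumptions, not the values asked for above.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution runs speedup_enhanced times faster."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical inputs (assumed for illustration only).
for fraction in (0.4, 0.8, 1.0):
    print(f"F = {fraction:.0%}, S = 4: overall speedup = {amdahl_speedup(fraction, 4):.3f}")
```

The unenhanced fraction (1 - F) bounds the result: even as S grows without limit, the overall speedup cannot exceed 1 / (1 - F).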
5) In the 1980s and 1990s, architects debated whether the RISC or CISC approach was better. The list below
denotes some of the differences in philosophy between the two forms of architecture. For each of the
following, explain how it would improve CPU time in terms of which of the following in our CPU time
formula would be decreased: IC, CPI, Clock Cycle Time, or some combination. NOTE: some of these
may increase but you do not need to discuss what increases, only what decreases.
a. In RISC, a large number of registers is available; a CISC machine has fewer
b. In CISC, there can be complex addressing modes such as indirect addressing to obtain the datum
pointed to by a pointer
c. In RISC, a pipeline is used to perform each part of the fetch-execute cycle as an independent stage
d. In CISC, variable sized instruction lengths are common so that multiple memory operands can be
accessed at the same time
6) A floating point benchmark has the following breakdown of instructions executed (note: these are the
number of instructions, not percentages):
Loads: 103,198
Stores: 28,998
Branches: 37,643
Integer ALU operations: 75,387
FP +/-: 53,837
FP *: 12,111
FP /: 3,002
FP Comparisons: 6,391
A processor executes floating point operations as sequences of integer operations as follows:
FP +/-: 8 integer operations for 1 +/-
FP *: 32 integer operations for 1 *
FP /: 128 integer operations for 1 /
FP Comparison: 10 integer operations for 1 compare
Assume loads and stores have a CPI of 4 and branches and integer operations have a CPI of 3.
a. If the processor has a clock rate of 2.5 GHz, what is the machine’s MIPS rating? MIPS can be
computed as clock rate / (CPI * 10^6), where CPI is the average CPI over the integer operation IC.
b. If we replace the processor with one capable of performing floating point operations in hardware,
such that FP +/- and compare are performed in 8 cycles, * in 12 and / in 25, what is the machine’s
new MIPS rating? To compute this, you have to recompute the integer IC because all of the FP
operations are now being performed in FP hardware and not as sequences of integer operations.
c. How much faster is the machine from part b over the machine from part a?
d. Does the change in your MIPS rating between parts a and b roughly agree with your answer in part c?
If not, try to explain why not.
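The bookkeeping for problem 6 can be sketched as below, using the standard definition MIPS = IC / (execution time * 10^6), which is equivalent to clock rate / (average CPI * 10^6). The helper names `expand_fp` and `mips_rating`, and all counts, CPIs and the clock rate, are hypothetical assumptions rather than the figures given above.

```python
def expand_fp(counts, int_ops_per_fp):
    """Rewrite each software-emulated FP operation as its sequence of integer operations."""
    expanded = {name: n for name, n in counts.items() if name not in int_ops_per_fp}
    extra = sum(counts[name] * ops
                for name, ops in int_ops_per_fp.items() if name in counts)
    expanded["int"] = expanded.get("int", 0) + extra
    return expanded


def mips_rating(counts, cpi, clock_hz):
    """MIPS = instruction count / (execution time in seconds * 10^6)."""
    ic = sum(counts.values())
    cycles = sum(n * cpi[name] for name, n in counts.items())
    exec_time = cycles / clock_hz
    return ic / (exec_time * 1e6)


# Hypothetical benchmark counts (assumed for illustration only).
counts = {"mem": 100, "int": 200, "fp_add": 50}
cpi = {"mem": 4, "int": 3}

# Assume each FP add is emulated by 8 integer operations.
emulated = expand_fp(counts, {"fp_add": 8})
rating = mips_rating(emulated, cpi, 2.0e9)
print(emulated, f"{rating:.2f} MIPS")
```

Keeping the IC and the cycle count as separate totals makes it straightforward to compute execution time as well as MIPS for each configuration, which is what the comparison in parts c and d needs.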