CSC/ECE 506: Architecture of Parallel Computers
Summer 2003 - Problem Set 1
Problems 1, 4 and 5 will be graded. There are 60 points on these problems. Note: You must do
all the problems, even the non-graded ones. If you do not do some of them, half as many points
as they are worth will be subtracted from your score on the graded problems.
Problem 1. (25 points)
Consider a computer with four processors P1, P2,
P3, and P4 and six memory modules M1, M2 …. M6. I1
The four processors can be configured as an MIMD
machine or an SIMD machine. The processors
execute six instructions with precedence constraints I I I
2 3 4
illustrated in the diagram at the right. Each time a
processor executes an instruction, it references one
memory module, as given in the table below. It
references this memory module for three consecutive I
memory cycles (to fetch the two operands and store
the result). In SIMD mode, instruction i must be
completely finished (by all processors) before
instruction i+1 can begin. In MIMD mode, each
processor can work on its instructions independently of
the other processors, but it must still obey the
precedence constraints for its own instructions. Also,
it cannot execute more than one instruction at a time.
P1 P2 P3 P4
I1 M2 M4 M3 M2
I2 M1 M3 M2 M6
I3 M4 M6 M5 M6
I4 M3 M4 M4 M2
I5 M2 M2 M2 M1
I6 M1 M5 M5 M1
(a) What is the average memory bandwidth (words/cycle) for this program in MIMD mode?
(b) What is the average memory bandwidth (words/cycle) for this program in SIMD mode?
Problem 2. (25 points) Sometimes seemingly trivial problems are not well adapted to an SIMD
processor when they are coded in a straightforward fashion. Consider the array processor
discussed in class.
(a) Write an assembly-language program for it to find the transpose of an N N matrix. That is,
given an N N matrix A, find B such that B = AT. Comment your program. In addition to the
instructions presented in class, you may also use the following.
RRL i, j Rotate index register i circularly left by j bits.
RRR i, j Rotate index register i circularly right by j bits.
(b) How could your program be improved if A [0, 0] A [0, 1] … A [0, N –1]
skewed storage were used for the array A ?
That is, assume that the elements of A A [1, N –1] A [1, 0] … A [1, N –2]
were stored as shown at the right. You …
A [2, N –2] A [2, N –1] A [2, N –3]
need not actually code the program; just
explain how it would be better than the
program for part (a). Assume that each
ALU has its own private index registers, in A [N –1, 1] A [N –1, 2] … A [ N –1, 0]
addition to those in the control unit.
Problem 3. (25 points) Consider the simple problem of “triangulating” a polygon. To study the
properties of a polygonal surface that depend on each point on that surface, the surface is divided
into n small triangular regions. Each triangular region is an aggregation of points represented by a
single point in the triangle. The complex equations are solved on each such representative point,
and the resulting values at each point are summed to determine the current state of the surface at
that instant. Assume that it takes 1 unit of time per point to solve the equations and 1 unit of time
per point to calculate the sum.
(a) What would be the total cost of computation in a sequential program for this application?
(b) If this application is decomposed into a 2-phase program, describe each phase. What would
be the cost of computation for the serial phase and the parallel phase, total cost of computation
and the maximum achievable speedup taking into consideration the number of processors,
regardless of the processors?
(c) If the application is instead decomposed into a 3-phase program, what would be the answer to
the above question? Instead of determining the maximum achievable speedup regardless of the
processors, please explain what happens to the speedup as the number of processors increases,
assuming n >> p.
Problem 4. (10 points) Which of our case-study applications (Ocean, Barnes-Hut, Raytrace, and
Data Mining) do you think are amenable to decomposing data rather than computation, and using
an owner-computes rule in parallelization? What do you think the problem(s) would be with using
a strict data-distribution and owner-computes rule in the others?
Problem 5. (25 points) A parallel computation on an n-processor system can be characterized
by a pair P(n), T(n), where P(n) is the total number of instructions executed by all the
processors and T(n) is the elapsed execution time for the entire system (measured in number of
instructions). (You may assume that all instructions take the same amount of time.) P(n), for
n > 1, may be greater than P(1) because some of the processors have to do extra “redundant”
work to synchronize or avoid excessive memory contention. However, assume that P(n) is never
less than P(1).
In a serial computation, all instructions are performed by a single processor, so P (1) = T (1).
Usually, for n > 1, T(n) < P(n) because the computation will finish faster on a multiprocessor.
Lee (1980) has suggested five performance indices for comparing a parallel computation with a
T(1) (The speedup)
S(n) = T(n)
E(n) = nT(n) (The efficiency )
R(n) = P(1) (The redundancy)
U(n) = n T(n) (The utilization)
T 3(1) (The quality)
Q (n) =
n T 2(n) P(n)
Note: T 2(n ) = T (n ).T (n )
(a) Prove that the following relationships hold in all possible comparisons of parallel to serial
(1) 1 ≤ S (n ) ≤ n (3) U (n ) = R (n ).E (n )
(2) E (n ) = S (n ) (4) Q (n ) = S (n ).E (n )
n R (n )
(b) Based on the above definitions and relationships, give physical meanings of these