

									                    CS 516 – Computer Architecture
                         Midterm Exam No.2
                              Fall, 2005
                                 12:00-1:15 P.M.
                                November 7, 2005
This is a closed-book, closed-note exam. There are 7 questions in this exam.
You have 75 minutes to finish the questions. Please write your answers on separate
sheets of paper. To avoid grading problems, please staple your papers in ascending
order of question number. All the questions in this exam are mandatory (no optional
questions).

Notice 1: the questions that appear in this exam are not the same questions you had in
the sample exam. I tried to make the style and the organization of the questions
as similar as possible. If you have any questions, please let me know during this
exam.

Notice 2: For most of the questions, long solutions are NOT expected. As long as your
solution contains the key idea or key word, that’s fine. How much time you spend on
each question is up to you, but it is assumed that time management is your responsibility
(if you run out of time just because you spend too much time on some questions, the
instructor will not be responsible for that).

Student ID (the last four digits): ___________________

QUESTION #1 (10 minutes):
(1) Define “out-of-order execution”. What kind of pipeline does out-of-order execution

(2) Mention examples of static and dynamic instruction scheduling techniques (three different
    examples for each group are required for full credit).

(3) Name three different types of cache misses and their solutions.

(4) Show the procedure of memory access – the 6 steps from the beginning to the end of a
    memory access (you do NOT have to describe each step, just name the six steps in the
    correct order).

(5) What are possible causes for a pipeline data-path not to execute instructions in the perfect
    way (with some waste in processor cycles)? Mention all the possible causes we discussed
    in the classroom. In presenting your solutions, please do this in the next two minutes:
         First, identify all the possible causes discussed in the classroom.
         Then, classify (group) them in terms of how a pipeline data-path processor handle

QUESTION #2 (10 minutes):
(1) We discussed that we cannot increase “k” indefinitely in k-stage pipeline data-path
    processors. Mention two technical reasons for this.

(2) In which stage of Score-boarding are WAW dependencies solved? What is (are) the
    condition(s) that guarantees that WAW dependencies will not cause a problem?

(3) What are the benefits in using Bernstein’s 3-conditions (try to mention three different

(4) Derive (mathematically) the speed-up factor of a k-stage pipeline processor over a scalar
    processor (show all your intermediate work).

(5) What are the limiting factors in dynamic branch predictions (limiting factors in a sense
    that dynamic branch predictions may not give us any benefit due to those factors – or you
    could think of “possible problems” in dynamic branch predictions)? Mention three.
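
For part (4), the standard textbook derivation can be sketched as follows. This is a sketch under stated assumptions, not necessarily the form expected in class: the symbols n (number of instructions) and \tau (clock period) are introduced here for illustration, every stage is assumed to take exactly one clock, and the pipeline is assumed never to stall.

```latex
% Scalar (non-pipelined) processor: each instruction occupies all k stages in turn.
T_{\text{scalar}} = n \, k \, \tau

% k-stage pipeline: k cycles to fill, then one instruction completes per cycle.
T_{\text{pipe}} = \bigl(k + (n - 1)\bigr) \, \tau

% Speed-up factor and its limit for a long instruction stream.
S_k = \frac{T_{\text{scalar}}}{T_{\text{pipe}}} = \frac{n k}{k + n - 1},
\qquad
\lim_{n \to \infty} S_k = k
```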

QUESTION #3 (10 minutes):
Homework Exercise Question (modified): For the following level-2 cache, find the slowest
main memory access latency that achieves an effective memory access time of 9 ns. Show all
your work (10% of the credit goes to the correct solution and 90% to showing the correct
intermediate work).

      85% of memory accesses are READ accesses
      65% of WRITE accesses are cache misses
      90% of READ accesses are cache hits
      70% of WRITE misses are clean misses
      35% of READ misses are dirty misses
      Anything other than the L2 cache and the main memory can be ignored.

Notice: You do not have to complete your calculation. Show your equation with all values
        correctly set up.

QUESTION #4 (10 minutes)

Exercise Question #4.8: The following loop computes Y[i] = a × X[i] + Y[i], the key step in
a Gaussian elimination. Assume the pipeline latencies as shown in Figure 1 and a 1-cycle
delayed branch. Also assume a single-issue pipeline. Unroll the loop as many times as necessary to
schedule it without delay, collapsing the loop overhead instructions. Show the schedule.

              loop: L.D             F0, 0(R1)           ; load X[i]
                    MUL.D           F0, F0, F2          ; multiply a * X[i]
                    L.D             F4, 0(R2)           ; load Y[i]
                    ADD.D           F0, F0, F4          ; add a*X[i] + Y[i]
                    S.D             0(R2), F0           ; store Y[i]
                    DSUBUI          R1, R1, #8          ; decrement X index
                    DSUBUI          R2, R2, #8          ; decrement Y index
                    BNEZ            R1, loop            ; loop if not done

Instruction producing results      Instruction using results        Latency in clock cycles
          FP ALU op                      Another FP ALU op                      3
          FP ALU op                         Store Double                        2
         Load Double                         FP ALU op                          1
         Load Double                        Store Double                        0
Figure 1 - The pipeline latencies for various combinations of instructions.
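
As a starting point for Question #4, the stalls in the loop as written (before any unrolling or rescheduling) can be counted with a small script. This is a minimal sketch, not the expected answer. It assumes that any producer/consumer pair not listed in Figure 1 (for example, an integer operation feeding the branch) needs no stall, and it does not count the branch delay slot. The names `LATENCY`, `LOOP`, and `issue_cycles` are hypothetical.

```python
# Stall cycles required between a producer and a consumer, from Figure 1.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
}

# (instruction class, destination register, source registers) for the loop as written.
LOOP = [
    ("load",   "F0", ["R1"]),        # L.D    F0, 0(R1)
    ("fp_alu", "F0", ["F0", "F2"]),  # MUL.D  F0, F0, F2
    ("load",   "F4", ["R2"]),        # L.D    F4, 0(R2)
    ("fp_alu", "F0", ["F0", "F4"]),  # ADD.D  F0, F0, F4
    ("store",  None, ["F0", "R2"]),  # S.D    0(R2), F0
    ("int",    "R1", ["R1"]),        # DSUBUI R1, R1, #8
    ("int",    "R2", ["R2"]),        # DSUBUI R2, R2, #8
    ("branch", None, ["R1"]),        # BNEZ   R1, loop
]


def issue_cycles(instructions):
    """Return the issue cycle of each instruction; the first one issues in cycle 1."""
    producer = {}  # register -> (issue cycle, class) of its most recent writer
    cycles = []
    cycle = 0
    for kind, dest, sources in instructions:
        cycle += 1  # earliest possible slot on a single-issue pipeline
        for reg in sources:
            if reg in producer:
                issued, producer_kind = producer[reg]
                # ASSUMPTION: pairs missing from Figure 1 need no stall cycles.
                gap = LATENCY.get((producer_kind, kind), 0)
                cycle = max(cycle, issued + gap + 1)
        cycles.append(cycle)
        if dest is not None:
            producer[dest] = (cycle, kind)
    return cycles


cycles = issue_cycles(LOOP)
stalls = cycles[-1] - len(LOOP)  # cycles in which no instruction issues
print(cycles, stalls)
```

Under these assumptions the stalls fall before MUL.D, ADD.D, and S.D, which are the gaps that unrolling and interleaving independent iterations are meant to fill.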

QUESTION #5 (10 minutes):
Exercise Question on page 406: Assume that we have two identical computer systems (System A and
B). The only difference between A and B is their cache scheme. System A has a 16KB instruction
cache with 16KB data cache while System B has a 32KB unified cache. Use the miss rates in Figure 2
below to help calculate the correct answer, assuming 45% of the instructions are data transfer instructions.

               Size       Instruction Cache          Data Cache       Unified Cache
               8 KB               8.16                   44.0               63.0
              16 KB               3.15                   42.0               51.0
              32 KB               1.36                   38.4               46.7
              64 KB               0.61                   36.9               39.4
             128 KB               0.30                   35.3               36.2
             256 KB               0.02                   32.6               32.9
Figure 2 - Misses per 1,000 instructions for instruction, data and unified caches of different cache
sizes.

Question: Which of A and B is better? How many cache misses will be caused for executing 1,000
instructions by A and B? Show all your intermediate work for full credit.
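
The arithmetic for this comparison can be sketched as below. It is a sketch, not the official solution. It assumes the Figure 2 values are read exactly as the caption states (misses per 1,000 instructions), that every instruction makes one instruction fetch, and that each data transfer instruction makes one data access. It compares miss counts and miss rates only, ignoring hit time and any port conflicts in the unified cache. All names are hypothetical.

```python
DATA_TRANSFER_FRACTION = 0.45  # given: 45% of instructions are data transfers

# Misses per 1,000 instructions, taken from Figure 2.
INSTRUCTION_16KB = 3.15
DATA_16KB = 42.0
UNIFIED_32KB = 46.7

# ASSUMPTION: one instruction fetch per instruction plus one data access
# per data transfer instruction.
accesses_per_instruction = 1.0 + DATA_TRANSFER_FRACTION


def miss_rate(misses_per_1000, accesses_per_instr):
    """Convert misses per 1,000 instructions into misses per memory access."""
    return (misses_per_1000 / 1000.0) / accesses_per_instr


# System A: split 16 KB instruction cache + 16 KB data cache.
misses_a = INSTRUCTION_16KB + DATA_16KB
# System B: 32 KB unified cache.
misses_b = UNIFIED_32KB

rate_a = miss_rate(misses_a, accesses_per_instruction)
rate_b = miss_rate(misses_b, accesses_per_instruction)

print(f"A: {misses_a:.2f} misses per 1,000 instructions, miss rate {rate_a:.4f}")
print(f"B: {misses_b:.2f} misses per 1,000 instructions, miss rate {rate_b:.4f}")
```

With these numbers System A incurs fewer misses per 1,000 instructions than System B, so under a miss-count criterion alone A comes out ahead.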

QUESTION #6 (10 minutes):
Example on page 452 (modified): For the following processor and memory chips
organization, answer the following two questions.

        The memory chips have the following properties. These three components can be
         pipelined, and neither the number of pipeline components nor the latency of each
         delay component can be changed:
                   o For a processor to issue a memory address (and the issued address
                      reach memory chips): 3 cycles are needed
                   o For memory chip to stabilize (after it receives address signal): 12 cycles
                      are needed.
                   o After a memory chip outputs data signals and before a processor
                      receives the signals: 3 cycles are needed.
        The processor in this system has a 16-bit memory-bus interface (you cannot change
         this assumption).
        Each of the given memory chips has a 16-bit interface (i.e., 16 data pins). You cannot
         change this assumption.
        Data in the processor address space (i.e., the addresses as observed by the processor)
         is always accessed in contiguous order (not random addresses). You cannot
         change this assumption.
        Memory block size = 2 bytes (you can not change this assumption).

You may make your own assumptions about any factor not mentioned above, but you must
describe them clearly so that Fujinoki can understand them. Making your assumptions clear
is your responsibility; unclear descriptions of your assumptions are subject to major
penalties.

Question #1: Find and show (using a figure) the memory organization that will minimize the
             memory access latency using the given memory chips.

              Notice: Showing a figure that explains the correct ideas is your
              responsibility. A figure that is not neat enough, or that lacks the technical
              details needed to show the correct ideas, will be subject to minor or major
              penalties, and I have the right to judge the quality of your work.

Question #2: Calculate the optimum memory access latency (in cycles) for the memory
             organization you come up with in Question #1. Assume that the number of
             memory accesses (i.e.,  memory accesses) will be large enough so that you
             can handle it as infinity.
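
One way to explore Question #2 is to model an interleaved organization and count cycles. The sketch below is only an illustration under its own assumptions, not the intended answer: consecutive addresses are assumed to map to banks in round-robin order, the processor is assumed to issue at most one address per cycle, each bank is assumed to be busy for the full 12 cycles per access, and the address and data paths are assumed to be fully pipelined. The function name `total_cycles` and the bank counts tried are hypothetical.

```python
ADDRESS_CYCLES = 3   # processor issues address -> address reaches the chip
ACCESS_CYCLES = 12   # chip stabilizes after receiving the address
DATA_CYCLES = 3      # chip outputs data -> processor receives it


def total_cycles(n_accesses, n_banks):
    """Cycles from the first address issue until the last data word arrives.

    ASSUMPTIONS (not from the exam): round-robin interleaving across banks,
    at most one address issued per cycle, fully pipelined address/data paths.
    """
    bank_free = [0] * n_banks  # cycle at which each bank can accept a new address
    issue = -1
    finish = 0
    for j in range(n_accesses):
        bank = j % n_banks
        # Issue one cycle after the previous address, but late enough that the
        # address arrives no earlier than the moment the target bank is free.
        issue = max(issue + 1, bank_free[bank] - ADDRESS_CYCLES)
        arrival = issue + ADDRESS_CYCLES
        bank_free[bank] = arrival + ACCESS_CYCLES
        finish = bank_free[bank] + DATA_CYCLES
    return finish


for banks in (1, 6, 12):
    n = 100_000
    print(banks, total_cycles(n, banks) / n)  # average cycles per access
```

In this model a single access costs 3 + 12 + 3 = 18 cycles, and for a long stream the average cost per access approaches 12 divided by the number of banks, bounded below by one cycle per access.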

QUESTION #7 (15 minutes):

Exercise Question #3.10 (modified): Two-layer dynamic branch prediction with 1 bit for the
state-transition diagram and 1 bit for the history bit (called a “(1, 1) predictor”; see page 203 of
Hennessy & Patterson) can be implemented using a 2-bit state-transition diagram. Construct the 2-bit
state-transition diagram for the following

  Prediction bits    Prediction if last branch not taken    Prediction if last branch taken
      NT/NT                          NT ()                              NT ()
       NT/T                          NT ()                               T ()
       T/NT                           T ()                              NT ()
        T/T                           T ()                               T ()
Table 1 - Definition of this (1, 1) predictor.

Assume that the prediction starts at “NT/NT” state.
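
The behaviour of the predictor defined in Table 1 can be explored with a short simulation before drawing the diagram. This is a minimal sketch under assumptions that the exam text does not state: the selected prediction bit is simply overwritten with the actual outcome (the usual 1-bit update rule), and the history bit initially records "not taken". The function name and the sample outcome sequence are hypothetical.

```python
def simulate_1_1_predictor(outcomes, last_taken=False):
    """Return the prediction made for each branch outcome (True = taken).

    bits[0] is the prediction used when the last branch was not taken,
    bits[1] the prediction used when the last branch was taken.
    ASSUMPTIONS: 1-bit update rule; initial history is "not taken".
    """
    bits = [False, False]  # start in the NT/NT state
    predictions = []
    for taken in outcomes:
        index = int(last_taken)       # the history bit selects one prediction bit
        predictions.append(bits[index])
        bits[index] = taken           # overwrite the selected bit with the outcome
        last_taken = taken            # update the 1-bit history
    return predictions


sample = [True, True, False, True]    # hypothetical branch outcomes
predicted = simulate_1_1_predictor(sample)
mispredictions = sum(p != a for p, a in zip(predicted, sample))
print(predicted, mispredictions)
```

Tracing the pair (prediction bits, history bit) through such a run shows which combinations are actually reachable from NT/NT, which is useful when deciding how the four states of the diagram should be labelled.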

Complete the incomplete state-transition diagram (shown later) by performing:
    (a) Identify the three states other than (NT/NT).

     (b) Fill in the prediction for each of the four states. For example, if a state predicts
         “Branch Taken”, fill in the state with “T” (Figure (b)-1), while for a prediction of “Not
         Taken” fill in the state with “N” (Figure (b)-2).

                              T                               N
                            (b)-1                           (b)-2
     (c) For each transition, attach two symbols for the actual activity and corresponding
          activity ( through ) in Table 1 ("T" means "if a branch taken").


Hint: "Prediction bits" do not represent the past two results of a conditional branch
      instruction (because only one bit in the two represents the history while the other does

The following is a template you can start with:


                            (NT/NT)                                                (NT/T)


CS516, Computer Architecture, Midterm Exam #2 Solutions, Fall 2005, November 7, 2005

