					                             TERM PAPER



                     TABLE OF CONTENTS

   1. History

   2. Introduction

   3. Superpipelining

   4. Superscalar Architecture

          o Characteristics of Superscalar Architectures
          o Synchronous Superscalar Architecture
          o Asynchronous Superscalar Architecture

   5. From Scalar to Superscalar

   6. Window of Execution

   7. Data Dependencies

   8. Register Renaming

   9. Final Comments

   10. Limitations on Superscalar Architecture

   11. Some Architectures

   12. References

The Pentium was the first superscalar x86 processor. The Nx586, Pentium Pro and
AMD K5 were among the first designs that decoded x86 instructions asynchronously
into dynamic, microcode-like micro-op sequences prior to actual execution on a
superscalar microarchitecture. This opened the way for dynamic scheduling of buffered
partial instructions and enabled more parallelism to be extracted than the more rigid
methods used in the simpler Pentium. It also simplified speculative execution and
allowed higher clock frequencies compared to designs such as the advanced Cyrix 6x86.

      A superscalar architecture is one in which several instructions can be initiated
       simultaneously and executed independently.
      Pipelining allows several instructions to be executed at the same time, but they
       have to be in different pipeline stages at a given moment.
      Superscalar architectures include all features of pipelining but, in addition, there
       can be several instructions executing simultaneously in the same pipeline stage.
       They have the ability to initiate multiple instructions during the same clock cycle.

Two approaches are typically used today to improve performance:
1. Superpipelining
2. Superscalar


      Superpipelining is based on dividing the stages of a pipeline into substages and
       thus increasing the number of instructions which are supported by the pipeline at
       a given moment.
      By dividing each stage into two, the clock cycle period t will be reduced to
       half, t/2; hence, at the maximum capacity, the pipeline produces a result every
       t/2 s.
      For a given architecture and the corresponding instruction set there is an optimal
       number of pipeline stages; increasing the number of stages over this limit reduces
       the overall performance.
      A solution to further improve speed is the superscalar architecture.
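
The timing argument above can be checked with a small calculation. The sketch below assumes an idealized pipeline with no stalls, in which n instructions on a k-stage pipeline with cycle time t take (k + n - 1) * t. The function name and the sample numbers are illustrative assumptions, and the model ignores the per-stage overhead that eventually makes extra stages counterproductive.

```python
def pipeline_time(n_instructions, n_stages, cycle_time):
    """Total time for an ideal, stall-free pipeline.

    The first instruction needs n_stages cycles to fill the pipeline;
    after that, one result is produced per cycle.
    """
    return (n_stages + n_instructions - 1) * cycle_time


# Assumed example: 100 instructions, 4 stages, cycle time t = 1.0.
base = pipeline_time(100, 4, 1.0)

# Superpipelined version: each stage split in two, cycle time t/2.
superpipelined = pipeline_time(100, 8, 0.5)

speedup = base / superpipelined
```

For long instruction sequences the speed-up approaches 2, because the fill time becomes negligible and a result is produced every t/2.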

                                 Superscalar execution


Superscalar processors improve performance by reducing the average number of cycles
required to execute each instruction (CPI). This is accomplished by issuing and executing
more than one independent instruction per cycle, rather than limiting execution to just
one instruction per cycle as in traditional pipelined architectures. The number of
independent instructions available per cycle is called the available instruction-level
parallelism. For superscalar architectures to experience speed-up over traditional
pipelined architectures they require the average level of available instruction-level
parallelism to be greater than one.
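
The relationship between issue width, available instruction-level parallelism and CPI can be sketched as follows. This is a simplified bound under the assumption of an ideal machine with no stalls; the function name and the sample values are hypothetical and only serve to illustrate the condition stated above.

```python
def ideal_cpi(issue_width, available_ilp):
    """Lower bound on cycles per instruction for an idealized machine.

    The hardware cannot issue more than `issue_width` instructions per cycle,
    and the program cannot supply more than `available_ilp` independent ones.
    """
    instructions_per_cycle = min(issue_width, available_ilp)
    return 1.0 / instructions_per_cycle


scalar_pipeline = ideal_cpi(1, 2.5)   # traditional pipeline, one issue per cycle
superscalar = ideal_cpi(4, 2.5)       # four-way issue, limited by the program
no_parallelism = ideal_cpi(4, 1.0)    # average parallelism of one: no speed-up
```

With an average parallelism of one, the four-way machine is bounded by the same CPI as the traditional pipeline, which is why the parallelism has to exceed one for any gain.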

Characteristics of Superscalar Architectures

      Superscalar architectures allow several instructions to be issued and completed
       per clock cycle.
      A superscalar architecture consists of a number of pipelines that are working in
       parallel.
      Depending on the number and kind of parallel units available, a certain number of
       instructions can be executed in parallel.
      In the following example a floating point and two integer operations can be issued
       and executed simultaneously; each unit is pipelined and can execute several
       operations in different pipeline stages.

Synchronous Superscalar Architecture

This section highlights some features of a typical synchronous superscalar pipeline with
out-of-order instruction issue. The pipeline is capable of fetching and executing multiple
instructions on each clock cycle, and is typically supported by branch prediction and
speculative execution in order to maintain a high instruction bandwidth.

Asynchronous Superscalar Architecture

In synchronous architectures, the control mechanism has a rigid, periodic interaction with
the datapath. Operations are initiated by the control unit and must complete within fixed
multiples of clock cycles. This produces predictable and deterministic behaviour which
may be exploited. However, the components of such a system must be designed to
minimize the critical path to ensure a low clock period, even if this path is rarely taken.
As a result, functional components lie idle for a proportion of the clock period, even
though utilisation is high when measured in clock cycles. This is essentially a time-driven
approach to the design of the interface between the control and the datapath. In contrast,
one can implement an event-driven version of this interface using asynchronous circuits.

                               Superscalar Architectures

                     FROM SCALAR TO SUPERSCALAR
The simplest processors are scalar processors. Each instruction executed by a scalar
processor typically manipulates one or two data items at a time. By contrast, each
instruction executed by a vector processor operates simultaneously on many data items.
An analogy is the difference between scalar and vector arithmetic. A superscalar
processor is, in a sense, a mixture of the two. Each instruction processes one data item,
but there are multiple redundant functional units within each CPU, so multiple
instructions can be processing separate data items concurrently.

Superscalar CPU design emphasizes improving the accuracy of the instruction dispatcher
and allowing it to keep the multiple functional units in use at all times. This has become
increasingly important as the number of units has increased.

In a superscalar CPU the dispatcher reads instructions from memory and decides which
ones can be run in parallel, dispatching them to redundant functional units contained
inside a single CPU. Therefore a superscalar processor can be envisioned having multiple
parallel pipelines, each of which is processing instructions simultaneously from a single
instruction thread.

                           WINDOW OF EXECUTION
    Window of execution:
     The set of instructions that is considered for execution at a certain moment. Any
     instruction in the window can be issued for parallel execution, subject to data
     dependencies and resource constraints.

    The number of instructions in the window should be as large as possible.
        - Capacity to fetch instructions at a high rate
        - The problem of branches

    The window of execution is extended over basic block borders by branch
     prediction and speculative execution.

    With speculative execution, instructions of the predicted path are entered into the
     window of execution.

Instructions from the predicted path are executed tentatively. If the prediction turns out to
be correct the state change produced by these instructions will become permanent and
visible (the instructions commit); if not, all effects are removed.
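
The commit-or-discard behaviour described above can be sketched in a few lines. This is a minimal model, assuming that the machine state is just a dictionary of register values and that the predicted path is a list of (register, value) writes; these names are illustrative and do not describe any particular processor.

```python
def run_speculatively(registers, predicted_path, prediction_correct):
    """Execute the predicted path tentatively, then commit or discard it.

    registers:          dict mapping register names to values (committed state)
    predicted_path:     list of (dest_register, value) writes on the predicted path
    prediction_correct: outcome of the branch once it is finally resolved
    """
    tentative = dict(registers)          # work on a copy, not on the committed state
    for dest, value in predicted_path:
        tentative[dest] = value          # tentative result, not yet visible

    if prediction_correct:
        return tentative                 # the instructions commit
    return dict(registers)               # all effects are removed


state = {"R1": 5, "R2": 7}
path = [("R1", 10), ("R3", 1)]

committed = run_speculatively(state, path, prediction_correct=True)
squashed = run_speculatively(state, path, prediction_correct=False)
```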

                             DATA DEPENDENCIES
• All instructions in the window of execution may begin execution, subject to data
dependence (and resource) constraints.
• Three types of data dependencies can be identified:
1. True data dependency
2. Output dependency (artificial dependencies)
3. Antidependency (artificial dependencies)
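
The three kinds of dependency can be told apart by comparing the registers that two instructions read and write. The sketch below assumes a hypothetical instruction record with one destination and a tuple of source registers; the sample instructions are illustrative, with the register names chosen to match those used in this paper.

```python
from collections import namedtuple

# Hypothetical instruction record: operation, destination register, source registers.
Instr = namedtuple("Instr", ["op", "dest", "srcs"])


def dependencies(first, second):
    """Return the kinds of dependency that `second` has on the earlier `first`."""
    kinds = set()
    if first.dest in second.srcs:
        kinds.add("true")      # second reads what first writes
    if first.dest == second.dest:
        kinds.add("output")    # both write the same register
    if second.dest in first.srcs:
        kinds.add("anti")      # second overwrites what first still has to read
    return kinds


mul = Instr("MUL", "R4", ("R3", "R1"))        # R4 <- R3 * R1
add_true = Instr("ADD", "R2", ("R4", "R5"))   # reads R4
add_output = Instr("ADD", "R4", ("R2", "R5")) # also writes R4
add_anti = Instr("ADD", "R3", ("R2", "R5"))   # writes R3, which MUL reads

true_dep = dependencies(mul, add_true)
output_dep = dependencies(mul, add_output)
anti_dep = dependencies(mul, add_anti)
```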

The Nature of Output Dependency and Antidependency
• Output dependencies and antidependencies are not intrinsic features of the executed
program; they are not real data dependencies but storage conflicts.

• Output dependencies and antidependencies are only the consequence of the manner in
which the programmer or the compilers are using registers (or memory locations). They
are produced by the competition of several instructions for the same register.

• In the previous examples the conflicts are produced only because:
the output dependency: R4 is used by both instructions to store the result;
the antidependency: R3 is used by the second instruction to store the result;

• The examples could be written without dependencies by using additional registers:




In-Order Issue with In-Order Completion

      Instructions are issued in the exact order that would correspond to sequential
       execution; results are written (completion) in the same order.

           -   An instruction cannot be issued before the previous one has been issued;
           -   An instruction completes only after the previous one has completed.
           -   To guarantee in-order completion, instruction issuing stalls when there is a
               conflict and when the unit requires more than one cycle to execute

      The processor detects and handles (by stalling) true data dependencies and
       resource conflicts.
      As instructions are issued and completed in their strict order, the resulting
       parallelism is very much dependent on the way the program is written/compiled.

                  If I3 and I6 switch position, the pairs I6-I4 and I5-I3
                     can be executed in parallel.

• With superscalar processors we are interested in techniques which are not
compiler based but allow the hardware alone to detect instructions which can be
executed in parallel and to issue them.

Out-of-Order Issue with Out-of-Order Completion

      With in-order issue, no new instruction can be issued when the processor has
       detected a conflict and is stalled, until after the conflict has been resolved.
       The processor is not allowed to look ahead for further instructions, which could be
       executed in parallel with the current ones.

      Out-of-order issue tries to resolve the above problem. Taking the set of decoded
       instructions, the processor looks ahead and issues any instruction, in any order,
       as long as the program execution is correct.
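
The difference between the two issue policies can be illustrated with a small cycle-counting model. The sketch below makes several simplifying assumptions that are not from the text: every instruction takes one cycle, only true data dependencies are considered, there are no resource conflicts beyond the issue width, and the whole program fits in the window of execution. The program itself is a hypothetical four-instruction example.

```python
def true_dependencies(program):
    """For each instruction, find the earlier instructions whose results it reads."""
    last_writer = {}
    deps = []
    for index, (dest, srcs) in enumerate(program):
        deps.append({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = index
    return deps


def count_cycles(program, width, in_order):
    """Count the cycles needed to issue `program`, a list of (dest, srcs) tuples."""
    deps = true_dependencies(program)
    done = set()                      # instructions issued in earlier cycles
    cycles = 0
    while len(done) < len(program):
        issued = []
        for index in range(len(program)):
            if index in done:
                continue
            ready = deps[index] <= done   # all producers finished earlier
            if ready and len(issued) < width:
                issued.append(index)
            elif in_order:
                break                 # a stalled instruction blocks those behind it
        done.update(issued)
        cycles += 1
    return cycles


program = [
    ("R1", ("R0",)),   # I1
    ("R2", ("R1",)),   # I2 needs the result of I1
    ("R3", ("R0",)),   # I3 is independent of I1 and I2
    ("R4", ("R3",)),   # I4 needs the result of I3
]

in_order_cycles = count_cycles(program, width=2, in_order=True)
out_of_order_cycles = count_cycles(program, width=2, in_order=False)
```

In this example the in-order machine stalls on I2 in the first cycle and cannot reach I3, while the out-of-order machine looks ahead and issues I1 together with I3.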

                            REGISTER RENAMING

      Output dependencies and antidependencies can be treated similarly to true data
       dependencies as normal conflicts. Such conflicts are solved by delaying the
       execution of a certain instruction until it can be executed.
      Parallelism could be improved by eliminating output dependencies and
       antidependencies, which are not real data dependencies.
      Output dependencies and antidependencies can be eliminated by automatically
       allocating new registers to values, when such a dependency has been detected.
       This technique is called register renaming.
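
A minimal sketch of register renaming is given below. It assumes a hypothetical instruction format (operation, destination, sources) and a pool of free registers that the program itself does not use; real processors rename onto a separate set of physical registers in hardware. A new register is allocated whenever the destination has already been read or written by an earlier instruction, and later reads are redirected to the new name.

```python
def rename(program, free_registers):
    """Remove output dependencies and antidependencies by renaming destinations.

    program:        list of (op, dest, srcs) tuples in program order
    free_registers: register names assumed to be unused by the program
    """
    free = list(free_registers)
    mapping = {}     # original register name -> its current renamed register
    used = set()     # registers read or written by earlier (renamed) instructions
    renamed = []
    for op, dest, srcs in program:
        # Sources must read the most recent version of each register.
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        if dest in used:
            # Storage conflict detected: give this value a fresh register.
            new_dest = free.pop(0)
            mapping[dest] = new_dest
        else:
            new_dest = dest
        used.update(new_srcs)
        used.add(new_dest)
        renamed.append((op, new_dest, new_srcs))
    return renamed


# Output dependency: both instructions write R4; a later SUB reads the new R4.
output_case = rename(
    [("MUL", "R4", ("R3", "R1")),
     ("ADD", "R4", ("R2", "R5")),
     ("SUB", "R8", ("R4", "R1"))],
    free_registers=["R6", "R7"],
)

# Antidependency: the second instruction writes R3, which the first one reads.
anti_case = rename(
    [("MUL", "R4", ("R3", "R1")),
     ("ADD", "R3", ("R2", "R5"))],
    free_registers=["R6", "R7"],
)
```

In both cases the ADD ends up writing R6 instead of the conflicting register, so only true data dependencies remain.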

The output dependency is eliminated by allocating, for example, R6 to the value R2+R5:


The same is true for the antidependency below:


                               FINAL COMMENTS

      The following main techniques are characteristic of superscalar processors:
           1. Additional pipelined units which are working in parallel;
           2. Out-of-order issue and out-of-order completion;
           3. Register renaming.
      All of the above techniques are aimed at enhancing performance.
      Experiments have shown:
           - without the other techniques, only adding additional units is not efficient;
           - out-of-order issue is extremely important; it allows the processor to look
               ahead for independent instructions;
           - register renaming can improve performance by more than 30%; in this
               case performance is limited only by true dependencies;
           - it is important to provide a fetching/decoding capacity so that the window
               of execution is sufficiently large.


Available performance improvement from superscalar techniques is limited by two key
factors:

   1. The degree of intrinsic parallelism in the instruction stream, i.e. limited amount of
      instruction-level parallelism, and
   2. The complexity and time cost of the dispatcher and associated dependency
      checking logic.

Existing binary executable programs have varying degrees of intrinsic parallelism. In
some cases instructions are not dependent on each other and can be executed
simultaneously. In other cases they are inter-dependent: one instruction impacts either
resources or results of the other. The instructions a = b + c; d = e + f can be run in parallel
because none of the results depend on other calculations. However, the instructions a = b
+ c; b = e + f might not be runnable in parallel, depending on the order in which the
instructions complete while they move through the units.
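
The two statement pairs above can be compared by executing each pair in both possible orders. The sketch below is an illustration under assumed starting values; the helper name `run` and the numbers are hypothetical. If the final state depends on the order, the pair cannot be freely executed in parallel.

```python
def run(statements, variables):
    """Execute statements of the form target = x + y in the given order."""
    env = dict(variables)
    for target, x, y in statements:
        env[target] = env[x] + env[y]
    return env


start = {"b": 1, "c": 2, "e": 10, "f": 20}

independent = [("a", "b", "c"), ("d", "e", "f")]   # a = b + c; d = e + f
conflicting = [("a", "b", "c"), ("b", "e", "f")]   # a = b + c; b = e + f

# Independent pair: either order gives the same final state.
same_result = run(independent, start) == run(independent[::-1], start)

# Conflicting pair: if b = e + f completes first, a is computed from the new b.
program_order = run(conflicting, start)
reversed_order = run(conflicting[::-1], start)
```

In program order a is computed from the old value of b, while in the reversed order it is computed from the new one, which is exactly the hazard the hardware has to detect.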

                             SOME ARCHITECTURES
PowerPC 604
    six independent execution units:
         o Branch execution unit
         o Load/Store unit
         o 3 Integer units
         o Floating-point unit
    in-order issue
    register renaming

PowerPC 620
    provides, in addition to the features of the 604, out-of-order issue

Pentium
    three independent execution units:
         o 2 Integer units
         o Floating-point unit
    in-order issue

Pentium II
    provides, in addition to the features of the Pentium, out-of-order issue
    five instructions can be issued in one cycle

      Computer System Architecture, by M. Morris Mano

