					                           Computer Architecture
                                                        Slide Sets


                                                WS 2011/2012

                                     Prof. Dr. Uwe Brinkschulte
                                     Prof. Dr. Klaus Waldschmidt

Part 9
Instruction Level Parallelism (ILP) -
 Concurrency




 Computer Architecture – Part 9 – page 1 of 84 – Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt   Hier wird Wissen Wirklichkeit
                   Concurrency

Classical pipelining allows the completion of at most one instruction per
clock cycle (scalar execution).
A concurrent execution of several instructions in one clock cycle requires
the availability of several independent functional units.
These functional units are more or less heterogeneous (that means, they
are designed and optimized for different functions).


Two major concepts of concurrency exist at the ILP level:
- Superscalar concurrency
- VLIW concurrency
These concepts can also be found in combination.


                  Concurrency - superscalar

The superscalar technique operates on a conventional sequential
instruction stream


The concurrent instruction issue is performed completely during runtime
by hardware.


This technique requires a lot of hardware resources.


It allows a very efficient dynamic issue of instructions at runtime.


On the downside, no long-running dependency analysis (as is possible,
e.g., in a compiler) can be performed.

                  Concurrency - superscalar


The superscalar technique is a pure microarchitecture technique, since it
is not visible at the architectural level (conventional sequential
instruction stream).


Thus, hardware structure (e.g. the number of parallel execution units)
can be changed without changing the architectural specifications
(e.g. ISA)


Superscalar execution is usually combined with pipelining (superscalar
pipeline).




                  Concurrency - VLIW

The VLIW technique (Very Long Instruction Word) operates on a parallel
instruction stream.


The concurrent instruction issue is organized statically with the support of
the compiler.


The consequence is a lower amount of hardware resources.


Extensive compiler optimizations are possible to exploit parallelism.


On the downside, dynamic runtime effects cannot be taken into account
(e.g. branch prediction is difficult in VLIW).


                  Concurrency - VLIW


VLIW is an architectural technique, since the parallel instruction stream is
visible at the architectural level.


Therefore, a change in e.g. the level of parallelism leads to a change in the
architectural specifications


VLIW is usually combined with pipelining


VLIW can also be combined with superscalar concepts, as is done, e.g., in
EPIC (Explicitly Parallel Instruction Computing, Intel Itanium).




        Degree of parallelism in ILP


The main question in designing a concurrent computer architecture is:
   How much instruction level parallelism (ILP) exists in the code of an
   application?




This question has been analyzed very extensively for the compilation of
a sequential imperative programming language in a RISC instruction
set.
The result of all these analyses is:
   Programs include a fine grain parallelism degree of 5-7.



        Degree of parallelism in ILP


   Higher degrees in parallelism can be obtained only by code with
   long basic blocks (long instruction sequences without branches).


   Numerical applications in combination with loop unrolling are an
   application class with a higher ILP.


   A further application class is embedded system control.


   A general purpose computer architecture designed for an ILP degree
   higher than 5-7 can suffer from decreasing efficiency because many
   functional units are idle.


         Superscalar technique

Components of a superscalar processor

[Figure: block diagram with the following components:
 - I-cache with MMU, D-cache with MMU, Bus Interface Unit
 - Branch Unit with BHT, BTAC and RAS
 - Instruction Fetch Unit, Instruction Decode and Register Rename Unit
 - Instruction Buffer, Instruction Issue Unit, Reorder Buffer
 - Load/Store Unit(s), Floating-Point Unit(s), Integer Unit(s),
   Multimedia Unit(s), Retire Unit
 - Floating-Point Registers, General Purpose Registers,
   Multimedia Registers, Rename Registers]

            Superscalar technique
A superscalar pipeline:
•      Operates on a sequential instruction stream
•      Instructions are collected in an instruction window
•      Instruction issue to heterogeneous execution units is done by hardware
•      The microprocessor has several, mostly heterogeneous, functional units in the
       execution stage of the instruction pipeline
•      Instruction processing can be done out of sequential instruction stream order
•      Sequential instruction stream order is finally restored




[Figure: superscalar pipeline. Stages from left to right: Instruction Fetch,
 Instruction Decode and Rename, Instruction Window, Issue, Reservation
 Stations, Execution (several in parallel), Retire and Write Back]



        Superscalar technique


 In-order and out-of-order sections in a superscalar pipeline

[Figure: superscalar pipeline divided into sections.
 In-order:     Instruction Fetch, Instruction Decode and Rename
 Out-of-order: Issue from the Instruction Window, Reservation Stations, Execution
 In-order:     Retire and Write Back]

        Instruction fetch



• Loads several instructions (instruction block) from the nearest
  instruction memory (e.g. instruction cache) to an instruction buffer

• Usually, as many instructions are fetched per clock cycle as can be
  issued to the execution units (fetch bandwidth)

• Control flow conflicts are solved by branch prediction and branch
  target address cache

• The instruction buffer decouples instruction fetch from decode




        Instruction fetch


• Cache level Harvard architecture

• Self-modifying code cannot be implemented efficiently on today's
  superscalar processors

• The instruction cache (single port) is usually organized more simply
  than the data cache (multi port)

• In case of branches, instructions have to be fetched from different
  cache blocks

• Solutions to parallelize this: multi-channel caches, interleaved caches,
  multiple instruction fetch units, trace cache


        Decode


• Decodes multiple instructions per clock cycle

• Decode bandwidth usually equal to fetch bandwidth

• Fixed length instruction format simplifies decoding of several
  instructions per clock cycle

• Variable instruction length => multi-stage decoding
       • first stage: determine instruction boundaries
       • second stage: decode instructions and create one or more
         microinstructions

• Complex CISC instructions are split into simpler RISC instructions

        Register rename

• Goal of register renaming: remove false dependencies (output
  dependency, anti dependency)
• Renaming can be done:
       • statically by the compiler
       • dynamically by hardware
• Dynamic register renaming:
       • architectural registers are mapped to physical registers
       • each destination register specified in the instruction is mapped to a
         free physical register
       • subsequent instructions that use the same architectural register as
         a source register receive the most recently assigned physical
         register as input operand

       => false dependencies between register operands are removed
        Register rename

Two possible implementations:

• two different register sets are present
       • architectural registers store the "valid" values
       • rename buffer registers store temporary results
       • on renaming, architectural registers are assigned to buffer registers

• only one register set of so-called physical registers is present
       • these store temporary and valid values
       • architectural registers are mapped to physical registers
       • architectural registers themselves are physically non-existent
       • a mapping table defines which physical register currently operates as
         which architectural register for a given instruction
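
Dynamic renaming with a mapping table, as in the second implementation above,
can be sketched as follows. This is a minimal illustration only: the function
name `rename`, the physical register names p0, p1, ... and the unlimited supply
of free physical registers are assumptions, not part of the slides.

```python
from itertools import count


def rename(instructions, arch_regs):
    """Rename (dst, src1, src2) triples of architectural register names.

    Assumption: an unlimited supply of physical registers p0, p1, ...
    (a real processor has a finite free list and stalls when it is empty).
    """
    fresh = (f"p{i}" for i in count())
    # Initially every architectural register is mapped to its own physical register.
    mapping = {reg: next(fresh) for reg in arch_regs}
    renamed = []
    for dst, src1, src2 in instructions:
        # Sources read the most recently assigned physical register.
        phys1, phys2 = mapping[src1], mapping[src2]
        # The destination is mapped to a free physical register.
        mapping[dst] = next(fresh)
        renamed.append((mapping[dst], phys1, phys2))
    return renamed, mapping


# w = a - b; x = c + w; c = d - e; z = e + c
program = [("w", "a", "b"), ("x", "c", "w"), ("c", "d", "e"), ("z", "e", "c")]
renamed, final_map = rename(program, ["a", "b", "c", "d", "e", "w", "x", "z"])
for instruction in renamed:
    print(instruction)
```

The second instruction still reads the physical register that held the old c,
while the third instruction writes a different physical register, so the
WAR dependency on c no longer restricts the execution order.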
         Register rename
             Possible implementation:

[Figure: register rename logic.
 Inputs:  logical destination registers, logical source registers
 Blocks:  mapping table, dependency check, multiplexer
 Outputs: physical destination registers, physical source registers]

    Mapping has to be done for multiple instructions simultaneously

        Instruction window


• Decoded instructions are written to the instruction window
• The instruction window decouples fetch/decode from execution
• The instructions in the instruction window are
       • free of control flow dependencies due to branch prediction
       • free of false dependencies due to register renaming
• True dependencies and resource dependencies remain
• In each clock cycle, instruction issue checks which instructions in the
  instruction window can be issued to the execution units
• These are issued up to the maximum issue bandwidth (number of
  execution units)
• The original instruction sequence is stored in the reorder buffer


        Instruction window and issue
        terminology

• issue means the assignment of instructions to execution units or to
  preceding reservation stations, if present (see e.g. Tomasulo algorithm)
• if reservation stations are present, the assignment of instructions from
  reservation stations to the execution units is called dispatch
• the instruction issue policy describes the protocol used to select
  instructions for issuing
• depending on the processor, instructions can be issued in-order or
  out-of-order
• the lookahead capability determines how many instructions in the
  instruction window can be inspected to find the next issuable instructions
• the issuing logic that determines executable instructions is often called
  the scheduler



         In-order versus out-of-order issue

Example:
 I1     w = a - b
 I2     x = c + w          (RAW on w: depends on I1)
 I3     y = d - e
 I4     z = e + y          (RAW on y: depends on I3)

In-order issue:
 clock n:   I1
 clock n+1: I2, I3
 clock n+2: I4

Out-of-order issue:
 clock n:   I1, I3
 clock n+1: I2, I4


• Using in-order issue, the scheduler has to wait after I1 (RAW), then I2 and
  I3 can be issued in parallel (no dependency), finally I4 can be issued (RAW)

• Using out-of-order issue, the scheduler can issue I1 and I3 in parallel (no
  dependency), followed by I2 and I4 => one clock cycle is saved
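
The two issue policies can be compared with a small simulation. This is a
sketch under simplifying assumptions that are not stated on the slides: every
instruction has a latency of one clock cycle, there are two identical
execution units, and only RAW dependencies are checked. The names
`raw_producers` and `schedule` are hypothetical.

```python
def raw_producers(program):
    """Map each instruction to the earlier instructions it truly depends on."""
    last_writer = {}
    deps = {}
    for name, dst, *srcs in program:
        deps[name] = {last_writer[s] for s in srcs if s in last_writer}
        last_writer[dst] = name
    return deps


def schedule(program, width=2, in_order=True):
    """Return the list of issue groups, one group per clock cycle."""
    deps = raw_producers(program)
    finished = set()
    pending = [name for name, *_ in program]
    cycles = []
    while pending:
        group = []
        for name in pending:
            if len(group) == width:
                break
            if deps[name] <= finished:
                group.append(name)
            elif in_order:
                break  # in-order issue may not bypass a stalled instruction
        cycles.append(group)
        finished.update(group)
        pending = [n for n in pending if n not in group]
    return cycles


program = [
    ("I1", "w", "a", "b"),
    ("I2", "x", "c", "w"),
    ("I3", "y", "d", "e"),
    ("I4", "z", "e", "y"),
]
in_order_cycles = schedule(program, in_order=True)
out_of_order_cycles = schedule(program, in_order=False)
print(in_order_cycles)
print(out_of_order_cycles)
```

With these assumptions the in-order schedule needs three cycles and the
out-of-order schedule two, matching the example above.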

         False dependencies and
         out-of-order issue
Example:
 I1     w = a - b
 I2     x = c + w          (RAW on w with I1; I2 uses the old c)
 I3     c = d - e          (WAR on c with I2)
 I4     z = e + c          (RAW on c with I3; I4 uses the new c)

Out-of-order issue:
 I1 w = a - b,  I3 c = d - e
 I2 x = c + w,  I4 z = e + c
 I2 and I4 both use the new c => result differs from sequential execution!

Out-of-order issue with register rename:
 I1 w = a - b,   I3 c2 = d - e
 I2 x = c1 + w,  I4 z = e + c2
 I2 uses the old c, I4 uses the new c => result identical to sequential execution!

• Out-of-order issue makes false dependencies (WAR, WAW) critical
• Register renaming solves these issues
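
The effect can be reproduced numerically. The following sketch assumes that
all instructions of one issue group read the register state from before the
clock cycle and write their results at its end; the initial register values
and the helper name `run` are arbitrary choices for illustration.

```python
import operator


def run(cycles, regs):
    """Execute issue groups cycle by cycle.

    Assumption: all instructions of a group read the register state from
    before the cycle and write their results at the end of the cycle.
    """
    regs = dict(regs)
    for group in cycles:
        results = {dst: op(regs[s1], regs[s2]) for dst, op, s1, s2 in group}
        regs.update(results)
    return regs


sub, add = operator.sub, operator.add
init = {"a": 9, "b": 4, "c": 10, "d": 7, "e": 2}

I1 = ("w", sub, "a", "b")
I2 = ("x", add, "c", "w")
I3 = ("c", sub, "d", "e")
I4 = ("z", add, "e", "c")

sequential = run([[I1], [I2], [I3], [I4]], init)
out_of_order = run([[I1, I3], [I2, I4]], init)

# With renaming: the old c is called c1, the new c written by I3 is c2.
init_renamed = {**init, "c1": init["c"]}
I2r = ("x", add, "c1", "w")
I3r = ("c2", sub, "d", "e")
I4r = ("z", add, "e", "c2")
renamed = run([[I1, I3r], [I2r, I4r]], init_renamed)

print(sequential["x"], out_of_order["x"], renamed["x"])
```

Without renaming, x is computed from the new c and differs from the sequential
result; with renaming, x and z match the sequential execution.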

          Scheduling techniques


  There are several possible techniques to determine and
  issue the next executable instructions, e.g.:

       • Associative memory
           (centralized solution)
       • Tomasulo algorithm
           (decentralized solution)
       • Scoreboard
           (centralized solution)




         Wake up with associative memory

 • The instructions waiting in the instruction window are marked by so
   called tags.
 • The tags of the produced results are compared with the tags of the
   operands of the waiting instructions.
 • For comparison, each window cell is equipped with comparators.
   All comparators are working in parallel.
 • This kind of a memory is called associative memory.
 • A hit of comparison is marked by a ready bit.
 • If the ready bits of an instruction are complete, the instruction is
   issued.
 • This solves the true dependencies
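
The wake-up mechanism described above can be modelled in a few lines. This is
a software sketch only: in hardware all comparisons happen in parallel, and
the class name `WindowEntry`, the function `wake_up` and the tag names are
assumptions made for illustration.

```python
class WindowEntry:
    """One instruction window cell with two operand tags and two ready bits."""

    def __init__(self, name, tag_left, tag_right,
                 ready_left=False, ready_right=False):
        self.name = name
        self.tags = [tag_left, tag_right]
        self.ready = [ready_left, ready_right]

    def snoop(self, result_tags):
        # Each operand tag is compared with all broadcast result tags;
        # a hit sets the corresponding ready bit.
        for i, tag in enumerate(self.tags):
            if tag in result_tags:
                self.ready[i] = True

    def is_ready(self):
        return all(self.ready)


def wake_up(window, result_tags):
    """Broadcast result tags and return the names of all ready instructions."""
    for entry in window:
        entry.snoop(result_tags)
    return [entry.name for entry in window if entry.is_ready()]


# I2 waits for the result tagged t_w, I4 for the result tagged t_y.
window = [
    WindowEntry("I2", "t_c", "t_w", ready_left=True),
    WindowEntry("I4", "t_e", "t_y", ready_left=True),
]
woken_first = wake_up(window, {"t_w"})
woken_second = wake_up(window, {"t_y"})
print(woken_first, woken_second)
```

In this sketch an entry stays in the window after it becomes ready, so the
second broadcast reports both instructions; removing issued entries is left
to the issue logic.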

         Wake up with associative memory

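[Figure: wake-up logic of the instruction window. The result tags
 tag1 ... tagIW are broadcast to all window entries inst0 ... instN-1.
 Each entry holds a left and a right operand tag (opd tagL, opd tagR) with
 comparators (=) for the broadcast tags; the comparator outputs are
 combined by OR gates and set the ready bits rdyL and rdyR.]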


         Priority based issuing of
         instructions woken up

 • If more instructions are ready for issuing than execution units are
   available (issue bandwidth), a priority selection logic is
   necessary

 • This selection logic determines for each execution unit the instruction
   to issue from the woken up instructions

 • Therefore, each execution unit needs such a selection unit

 • This solves the resource dependencies

 • The hardware complexity of the issue unit rises with the size of the
   instruction window and the number of execution units
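
A per-unit selection can be sketched as a simple priority encoder. The slides
do not fix the priority order; the sketch below assumes that a lower window
index means higher priority (e.g. an older instruction). The names
`select_for_unit` and `issue` and the unit types are hypothetical.

```python
def select_for_unit(requests):
    """Priority encoder: grant the first requesting entry (index 0 first)."""
    for index, request in enumerate(requests):
        if request:
            return index
    return None


def issue(window, units):
    """Select one woken-up instruction per execution unit.

    window: list of (name, unit_type, ready) in priority order.
    units:  list of unit types, one entry per execution unit.
    Returns a dict mapping unit index to the name of the granted instruction.
    """
    granted = {}
    taken = set()
    for unit_id, unit_type in enumerate(units):
        # Only ready instructions of the matching type that were not yet
        # granted to another unit raise a request.
        requests = [
            ready and wanted == unit_type and index not in taken
            for index, (_, wanted, ready) in enumerate(window)
        ]
        winner = select_for_unit(requests)
        if winner is not None:
            taken.add(winner)
            granted[unit_id] = window[winner][0]
    return granted


window = [
    ("I1", "int", True),
    ("I2", "int", True),
    ("I3", "fp", False),
    ("I4", "int", True),
]
one_int_unit = issue(window, ["int", "fp"])
two_int_units = issue(window, ["int", "int", "fp"])
print(one_int_unit, two_int_units)
```

With a single integer unit only I1 is granted although three integer
instructions are ready, which is exactly the resource conflict the selection
logic has to resolve.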

             Selection logic for a single
             execution unit
[Figure: selection logic built as a tree of arbiter cells on top of the issue
 window. Each arbiter cell receives the request signals req0 ... req3 and an
 enable input, and produces the grant signals grant0 ... grant3 and an anyreq
 output. A cell consists of an OR gate (anyreq) and a priority encoder
 (grants). The anyreq/enable signals connect each cell to the next tree
 level; the root cell connects the subtrees.]



              Tomasulo algorithm


• The best-known principle for instruction parallelism in superscalar
  processors is the Tomasulo algorithm.
• This algorithm was first implemented by R. Tomasulo in the IBM
  System/360 Model 91.
• The main assumption of the Tomasulo algorithm is that the semantics
  of a program remain unchanged if the data dependencies are preserved
  when the sequence of the instructions is modified.
• The Tomasulo algorithm is based on the dataflow principle.
• All waiting instructions in the instruction window can be ordered in a
  dataflow graph.
• As a consequence, all instructions in one level of the dataflow graph can
  be issued and executed in parallel, and all dependencies in the dataflow
  graph can be represented by pointers to the functional units.
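
Ordering instructions into levels of a dataflow graph can be illustrated as
follows. The sketch considers only true (RAW) dependencies and assumes that
false dependencies have already been removed; the function name
`dataflow_levels` and the example program are chosen for illustration.

```python
def dataflow_levels(program):
    """Group instructions into levels of the dataflow graph.

    program: list of (name, dst, src1, src2).
    An instruction is placed one level below its deepest producer.
    """
    last_writer = {}
    level = {}
    for name, dst, *srcs in program:
        producers = [last_writer[s] for s in srcs if s in last_writer]
        level[name] = 1 + max((level[p] for p in producers), default=-1)
        last_writer[dst] = name
    grouped = {}
    for name, lvl in level.items():
        grouped.setdefault(lvl, []).append(name)
    return [grouped[lvl] for lvl in sorted(grouped)]


program = [
    ("I1", "w", "a", "b"),
    ("I2", "x", "c", "w"),
    ("I3", "y", "d", "e"),
    ("I4", "z", "e", "y"),
]
levels = dataflow_levels(program)
print(levels)
```

All instructions within one inner list are independent of each other and
could be issued in the same clock cycle if enough functional units exist.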


              Tomasulo algorithm

• Therefore the functional units are equipped with additional registers,
  so-called reservation stations, which store these pointers or the operands
  themselves.
• Assigning operands and pointers to the reservation stations (issue)
  solves the resource dependencies
• As soon as all operands and pointers are available, the function is
  executed (dispatch)
• This solves the true data dependencies
• If all operands are available immediately, issue and dispatch can be
  done in the same clock cycle, so dispatch usually is not a pipeline stage
• Unlike with associative memory, resource dependencies are solved
  before true data dependencies
• For a better distinction of the reservation stations from the registers of
  the original register file, the registers of the register file are regarded as
  functional units with the identity operation.
            Dataflow graph of instructions
            in the instruction window




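[Figure: dataflow graph of the waiting instructions, arranged in Level 0,
 Level 1, Level 2, ...
 Implementation of the nodes: functional units with reservation stations;
 registers are represented as functional units with the identity operation.]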




        Simple microarchitecture for
        demonstrating Tomasulo algorithm


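[Figure: simple microarchitecture.
 Register unit: registers a, b, c, d, e, f, x, y, z, each represented as a
 functional unit with the identity operation (=).
 Execution units: sub, add, div, mul, each with reservation stations for its
 operands. An execution unit together with its reservation stations forms a
 functional unit.]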



        Simple microarchitecture for
        demonstrating Tomasulo algorithm


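Program sequence:
 I1     x = a / b
 I2     y = x + z          (RAW on x with I1)
 I3     z = c * d          (WAR on z with I2)
 I4     x = e - f          (WAW on x with I1)

[Figure: the microarchitecture while executing I1 - I4 in steps 1 to 3.
 Operand reservation stations of the execution units:
   sub: e, f     add: x, z     div: a, b     mul: c, d
 Result reservation stations of the registers:
   x: div (I1), then sub (I4)     y: add (I2)     z: mul (I3)]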
            Execution of the program sequence
            on the microarchitecture
First step:                   Instructions I1 - I4 and the available operands are
                              issued to the corresponding reservation stations.
                              The result reservation stations are reserved for
                              I1, I2 and I3.
                              The result reservation station for I4 cannot be
                              reserved because it is already occupied by the
                              result of I1.
Second step:                  Instructions I1 and I3 are dispatched because all
                              operands and the result space are available.
                              The result of I1 is transferred to the reservation
                              station where I2 is waiting.
                              Therefore, the result reservation station occupied
                              by I1 so far becomes free and is now reserved for I4.
Third step:                   Instructions I2 and I4 are dispatched now and the
                              results are stored.
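
The three steps can be replayed with a strongly simplified model. The sketch
below is not the complete Tomasulo algorithm: it assumes one result
reservation per register, a latency of one step per instruction and enough
operand reservation stations for all instructions. The function name
`tomasulo`, the data layout and the initial register values are assumptions.

```python
import operator

OPS = {"/": operator.truediv, "+": operator.add,
       "*": operator.mul, "-": operator.sub}


def tomasulo(program, regs):
    """Return the dispatch groups (steps 2, 3, ...) and the final registers.

    program: list of (name, dst, op, src1, src2).
    """
    regs = dict(regs)
    slot = {}      # register -> instruction holding its result reservation
    writer = {}    # register -> latest instruction that will write it
    stations = {}  # name -> [dst, op, operand1, operand2]
    order = []
    # Step 1 (issue): copy available operands, otherwise store a pointer to
    # the producing instruction, and try to reserve the result slot.
    for name, dst, op, s1, s2 in program:
        operands = [("ptr", writer[s]) if s in writer else ("val", regs[s])
                    for s in (s1, s2)]
        stations[name] = [dst, op] + operands
        order.append(name)
        slot.setdefault(dst, name)  # stays with the first requester if occupied
        writer[dst] = name
    steps = []
    while stations:
        # Dispatch: all operands present and result slot reserved.
        ready = [n for n in order if n in stations
                 and all(kind == "val" for kind, _ in stations[n][2:])
                 and slot.get(stations[n][0]) == n]
        if not ready:
            raise RuntimeError("deadlock in this simplified model")
        for n in ready:
            dst, op, (_, v1), (_, v2) = stations.pop(n)
            result = OPS[op](v1, v2)
            regs[dst] = result
            del slot[dst]
            # Forward the result to all stations waiting for this producer.
            for station in stations.values():
                for i in (2, 3):
                    if station[i] == ("ptr", n):
                        station[i] = ("val", result)
        # Freed result slots are reserved for the oldest waiting instruction.
        for n in order:
            if n in stations:
                slot.setdefault(stations[n][0], n)
        steps.append(ready)
    return steps, regs


program = [
    ("I1", "x", "/", "a", "b"),
    ("I2", "y", "+", "x", "z"),
    ("I3", "z", "*", "c", "d"),
    ("I4", "x", "-", "e", "f"),
]
init = {"a": 8, "b": 2, "c": 3, "d": 4, "e": 10, "f": 1, "x": 0, "y": 0, "z": 5}
steps, final = tomasulo(program, init)
print(steps, final["x"], final["y"], final["z"])
```

I2 copies the old value of z at issue time, so the later write of I3 cannot
disturb it, and I4 has to wait for the result slot of x until I1 has finished.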
           Scoreboard (Thornton algorithm)


 • The true data dependencies in a superscalar processor can also be
   resolved solely via the register file.

 • This is the basic idea of scoreboarding, and therefore the principle
   is very simple.

 • It is a central method within a microarchitecture for controlling the
   instruction sequence according to the data dependencies.

 • Registers that are in use are marked by a scoreboard bit. A register
   is marked as in use if it is the destination of an instruction.

 • Only free registers are available for read or write operations. This is
   a very simple solution for solving data dependencies.


            Scoreboard (Thornton algorithm)


• The scoreboard bit is set at the instruction issue point of the pipeline.
• It is set when a destination register is requested and is reset after the
  write back phase.
• Each instruction is checked for a conflict between its source
  operands and an “in use” destination register.
• In case of a conflict, the instruction is delayed until the
  scoreboard bit is reset. This simple method resolves RAW conflicts.


      Register file            R0     R1     R2     .....     Ri     .....     Rn
      Scoreboard bit vector     0      0      1     .....      1     .....      0


      The length of the scoreboard bit vector is the same as the length of the register file.
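
The scoreboard rules above can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the logic of a particular processor: the class name `Scoreboard` and its methods are hypothetical, and an instruction is reduced to one destination register and a tuple of source registers.

```python
class Scoreboard:
    """One 'in use' bit per register (hypothetical helper class)."""

    def __init__(self, num_registers):
        self.bits = [0] * num_registers

    def can_issue(self, dest, sources):
        # Only free registers may be read or written.
        return all(self.bits[r] == 0 for r in (dest, *sources))

    def issue(self, dest, sources):
        if not self.can_issue(dest, sources):
            return False          # conflict: the instruction is delayed
        self.bits[dest] = 1       # bit is set at instruction issue
        return True

    def write_back(self, dest):
        self.bits[dest] = 0       # bit is reset after the write back phase


sb = Scoreboard(8)
issued_first = sb.issue(dest=1, sources=(2, 3))     # R1 = R2 op R3
stalled = not sb.issue(dest=4, sources=(1, 5))      # reads R1: RAW conflict
sb.write_back(1)                                    # R1 becomes free again
issued_after_wb = sb.issue(dest=4, sources=(1, 5))  # now the issue succeeds
```

The second instruction is delayed only as long as the bit of R1 is set; no operand values are tracked, which is what keeps the method simple.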

           State graph of the scoreboard
           method




   State 0: register Ri is free (unused)
   State 1: register Ri is occupied (in use)

   0 -> 1: register Ri is the address of the destination operand
   1 -> 0: write back to Ri (destination operand is in Ri)
   1 -> 1: write back to Ri is finished and Ri is the address of
           another destination operand




        Scoreboard logic



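[Figure: scoreboard logic. The instruction word consists of the fields OPC, R, S1 and S2. The scoreboard logic sets and resets scoreboard (SC) bit n for the registers 0 to 31. The source operands (S1) and (S2) are read in the RF READ stage and processed in the EX stage according to OPC; the result (R) is written in the RF WRITE stage (address R, operand).]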




         Instruction window organization


Centralized window, single stage                                        Decentralized windows, single stage




                    Centralized or decentralized windows, two stages




        Execution


• Out-of-order execution of the instructions in mostly parallel execution
  units
• Results are stored in the rename buffers or physical registers
• Execution units can be
        • single cycle units (execution takes a single clock cycle),
          latency = throughput = 1
        • multiple cycle units (execution takes multiple clock cycles),
          latency > 1
                • with pipelining (e.g. arithmetic pipeline), throughput = 1
                • without pipelining (e.g. load-/store-unit - possible cache misses),
                  throughput = 1 / latency
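
As a small worked example of the latency and throughput figures above, the following sketch counts the clock cycles one execution unit needs for a series of independent operations. The function name `cycles_needed` and the chosen numbers are assumptions for illustration only.

```python
def cycles_needed(n_ops, latency, pipelined):
    """Clock cycles until n_ops independent operations have left one unit."""
    if n_ops == 0:
        return 0
    if pipelined:
        # A new operation can start every cycle (throughput = 1).
        return latency + (n_ops - 1)
    # The unit is blocked for the full latency (throughput = 1 / latency).
    return latency * n_ops


single_cycle = cycles_needed(10, latency=1, pipelined=True)    # 10 cycles
pipelined = cycles_needed(10, latency=4, pipelined=True)       # 13 cycles
unpipelined = cycles_needed(10, latency=4, pipelined=False)    # 40 cycles
```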



         Execution

Load-Store-Units
• Load and store instructions can often take different paths inside the
  load-store unit (wait buffer for stores)
• Store instructions need the address (address calculation) and the value
  to store, while load instructions only need the address
• Therefore, load instructions are often moved ahead of store instructions as
  long as they do not access the same address

[Figure: load-store unit. Loads need only the address; stores need the address and the register content and pass through a write buffer.]
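
The reordering rule for loads can be illustrated as follows. This is only a sketch: the function `reorder_loads` is hypothetical, memory operations are reduced to a kind and an address, and real load-store units compare addresses in hardware queues rather than in a list.

```python
def reorder_loads(ops):
    """ops: list of ("load" | "store", address) in program order.

    Returns a new order in which each load is moved ahead of the directly
    preceding stores, unless one of them uses the same address.
    """
    result = []
    for kind, addr in ops:
        pos = len(result)
        if kind == "load":
            # Walk backwards over stores to other addresses.
            while (pos > 0 and result[pos - 1][0] == "store"
                   and result[pos - 1][1] != addr):
                pos -= 1
        result.insert(pos, (kind, addr))
    return result


program = [("store", 0x10), ("store", 0x20), ("load", 0x30), ("load", 0x20)]
reordered = reorder_loads(program)
```

In this example the load from 0x30 passes both stores, while the load from 0x20 has to stay behind the store to the same address.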




         Execution


Load-Store-Units


• A load instruction is completed as soon as the value to load is written to
  a buffer register
• A store instruction is completed as soon as the value is written to the
  cache
• A completed store cannot be undone!
• Therefore, store instructions on a speculative path (branch prediction) cannot be
  completed before the speculation is confirmed
• Speculative load instructions are not a problem
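
One way to picture the handling of speculative stores is a buffer that holds them back until the branch is resolved. The sketch below is a simplified model under assumptions: the class `StoreBuffer` is hypothetical, the cache is a plain dictionary, and a misprediction discards every buffered speculative store.

```python
class StoreBuffer:
    """Holds stores until they may be written to the cache (here: a dict)."""

    def __init__(self):
        self.cache = {}
        self.pending = []   # (address, value, speculative) in program order

    def store(self, address, value, speculative):
        self.pending.append((address, value, speculative))
        self.drain()

    def drain(self):
        # Write stores to the cache in order; stop at the first speculative one.
        while self.pending and not self.pending[0][2]:
            address, value, _ = self.pending.pop(0)
            self.cache[address] = value

    def resolve(self, prediction_correct):
        if prediction_correct:
            self.pending = [(a, v, False) for a, v, _ in self.pending]
        else:
            # Simplification: all buffered speculative stores are dropped.
            self.pending = [p for p in self.pending if not p[2]]
        self.drain()


buf = StoreBuffer()
buf.store(0x100, 1, speculative=False)   # written to the cache at once
buf.store(0x104, 2, speculative=True)    # held back in the buffer
cache_before = dict(buf.cache)
buf.resolve(prediction_correct=False)    # misprediction: store is discarded
cache_after_mispredict = dict(buf.cache)
```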




        Execution


Multimedia Units


• perform SIMD operations (subword parallelism)
• the same operation is performed on a part of the register set
• graphic-oriented multimedia operations
      • arithmetic or logic operations on packed datatypes like e.g. eight 8-bit,
        four 16-bit or two 32-bit partial words
      • pack and unpack operations, mask, conversion and compare
        operations
• video-oriented multimedia operations
      • two to four simultaneous 32-bit floating-point operations
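
Subword parallelism on packed data types can be emulated in software. The sketch below adds four 16-bit partial words that are packed into one 64-bit word; the helper names `pack16`, `unpack16` and `packed_add16` are hypothetical, and wrap-around arithmetic is assumed (real multimedia units often also offer saturating variants).

```python
def pack16(values):
    """Pack four 16-bit values, values[0] in the lowest lane."""
    word = 0
    for lane, v in enumerate(values):
        word |= (v & 0xFFFF) << (16 * lane)
    return word


def unpack16(word):
    """Split a 64-bit word into its four 16-bit lanes."""
    return [(word >> (16 * lane)) & 0xFFFF for lane in range(4)]


def packed_add16(x, y):
    """Add four 16-bit lanes of two 64-bit words (wrap-around)."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        a = (x >> shift) & 0xFFFF
        b = (y >> shift) & 0xFFFF
        result |= ((a + b) & 0xFFFF) << shift   # no carry into the next lane
    return result


total = unpack16(packed_add16(pack16([1, 2, 3, 0xFFFF]), pack16([10, 20, 30, 1])))
```

The last lane overflows and wraps to 0 without disturbing its neighbours, which is the property that distinguishes a packed addition from an ordinary 64-bit addition.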


         Retire and write back




Retire and write back is responsible for:

• committing or discarding the completed results from execution

• rolling back wrong speculation paths from branch prediction

• restoring the original sequential instruction order

• allowing precise interrupts or exceptions




         Some terminology




Completion of an instruction:

• The execution unit has finished the execution of the instruction

• The results are written to temporary buffer registers and are available as
  operands for data-dependent instructions

• Completion is done out of order

• During completion, the position of the instruction in the original instruction
  sequence and the current completion state is stored in a reorder buffer

• The completion state might indicate a preceding interrupt/exception or a
  pending speculation for this instruction


         Some terminology

Commitment of an instruction:

• Commitment is done in the original instruction order (in-order)

• The result of an instruction can be committed if
       • execution is completed
       • the results of all instructions preceding this instruction in the original
         instruction order are committed or will be committed within the same
         clock cycle
       • no interrupt/exception occurred before or during execution
       • the execution no longer depends on any speculation

• During commitment the results are written permanently to the architectural
  registers

• Committed instructions are removed from the reorder buffer
         Some terminology


Removal of an instruction:

• The instruction is removed from the reorder buffer without committing it

• All results of the instruction are discarded

• This is done e.g. in case of misspeculation or a preceding
  interrupt/exception



Retirement of an instruction:

• The instruction is removed from the reorder buffer with or without
  committing it (commitment or removal)



         Interrupts and exceptions

On an interrupt or exception, the regular program flow is interrupted and an
interrupt service routine (exception handler) is called
Classes of interrupts/exceptions:
Aborts: are the most severe class and lead to processor shutdown
        Reasons: hardware failures like defective memory cells
Traps: are fatal and normally lead to program termination
       Reasons: arithmetic errors (overflow, underflow, division by 0),
                  privilege violation, invalid opcode, …
Faults: cause the repetition of the last executed instruction after handling
        Reasons: virtual memory management errors like page faults
External interrupts: lead to interrupt handling
                     Reasons: interrupts from external devices to indicate
                                the presence of data or timer events
Software interrupts: lead to interrupt handling
                     Reasons: interrupt instruction in program
        Interrupts and exceptions


 Usually, exceptions like aborts, traps or faults have higher priorities than other
 interrupts

[Figure: program flow for interrupt/exception handling. An interrupt request suspends the main program; the status is saved and the interrupt mask is set; the interrupt routine runs; on return from interrupt the status is restored.]

        Precise interrupts and exceptions


 An interrupt or exception is called precise if the processor state
 saved at the start of the interrupt routine is identical to the state that a
 sequential in-order execution on a von Neumann architecture would produce

 For out-of-order execution on a superscalar processor this means:

 • all instructions preceding the interrupt-causing instruction are
   committed and therefore have modified the processor state

 • all instructions succeeding the interrupt-causing instruction are
   removed and therefore have not influenced the processor state

 • depending on the interrupt-causing instruction, it is either committed
   or removed
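
The three rules can be written down as a small function. This is only an illustration: the name `precise_interrupt` and the flag `commit_faulting` are hypothetical, and instructions are represented by their names in program order.

```python
def precise_interrupt(instructions, faulting_index, commit_faulting):
    """Split a program-order instruction list at the interrupt-causing one."""
    committed = list(instructions[:faulting_index])       # all predecessors
    removed = list(instructions[faulting_index + 1:])     # all successors
    if commit_faulting:
        committed.append(instructions[faulting_index])
    else:
        removed.insert(0, instructions[faulting_index])
    return committed, removed


# Assumed example: a page fault on I3. The instruction is removed so that it
# can be repeated after the fault has been handled.
committed, removed = precise_interrupt(
    ["I1", "I2", "I3", "I4", "I5"], faulting_index=2, commit_faulting=False
)
```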


        Reorder buffer


The reorder buffer stores the sequential order of the issued
instructions and therefore allows result serialization during retirement
The reorder bandwidth is usually identical to the issue bandwidth
Possible reorder buffer organization:
• contains instruction states only
• contains instruction states and results (combination of reorder buffer
  and rename buffer register)
Alternate reorder techniques:
• checkpoint repair
• history buffer


        Reorder buffer


The reorder buffer can be implemented as a ring buffer




Consecutive completed and non-speculative instructions at the head
of the ring buffer can be committed

[Figure: ring buffer holding the instructions I1 to I6 between head and tail; the entries at the head that can be committed are marked. Legend of slot states:
   - instruction issued & result completed
   - instruction issued & result completed, based on speculation
   - instruction issued
   - empty slot]
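
A ring-buffer reorder buffer with in-order commitment might look like the following sketch. The class `ReorderBuffer` and its methods are hypothetical; entries store only a name and two state flags, and results are not modelled.

```python
class ReorderBuffer:
    """Ring buffer that commits only from the head (simplified model)."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0
        self.tail = 0
        self.count = 0

    def issue(self, name, speculative=False):
        if self.count == len(self.slots):
            raise RuntimeError("reorder buffer full")
        index = self.tail
        self.slots[index] = {"name": name, "completed": False,
                             "speculative": speculative}
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return index

    def complete(self, index):
        self.slots[index]["completed"] = True    # may happen out of order

    def confirm_speculation(self):
        for entry in self.slots:
            if entry is not None:
                entry["speculative"] = False

    def commit(self):
        committed = []
        while self.count > 0:
            entry = self.slots[self.head]
            if not entry["completed"] or entry["speculative"]:
                break                             # head is not ready yet
            committed.append(entry["name"])
            self.slots[self.head] = None
            self.head = (self.head + 1) % len(self.slots)
            self.count -= 1
        return committed


rob = ReorderBuffer(8)
i1, i2, i3 = rob.issue("I1"), rob.issue("I2"), rob.issue("I3")
i4 = rob.issue("I4", speculative=True)
rob.complete(i3)
rob.complete(i1)                 # completion is out of order
first = rob.commit()             # only I1: I2 is not completed yet
rob.complete(i2)
rob.complete(i4)
second = rob.commit()            # I2 and I3: I4 is still speculative
rob.confirm_speculation()
third = rob.commit()             # I4
```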




        Why serialization during
        commitment?

• Serialization is done to maintain the sequential von-Neumann
  principle on the architectural level
• Out-of-order commitment is not allowed on today's superscalar
  processors
• Single exception: moving load instructions ahead of store instructions
  is allowed on some processors
• From the outside, a superscalar processor looks like a simple von-
  Neumann computer
• This is

  – good for program verification

  – bad for parallel processing

          Very Long Instruction Word (VLIW)
          architecture

In contrast to the superscalar technique, which is a microarchitectural
technique, VLIW is an architectural technique


While in the superscalar technique the parallelism is exploited by
hardware, in VLIW this is done by software


The compiler bundles a fixed number of simple independent instructions,
which are stored in a very long instruction word


The processor executes all instructions of this very long instruction
word in parallel


          Basic principle of VLIW



[Figure: the compiler packs instructions into a very long instruction word (VLIW); each instruction of the word is executed by one functional unit (FU) of the CPU]

        Some important features of pure
        VLIW

• Sequential stream of long instruction words
• Length of an instruction word usually between 128 and 1024 bits
• Static scheduling of instructions by the compiler
   (parallelization at compile time)
• The number of instructions in one VLIW word is fixed
• Instructions in one VLIW word must be independent and contain their
  own opcodes and operands. All dependencies have to be resolved by the
  compiler. This restricts the density of VLIW code.
• If the full width of the very long instruction word cannot be exploited, it
  must be filled with NOOPs
• Only in-order issue is supported, but more than one instruction can be
  executed in one clock cycle, according to the width of the very long
  instruction word.
• The hardware complexity of the instruction window is very low.
  Scheduling at runtime is not necessary.
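
The effect of static scheduling and NOOP filling can be sketched with a deliberately simple packer. This is an assumption-laden illustration, not a real VLIW compiler: the function `pack_vliw` is hypothetical, it never reorders instructions, and it conservatively treats true, output and anti dependencies alike as conflicts.

```python
def pack_vliw(instructions, width):
    """instructions: list of (dest, sources) in program order.

    Greedy in-order packing: an instruction opens a new word if the current
    word is full or if it conflicts with a register used in that word.
    """
    words, current, written, read = [], [], set(), set()
    for dest, sources in instructions:
        conflict = (
            any(s in written for s in sources)   # true dependency (RAW)
            or dest in written                   # output dependency (WAW)
            or dest in read                      # anti dependency (WAR)
        )
        if conflict or len(current) == width:
            words.append(current + ["NOOP"] * (width - len(current)))
            current, written, read = [], set(), set()
        current.append((dest, sources))
        written.add(dest)
        read.update(sources)
    if current:
        words.append(current + ["NOOP"] * (width - len(current)))
    return words


program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1", "r5")),   # depends on r1
    ("r6", ("r7", "r8")),
    ("r9", ("r4", "r6")),   # depends on r4 and r6
]
words = pack_vliw(program, width=3)
noops = sum(word.count("NOOP") for word in words)
```

For this program the packer needs three words of width 3 and fills five of the nine slots with NOOPs, which shows how dependencies reduce the code density.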
        VLIW instruction vs. CISC and SIMD
        instruction


• Difference to a CISC instruction:

   A CISC instruction can encode several potentially sequential
   operations in one instruction, while a VLIW instruction contains independent
   parallel operations

• Difference to a SIMD instruction:

   SIMD instructions perform a single operation on multiple data
   elements, while VLIW instructions perform different operations on
   different data elements




         Example of a VLIW machine
         instruction + execution hardware

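[Figure: a VLIW machine instruction with the fields FP-ALU instruction, I-ALU instruction, LOAD/STORE and address. The fields control an FP-ALU, an integer ALU and the data memory, which are all connected to a multiport register file.]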




       Basic structure of a pure
       VLIW architecture
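[Figure: layers of a pure VLIW architecture, from top to bottom: register unit; interconnection unit (operands & results); function units (FU ... FU); control unit (CU); very long instruction word (I1, I2, I3, ..., In) taken from the instruction stream; memory unit]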


       Basic structure of a pure
       VLIW architecture

A VLIW processor contains a number of functional units, each of which can
execute a machine instruction in parallel with the others and synchronously
to the clock cycle.
A VLIW instruction packet contains as many instructions as there are
functional units
Ideally, the processor starts a VLIW instruction packet each clock
cycle
The instructions of this packet are then fetched, decoded, issued and
executed in parallel
All instructions of the packet must have the same execution time
Usually, pipelining is used for each instruction of the packet
=> n parallel pipelines in an n-way VLIW processor


         Problems with pure VLIW


• VLIW is a true architectural approach.
  It is not scalable without recompilation: a new architecture
  means a new VLIW format and therefore a new compilation
• VLIW suffers from branch instructions. Speculative branches
  cannot be handled by the hardware
• VLIW suffers from memory latencies. A cache miss leads to a
  stall of all subsequent pipeline stages
• VLIW cannot react to dynamic events. Again, a stall of all
  subsequent pipeline stages is the consequence


Pure VLIW has a strong 1 : 1 relation to the microarchitecture


           Code morphing in VLIW


Code morphing has been introduced to VLIW with the Transmeta Crusoe
processors

This is a hardware-software hybrid solution

A software interpreter transforms sequential machine code to VLIW
instructions at runtime

E.g., ordinary x86 code is "morphed" at runtime to VLIW instructions

By changing the morphing software, any other machine code can be
adapted to the Transmeta Crusoe processors

Decoupling of hardware and software is improved

Execution of legacy code is simplified


        Block diagram of VLIW with
        code morphing level
   Code morphing is done by software.

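[Figure: the compiler produces a sequential instruction stream (ISA level); the code morphing software translates it into a parallel VLIW instruction stream (VLIW-ISA level), which is executed by the functional units (FU) of the CPU]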

         Principle of Code Morphing


Translation of a "virtual" instruction stream to a "real" instruction stream

Code can be optimized during the translation process in several steps:

• The first translation is performed without optimization in the so-called
 lowest execution mode

• Furthermore, the virtual instructions are instrumented to prepare a profile
 of the timing behavior

• The prepared profile can initiate an optimization of the program path.
 The binary translation is started again and a revised real VLIW instruction
 stream is generated

• This procedure can be repeated several times, until an optimized VLIW
 code is available at a high level of execution mode.
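
The repeated translate, profile and re-translate cycle can be modelled very roughly as follows. Everything in this sketch is an assumption made for illustration: the class `CodeMorpher`, the threshold of three executions and the two optimization levels are hypothetical and do not describe the actual Transmeta software.

```python
class CodeMorpher:
    """Toy model: translate on demand, count executions, re-optimize hot code."""

    def __init__(self, hot_threshold=3, max_level=2):
        self.hot_threshold = hot_threshold
        self.max_level = max_level
        self.cache = {}      # block address -> [optimization level, run count]

    def run(self, block):
        if block not in self.cache:
            self.cache[block] = [0, 0]     # first translation, unoptimized
        entry = self.cache[block]
        entry[1] += 1                      # instrumentation: update the profile
        if entry[1] >= self.hot_threshold and entry[0] < self.max_level:
            entry[0] += 1                  # translate again with optimization
            entry[1] = 0
        return entry[0]                    # level of the code that is now cached


morpher = CodeMorpher()
levels = [morpher.run(0x4000) for _ in range(7)]
```

Running the same block seven times raises its level step by step, so frequently executed code ends up in the highest execution mode while rarely executed code stays unoptimized.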

       VLIW architecture with code morphing
       by Transmeta

Original goal of Transmeta:



- Fast CPUs with low power consumption on the basis of VLIW
  and CMOS

- Reduction of hardware complexity by an additional software shell.
  The Crusoe architecture consists of a VLIW hardware core and a
  software shell.

- Code morphing software translates x86 instructions into VLIW
  code at runtime.




         Basic block diagram
         of a Crusoe microarchitecture

A very long instruction word (VLIW) is called a "molecule".
• There are up to 4 atoms (instructions) in one molecule.
• The execution of molecules is in order.
• The issue of the x86 instructions is out-of-order. There exists a
  binary code translation.
• Within the Crusoe processor family, different instruction sets are used.

[Figure: a 128-bit molecule with the atoms FADD, ADD, LD and BRCC, which are passed to the floating-point unit, integer ALU #0, load/store unit and branch unit]



          Crusoe features


The processors are optimized for low power consumption.

Crusoe processors are VLIW architectures with additional code morphing software.
Code is translated “on demand” and the translation is stored in the cache.

On an instruction cache miss, new code is translated to VLIW code.
This code is then executed until the next cache miss

By separating the hardware from the application, a "virtual programming
environment" is created, which supports:

• regular VLIW code execution
• speculative load and store instructions
• prediction
• code instrumentation for optimization.

        EPIC (Explicitly Parallel Instruction
        Computing)


   • Improvement of VLIW by Intel and HP, leading to the IA-64 architecture
     for 64 bit server processors
   • Extended 3-instruction format, similar to a 3-way VLIW
   • Goal of EPIC: combine the simplicity and high clock frequency of a
     VLIW processor with the advantages of dynamic scheduling
   • The EPIC format allows the compiler to inform the processor
     directly about instruction level parallelism
   • Therefore, an EPIC processor ideally does not have to check for data and
     control flow dependencies
   • This simplifies the microarchitecture compared to a superscalar
     processor while improving flexibility compared to a VLIW processor



        EPIC (Explicitly Parallel Instruction
        Computing)

[Figure: the compiler groups instructions into EPIC instruction bundles; stop markers show the boundaries of parallel execution. A dispersal window distributes the instructions to the functional units (FU) of the CPU.]

         EPIC (Explicitly Parallel Instruction
         Computing)

• An EPIC instruction bundle is 128 bits wide and consists of a compiler
  generated bundle of 3 IA-64 instructions and so-called template bits
• An IA-64 instruction is 41 bits wide and mainly consists of an opcode, a
  predicate field, two source register addresses and one destination
  register address
• 5 template bits carry information on instruction grouping
• Instruction parallelism is given by the template bits, not by NOOP padding
  as in VLIW. They define whether an instruction can be executed in
  parallel with the other instructions
• This refers to instructions within the same EPIC bundle and in the following
  EPIC bundles
• Therefore, instructions with data or control flow dependencies can be
  bundled, which improves flexibility compared to VLIW


        EPIC (Explicitly Parallel Instruction
        Computing)
[Figure: a stream of bundles i, i+1 and i+2, each holding three IA-64 instructions. Stop markers given by the template bits divide the stream into groups of instructions that can be executed in parallel; a group can end inside a bundle or extend into the following bundles.]

 e.g. in a bundle:

             add   r1 = r2  + r3
             sub   r4 = r11 - r2    ;; stop marker
             sub   r5 = r1  - r10   (dependency on r1)


        Format of a bundle in IA-64


   Format of a “bundle” (128 bit):

      Template    instruction    instruction    instruction
      5 bit       41 bit         41 bit         41 bit

   Template classifies instruction types and stop marker

   Example of a sequence of IA-64 bundles:

      Template    1. instruction    2. instruction    3. instruction
      00000       Memory            Integer           Integer
      00001       Memory            Integer           Integer ;;
      00010       Memory            Integer ;;        Integer
      …           …                 …                 …
      11100       Memory            FP                Branch
      11101       Memory            FP                Branch ;;
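
The bit budget of a bundle (5 + 3 x 41 = 128) can be checked with a small packing routine. This is a sketch: the function names are hypothetical, instructions are treated as opaque 41-bit numbers, and the template is assumed to occupy the five least significant bits of the bundle.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instructions into 128 bits."""
    assert 0 <= template < 1 << TEMPLATE_BITS and len(slots) == 3
    bundle = template
    for i, slot in enumerate(slots):
        assert 0 <= slot < 1 << SLOT_BITS
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Return the template and the three instruction slots of a bundle."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
             for i in range(3)]
    return template, slots


# Template 00001: Memory, Integer, Integer with a stop marker at the end.
bundle = pack_bundle(0b00001, [0x123, 0x456, (1 << 41) - 1])
template, slots = unpack_bundle(bundle)
```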

        Itanium Processor


• Six-issue EPIC processor (up to six instructions per clock cycle) with a
  ten-stage pipeline
• Nine execution units: four ALU/MMX units, two floating-point units and
  three branch units
• Itanium concatenates up to two bundles of independent instructions and
  executes these instructions in parallel in the pipeline
• Future EPIC processors may concatenate more than two bundles
  => in contrast to VLIW, scaling is possible
• Itanium 2 is nearly identical to Itanium but removes some weaknesses
  (shorter cache latencies, faster bus, better x86 emulation)
• Variants of Itanium 2: McKinley (first Itanium 2), Madison (higher clock
  frequency than McKinley), Deerfield (low-power version)
• Montecito is a multicore processor containing two Itanium 2 processor
  cores

       Itanium Processor


   [Figure: block diagram of the Itanium processor]

   • 2 bundles (32 bytes) are fetched from the L1 cache
   • The instruction queue holds 24 IA-64 instructions
   • 6 IA-64 instructions can be issued per clock cycle over the issue ports
   • 9 functional units are available

        Functionality of the dispersal
        window in EPIC



 Dispersed        |        Dispersal Window          |                           Bundle stream
 instructions     |  First Bundle     Second Bundle  |                           from I-Cache
                  |     M F I            M I B       |     M I I       M I B

 Functional units:    M0 M1     I0 I1     F0 F1     B0 B1 B2




         Bundle stream from I-Cache

Depending on the available functional units, one or two bundles are fetched from the I-Cache.

Example: If one of the bundles cannot be mapped completely to the functional units,
         only one new bundle is fetched from the I-Cache.

 Dispersed instructions   |   M I I      M I B      M I B   |   Bundle stream from I-Cache

 Functional units:    M0 M1     I0 I1     F0 F1     B0 B1 B2
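The fetch rule above can be explored with a strongly simplified simulation. It assumes in-order dispersal that stops at the first instruction with no free unit of its type, a window of two bundles, and the unit counts shown in the figure. The function `simulate_dispersal` and this exact policy are illustrative assumptions, not a description of the real Itanium dispersal logic.

```python
from collections import deque

UNITS = {"M": 2, "I": 2, "F": 2, "B": 3}   # functional units per type (from the figure)

def simulate_dispersal(stream, window_size=2):
    """Return the list of instructions issued in each clock cycle."""
    pending = deque(list(b) for b in stream)   # bundles still in the I-Cache stream
    window = deque()                           # bundles in the dispersal window
    issued_per_cycle = []
    while pending or window:
        # Fetch only as many bundles as the window has free places (1 or 2).
        while pending and len(window) < window_size:
            window.append(pending.popleft())
        free = dict(UNITS)
        issued = []
        blocked = False
        for bundle in window:
            while bundle and not blocked:
                kind = bundle[0]
                if free[kind] == 0:
                    blocked = True             # no unit left: stop dispersal this cycle
                else:
                    free[kind] -= 1
                    issued.append(bundle.pop(0))
            if blocked:
                break
        # Completely dispersed bundles leave the window.
        while window and not window[0]:
            window.popleft()
        issued_per_cycle.append(issued)
    return issued_per_cycle

cycles = simulate_dispersal(["MII", "MIB", "MIB"])
```

For the stream M I I, M I B, M I B, the second bundle cannot be mapped completely in the first cycle because both integer units are already taken. It therefore stays in the window and only one new bundle is fetched for the next cycle.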
         IA-64 instruction set architecture


        The IA-64 instruction set architecture contains:

        • A fully predicated instruction set

        • Many registers:

               • 128 integer registers

               • 128 floating-point registers

               • 64 predicate registers

               • 8 branch registers

        • Speculative load instructions



          Predication model


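 Nearly all instructions of the ISA can be guarded by one of the 64 predicate
 registers; a guarded instruction takes effect only if its predicate is true.

 Example:

 p1, p2 <- cmp (x == y)

 (p1) instr

 (p2) instr

 p2 is the complement of p1: the compare sets p1 if x == y and p2 otherwise,
 so exactly one of the two instructions takes effect.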




        Example for an “if-then-else”
        statement
    a) Traditional architecture                      b) EPIC architecture
    The compiler partitions the statement            The “then” path is executed if p1 is true,
    into four basic blocks, which have to            the “else” path if p2 is true.
    be executed serially.

    Consequence: with predication the conditional branch disappears and both
    paths can be scheduled in parallel in a simple way.

              inst                                             inst
              inst                                             inst
    if        ...                                    if        ...
              p1, p2 <- cmp (x==y)                             p1, p2 <- cmp (x==y)
              jump if p2

    then      inst1                                  then      (p1) inst1
              inst2                                            (p1) inst2
              ...                                              ...
              jump

    else      inst3                                  else      (p2) inst3
              inst4                                            (p2) inst4
              ...                                              ...

              inst                                             inst
              inst                                             inst
              ...                                              ...
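The difference between the two variants can be sketched in Python. The model is deliberately minimal and rests on assumptions: registers and predicates are plain dictionaries, and the names `run_branching` and `run_predicated` as well as the operations `x + 1` and `x - 1` are hypothetical stand-ins for the “then” and “else” instructions.

```python
def run_branching(x: int, y: int) -> int:
    """Traditional variant: a conditional branch selects one of two paths."""
    if x == y:
        r = x + 1          # "then" path
    else:
        r = x - 1          # "else" path
    return r

def run_predicated(x: int, y: int) -> int:
    """Predicated variant: no branch, every instruction carries a predicate."""
    regs = {"x": x, "y": y, "r": 0}
    preds = {}
    # p1, p2 <- cmp (x == y): one compare writes a predicate and its complement
    preds["p1"] = regs["x"] == regs["y"]
    preds["p2"] = not preds["p1"]
    code = [
        ("p1", "r", lambda r: r["x"] + 1),   # (p1) "then" instruction
        ("p2", "r", lambda r: r["x"] - 1),   # (p2) "else" instruction
    ]
    for pred, dest, op in code:
        result = op(regs)        # the instruction is processed in any case ...
        if preds[pred]:
            regs[dest] = result  # ... but only takes effect if its predicate is true
    return regs["r"]
```

Both functions return the same result for every input. The predicated version contains no control flow that depends on the comparison, which is what allows both paths to be placed in the same instruction group.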

          Speculative load instructions



   • Speculative load (hoisting) means that a load instruction is
     executed speculatively ahead of a branch instruction
     (i.e. before the basic block it belongs to).

   • This hides memory latency and thereby increases the
     degree of ILP.

   • A new speculative load instruction (ld.s) is introduced, which
     initiates a speculative fetch from memory.

   • A check instruction (chk.s) is used to verify the speculation.




           Example


            Traditional Architecture                          EPIC Architecture (IA-64)
                 ...                                               ...
                                                                   ld.s       (load hoisted above the branch)
                 inst                                              inst
                 inst                                              inst
                 ...                                               ...
                 jump         ------ Barrier ------                jump
                 ...                                               ...
                 load                                              chk.s
                 inst                                              inst
                 ...                                               ...

               In a traditional architecture, the load can be moved up only as far as the
               barrier (the border of the basic block).


          Another example



                • Without control speculation

                  1          (p1) br.cond target 1
                  2          ld4 r1=[r5] ;;
                  3          add r2=r1, r3

                • With control speculation

                  1          ld4.s r1=[r5] ;;
                  2
                  ...        maybe other instructions
                  n          (p1) br.cond target 1
                  n+1        chk.s r1,
                  n+2        add r2=r1, r3
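How the pair of speculative load and check interacts can be sketched as follows. In IA-64, a speculative load that would fault does not raise the exception but marks the target register with a NaT ("Not a Thing") bit, and the check branches to recovery code if that bit is set. Everything else in the sketch is an assumption for illustration: the dictionary `MEMORY`, the helper names, the register contents, and the recovery code, which here simply repeats the load non-speculatively.

```python
MEMORY = {0x100: 42}          # hypothetical memory: only address 0x100 is mapped

def ld(regs, dest, addr):
    """Normal load: an unmapped address faults immediately (KeyError in this model)."""
    regs[dest] = (MEMORY[addr], False)

def ld_s(regs, dest, addr):
    """Speculative load: never faults, a failing access only sets the NaT bit."""
    if addr in MEMORY:
        regs[dest] = (MEMORY[addr], False)
    else:
        regs[dest] = (0, True)            # deferred exception

def chk_s(regs, reg, recovery):
    """Check: run the recovery code if the register carries a NaT bit."""
    _, nat = regs[reg]
    if nat:
        recovery()

def run_speculative(addr, branch_taken):
    regs = {"r3": (8, False)}
    ld_s(regs, "r1", addr)                # 1:   ld4.s r1=[r5], hoisted above the branch
    if branch_taken:                      # n:   (p1) br.cond
        return None                       #      the loaded value is never used
    chk_s(regs, "r1", lambda: ld(regs, "r1", addr))   # n+1: chk.s r1, recovery
    return regs["r1"][0] + regs["r3"][0]  # n+2: add r2=r1, r3
```

If the branch is taken, a failed speculative load has no visible effect. If it is not taken, the fault appears at the check, which is where the original load would have been executed.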



          Comparing superscalar, VLIW
          and EPIC


• All three techniques aim to improve performance by means of concurrent
  execution units
• Ideally, as many instructions are executed per clock cycle as there are
  execution units
• Architecture versus microarchitecture approach:
       • VLIW and EPIC are architecture approaches
       • Superscalar is a microarchitecture approach
• Instruction scheduling and conflict avoidance:
       • VLIW/EPIC: the compiler schedules the assignment of instructions
         to execution units and takes care to avoid conflicts
       • In a superscalar processor, this is done by hardware
       => VLIW/EPIC puts higher demands on the compiler than superscalar

          Comparing superscalar, VLIW
          and EPIC
• Compiler optimization:
    • all three techniques require an optimizing compiler
    • the VLIW and EPIC compiler additionally has to take memory access
      times into account
    • in superscalar processors, memory access is managed by the load/store unit
    • often the same optimization strategies can be used in all three cases

• Instruction ordering:
    • a superscalar processor feeds its execution units from a single
      stream of simple instructions
    • a VLIW processor uses an instruction stream of instruction packages
      (tuples of simple instructions)
    • EPIC can bundle dependent instructions. Template bits have to be
      checked by the processor. Several bundles of independent
      instructions can be executed concurrently. Therefore EPIC is a hybrid
      of superscalar and VLIW

          Comparing superscalar, VLIW
          and EPIC

• Reaction to runtime events: VLIW is not as flexible as superscalar
• Memory organization: superscalar can support memory hierarchies much
                       better than VLIW
• Branch prediction and speculation:
      • dynamic branch prediction is a standard technique in current
        superscalar processors
      • it is impossible in VLIW and hard to realize in EPIC
• Code density:
      • VLIW has a fixed instruction format => code density is lower than in
        superscalar processors if the available instruction level parallelism is
        insufficient to fill the VLIW instruction package
      • EPIC doesn't have this drawback, but the template bits produce some
        overhead
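The code density argument can be made concrete with a small calculation. The VLIW numbers are purely hypothetical (a 4-wide instruction package, 32-bit operations, and an invented schedule). Only the EPIC figure follows from the bundle format given earlier (5 template bits per 128-bit bundle).

```python
def vliw_code_bits(ops_per_cycle, width, op_bits):
    """Fixed format: every cycle costs one full package, unused slots hold NOPs."""
    return len(ops_per_cycle) * width * op_bits

def useful_bits(ops_per_cycle, op_bits):
    """Bits that actually encode useful operations."""
    return sum(ops_per_cycle) * op_bits

# Hypothetical schedule: number of independent operations found for each cycle.
schedule = [4, 1, 2, 1]
WIDTH, OP_BITS = 4, 32

vliw_density = useful_bits(schedule, OP_BITS) / vliw_code_bits(schedule, WIDTH, OP_BITS)

# EPIC: the template costs 5 of the 128 bits of every bundle.
epic_template_overhead = 5 / 128
```

In this invented schedule only 8 of 16 slots carry useful operations, so half of the VLIW code consists of NOPs. The EPIC template overhead is a fixed share of about 3.9 % per bundle.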

          Comparing superscalar, VLIW
          and EPIC

• Achievable performance and fields of application
      • all three techniques reach comparable performance under ideal
        conditions
      • the simplicity of VLIW processors allows a higher clock frequency
        compared to superscalar
      • VLIW is preferable for code with a high degree of parallelism, e.g.
        signal processing
      • general-purpose applications such as text processing, compilers or
        games have a lower degree of parallelism and a higher degree of
        dynamics, thus favoring superscalar
      • EPIC combines VLIW and superscalar, thus avoiding the inflexibility of
        VLIW and the issue complexity of superscalar


