Examples of One-Dimensional Systolic Arrays


  Examples of Systolic Arrays
  Motivation & Introduction
• We need high-performance, special-purpose computer
  systems to meet the needs of specific applications.

• I/O and computation imbalance is a notable problem.

• The concept of Systolic architecture can map high-level
 computation into hardware structures.

• A systolic system works like an automobile assembly line.

• A systolic system is easy to implement because of its
  regularity, and it is easy to reconfigure.

• Systolic architectures can result in cost-effective,
  high-performance special-purpose systems for a wide range
  of problems.
  Pipelined Computations
• A pipelined program is divided into a series of tasks that
  have to be completed one after the other.
• Each task is executed by a separate pipeline stage.
• Data are streamed from stage to stage to form the computation.

   f, e, d, c, b, a   P1   P2   P3   P4   P5
Pipelined Computations
  • Computation consists of data streaming through the pipeline
  • Execution Time = Time to fill pipeline (P-1)
                   + Time to run in steady state (N-P+1)
                   + Time to empty pipeline (P-1)
    where P = # of processors and N = # of data items (assume P < N)
      f, e, d, c, b, a   P1   P2   P3   P4   P5

                 P5               a  b  c  d  e  f
                 P4            a  b  c  d  e  f
                 P3         a  b  c  d  e  f
                 P2      a  b  c  d  e  f
                 P1   a  b  c  d  e  f
                      ---------------------------> time

  (This slide must be explained in full detail. It is very important.)
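
The space-time diagram above can be reproduced with a few lines of Python. This is a minimal sketch, assuming every stage takes exactly one time unit and all stages advance in lockstep; the function names `pipeline_schedule` and `phase_lengths` are illustrative and not part of the original slides.

```python
def pipeline_schedule(items, num_stages):
    """Return one row per time step; row[s] is the item held by stage s (or None).

    Assumption: a lockstep pipeline in which every stage takes one time unit.
    """
    n = len(items)
    schedule = []
    for t in range(n + num_stages - 1):
        row = []
        for stage in range(num_stages):
            k = t - stage  # index of the item that this stage works on at time t
            row.append(items[k] if 0 <= k < n else None)
        schedule.append(row)
    return schedule


def phase_lengths(n, p):
    """Fill, steady-state, and drain lengths for N items and P stages (P < N)."""
    return p - 1, n - p + 1, p - 1


if __name__ == "__main__":
    items = list("abcdef")
    for t, row in enumerate(pipeline_schedule(items, 5)):
        print(f"t{t}:", " ".join(x or "-" for x in row))
    print("fill, steady, drain =", phase_lengths(len(items), 5))
```

With N = 6 items and P = 5 stages the schedule has 10 rows, which matches the sum of the three phases, (P-1) + (N-P+1) + (P-1) = N+P-1.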
 Pipelined Example: Sieve of Eratosthenes
• Goal is to take a list of integers greater than 1 and
  produce a list of primes
   – E.g. For input 2 3 4 5 6 7 8 9 10, output is 2 3 5 7

• A pipelined approach:

   – Processor P_i divides each input by the i-th prime

   – If the input is divisible (and not equal to the divisor), it is
     marked (with a negative sign) and forwarded

   – If the input is not divisible, it is forwarded

   – Last processor only forwards unmarked (positive) data
  Sieve of Eratosthenes Pseudo-Code
• Code for processor P_i (and prime p_i):
   – x=recv(data,P_(i-1))
   – If (x>0) then
       • If (p_i divides x and p_i != x) then
         send(-x, P_(i+1))
       • If (p_i does not divide x or p_i = x) then
         send(x, P_(i+1))
   – Else
       • send(x,P_(i+1))

• Code for last processor:
   – x=recv(data,P_(i-1))
   – If x>0 then send(x, out)

                     P2         P3          P5        P7     out

   (Processor P_i divides each input by the i-th prime.)
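
The marking scheme can be tried out in Python. The sketch below is a sequential stand-in for the pipeline, not a parallel implementation: the inner loop plays the role of the chain of processors, and `sieve_stage` and `pipelined_sieve` are illustrative names. It assumes the list of primes covers every prime up to the square root of the largest input.

```python
def sieve_stage(prime, x):
    """One stage P_i: negate x if `prime` divides it and x is not the prime itself."""
    if x > 0 and x % prime == 0 and x != prime:
        return -x   # marked as composite, forwarded with a negative sign
    return x        # already marked, not divisible, or equal to the prime


def pipelined_sieve(numbers, primes):
    """Send each number through one stage per prime, then drop marked values.

    Assumption: `primes` holds every prime up to sqrt(max(numbers)).
    """
    out = []
    for x in numbers:
        for p in primes:          # sequential stand-in for the chain of processors
            x = sieve_stage(p, x)
        if x > 0:                 # the last processor forwards only unmarked data
            out.append(x)
    return out


if __name__ == "__main__":
    print(pipelined_sieve(range(2, 11), [2, 3, 5, 7]))  # [2, 3, 5, 7]
```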
          Programming Issues
• The algorithm takes N+P-1 steps to run, where N is the number of data
  items and P is the number of processors.
     – Can also consider just the odd numbers or do some initial part separately

• In the given implementation, the processors must store all the
  primes that will appear in the sequence, one prime per processor
     – Not a scalable approach
     – Can fix this by having each processor do the job of multiple primes, i.e.
       mapping logical “processors” in the pipeline to each physical processor
     – What is the impact of this on performance?

               P2      P3      P5      P7      P11     P13     P17

   (One physical processor does the job of three primes.)
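
One way to reason about the performance question is a rough cost model. The sketch below is only an estimate under stated assumptions: consecutive primes are grouped onto one physical processor, every physical stage spends one division per assigned prime on each item, and stages still advance in lockstep. The names `group_primes` and `estimated_time` are hypothetical.

```python
from math import ceil


def group_primes(primes, per_processor):
    """Assign consecutive logical stages (primes) to physical processors."""
    return [primes[i:i + per_processor]
            for i in range(0, len(primes), per_processor)]


def estimated_time(num_items, num_primes, per_processor):
    """Rough run time in units of one division.

    Assumption: each physical stage takes `per_processor` time units per item,
    so the pipeline has ceil(num_primes / per_processor) stages and each
    lockstep step lasts `per_processor` units.
    """
    stages = ceil(num_primes / per_processor)
    return per_processor * (num_items + stages - 1)


if __name__ == "__main__":
    primes = [2, 3, 5, 7, 11, 13, 17]
    print(group_primes(primes, 3))
    print(estimated_time(100, len(primes), 1), estimated_time(100, len(primes), 3))
```

Under this model, grouping three primes per processor cuts the processor count from 7 to 3 but makes each step about three times longer, so the run time for a long input stream grows by roughly that factor.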
   Processors for such operation
• In a pipelined algorithm, data moves through the processors in lockstep.

• The design attempts to balance the work so that there is no bottleneck at any processor.

• In the mid-1980s, processors were developed to support this kind of
  parallel pipelined computation in hardware.

• Two commercial products from Intel:
    – Warp (1D array)
    – iWarp (components for 2D array)

• Warp and iWarp were meant to operate synchronously.

• The Wavefront Array Processor (S.Y. Kung) was meant to operate asynchronously,
    – i.e. arrival of data would signal that it was time to execute
     Example 1: Polynomial Evaluation
  Example 1: “pipelined”
  polynomial evaluation
• Polynomial evaluation is done using a linear array
  of 2n PEs.
• Expression:
 Y = ((((a_n*x + a_(n-1))*x + a_(n-2))*x + a_(n-3))*x + ... + a_1)*x + a_0
• Function of PEs in pairs
   –   1. Multiply input by x
   –   2. Pass result to right.
   –   3. Add a_j to result from left.
   –   4. Pass result to right.
Example 1: polynomial evaluation
  Y = ((((a_n*x + a_(n-1))*x + a_(n-2))*x + a_(n-3))*x + ... + a_1)*x + a_0

  • Using systolic array for polynomial evaluation.
  • x is broadcast.

              x   a_n     x   a_(n-1)   x   a_(n-2)                        a_0

              X    +      X      +      X      +        ..........    X     +

  (X = multiplying processor, + = adding processor)

  • This pipelined array can produce the polynomial's value for a new
    x on every cycle, after an initial latency of 2n stages.
  • Another variant: you can also calculate various
    polynomials on the same x.
  • This is an example of a deeply pipelined computation:
        – The pipeline has 2n stages.
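
The 2n-stage behaviour can be checked with a small cycle-by-cycle simulation. This is a sketch under assumptions, not the slides' hardware design: the first multiplier is fed a_n directly, and each item in flight carries its own x alongside the partial result so that a stream of different x values can share the array. The names `make_stages` and `run_pipeline` are illustrative.

```python
def make_stages(coeffs):
    """Build the 2n alternating stages for coeffs = [a_n, ..., a_1, a_0] (n >= 1)."""
    stages = []
    for a in coeffs[1:]:
        stages.append(lambda x, acc: acc * x)        # multiplying PE
        stages.append(lambda x, acc, a=a: acc + a)   # adding PE that holds a_j
    return stages


def run_pipeline(coeffs, xs):
    """Clock the x values through the array; return (cycle, Y) pairs as they emerge."""
    stages = make_stages(coeffs)
    regs = [None] * len(stages)   # regs[i] = (x, partial result) leaving stage i
    finished = []
    for cycle in range(len(xs) + len(stages) - 1):
        # Update from right to left so each stage reads last cycle's value.
        for i in range(len(stages) - 1, 0, -1):
            src = regs[i - 1]
            regs[i] = None if src is None else (src[0], stages[i](*src))
        if cycle < len(xs):
            x = xs[cycle]
            regs[0] = (x, stages[0](x, coeffs[0]))   # first multiplier computes a_n * x
        else:
            regs[0] = None
        if regs[-1] is not None:
            finished.append((cycle + 1, regs[-1][1]))
    return finished


if __name__ == "__main__":
    # 2x^3 - 3x^2 + 5 evaluated at x = 0, 1, 2, 3
    print(run_pipeline([2, -3, 0, 5], [0, 1, 2, 3]))
```

For this degree-3 polynomial the first result appears after 6 cycles (2n with n = 3), and one further result follows on each later cycle.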
         For you to think about

1. Pipelined Graph Coloring
2. Pipelined Satisfiability
3. Pipelined sorting/absorbing
4. Pipelined decision function like Petrick
5. Pipelined multiplication.
6. Pipelined calculation of (A + B) * (C – D) on
   vectors A, B, C, D.
 Example 2: Matrix Vector Multiplication
          Example 2:
  Matrix Vector Multiplication
• There are many ways to solve matrix problems using
  systolic arrays. Some of the methods are:

   – Triangular array performing Gaussian elimination with
     neighbor pivoting.

   – Triangular array performing orthogonal triangularization.

• Simple matrix multiplication methods are shown in the next slides.
                     Example 2:
             Matrix Vector Multiplication
• Matrix Vector Multiplication:
• Each cell’s function is:
    – 1. To multiply the top and bottom inputs.
    – 2. Add the left input to the product just obtained.
    – 3. Output the final result to the right.
• Each cell consists of an adder and a few registers (multiplication can use the Booth algorithm).
• Or, a cell can include a hardware multiplier.

                           PE1    PE2    PE3
                            p      q      r
                    Example 2:
              Matrix Vector Multiplication
        -     -     i
        -     h     f
        g     e     c
        d     b     -
        a     -     -

       PE1   PE2   PE3    z y x

        p     q     r

  • At time t0, the array receives 1, a, p, q, and r
    (the other inputs are all zero).
  • At time t1, the array receives m, d, b, p, q, and r, etc.
  • The results emerge after 5 steps.
  • Explain how to multiply the first row of the matrix by
    the vector.
  • Explain how data are shifted from left to right in the architecture.

        -     -     i
        -     h     f
        g     e     c
        d     b     -
        a     -     -

       PE1   PE2   PE3    z y x

        p     q     r

  To visualize how it works, it is good to do a snapshot.
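
A snapshot of each time step can be generated with a short simulation. This is a minimal sketch, assuming the vector entries (p, q, r) stay fixed in the cells, the matrix rows enter from the top skewed by one step per cell as in the diagram, the missing entries ("-") are zeros, and the left input of PE1 is always zero. The function name `systolic_matvec` is illustrative.

```python
def systolic_matvec(A, v, trace=False):
    """Multiply matrix A (list of rows) by vector v on a linear array of len(v) cells.

    Cell j holds v[j]; entry A[i][j] reaches cell j at time i + j (skewed input).
    Returns the results in the order they leave the rightmost cell.
    """
    m, n = len(A), len(v)
    right = [0] * n                 # right[j] = value cell j sent right last step
    results = []
    for t in range(m + n - 1):
        new_right = [0] * n
        for j in range(n):
            i = t - j                                # row whose entry reaches cell j now
            top = A[i][j] if 0 <= i < m else 0       # "-" in the diagram is a zero
            left = right[j - 1] if j > 0 else 0      # partial sum from the left neighbor
            new_right[j] = left + top * v[j]
        right = new_right
        if trace:
            print(f"t{t}: outputs of PE1..PE{n} = {right}")
        if t >= n - 1:
            results.append(right[-1])                # a finished row sum leaves the array
    return results


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]            # rows (a b c), (d e f), (g h i)
    print(systolic_matvec(A, [1, 2, 3], trace=True))  # [14, 32, 50]
```

For a 3x3 matrix the loop runs 3 + 3 - 1 = 5 steps, in line with the statement that the results emerge after 5 steps; the first row's product leaves PE3 at the third step.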
   Systolic Algorithms
• Systolic arrays were built to support systolic algorithms,
  a hot area of research in the early 80’s

• Systolic algorithms used pipelining through various
  kinds of arrays to accomplish computational goals:

   – Some of the data streaming and applications were very
     creative and quite complex

   – CMU was a hotbed of systolic algorithm and array research
     (especially H.T. Kung and his group)
   Systolic Arrays from Intel
• Warp and iWarp were examples of systolic arrays
   – Systolic means regular and rhythmic,
   – data was supposed to move through pipelined computational units in a
     regular and rhythmic fashion

• Systolic arrays were meant to be special-purpose processors or co-processors
• They were very fine-grained

• Processors, usually called cells, implement a limited and very simple
  computation

• Communication is very fast, granularity meant to be around one
  Systolic Processors, versus Cellular Automata
     versus Regular Networks of Automata

           Data Path     Data Path       Data Path     Data Path
            Block         Block           Block         Block

                               Systolic processor

            Control       Control         Control       Control
             Block         Block           Block         Block

                               Cellular Automaton

(These slides are for the one-dimensional case only.)
Systolic Processors, versus Cellular Automata
   versus Regular Networks of Automata
            Control       Control         Control       Control
             Block         Block           Block         Block

                 Cellular Automaton (General and Soldiers,
                 Symmetric Function Evaluator)

            Control       Control         Control       Control
             Block         Block           Block         Block

           Data Path     Data Path       Data Path     Data Path
            Block         Block           Block         Block

                          Regular Network of Automata
