A Bridging Model for Multi-Core Computing

A Bridging Model for Multi-Core Computing

Leslie Valiant
Harvard University
Parallel Computing Has Arrived via Multi-Core, but …

1. Why?
2. What is the main challenge?
Why Is Multi-Core Here Anyway?

Not commercial pressure for throughput,
not the result of solving technical problems,
not advances in programmability,
…

but "physics-push": miniaturization is far enough along that it is now silly not to physically intersperse storage and processing.
The Main Challenge

Writing one parallel program is not impossible, but reusing the intellectual property a second time is something else.

Assumption: besides
1. independent tasks, or
2. embarrassing parallelism, or
3. implicit parallelism, or
4. automatic parallelization,
… it will be good to run explicitly designed parallel algorithms.
              Impediments:

1. Multi-core chip designs differ – an
   algorithm efficient for one may not be
   efficient for another.
2. Intellectually challenging.
3. Have to compete with existing sequential
   algorithms that are sometimes well
   understood and highly optimized.
4. Ultimate reward “only” a constant factor
   improvement.
So what makes sequential computing so
 successful?
Bridging Models

[Diagram: a Hardware column and a Software column, to be connected through a common bridging model.]

Key: "→" means "can efficiently simulate on".
Bridging Models

[Diagram: hardware on the left (IBM 1970, DEC 1982, Fujitsu 1994, Lenovo 2006) and software on the right (Quicksort, FFT, Compiler X, Word processor Y), all bridged through the von Neumann model.]

Key: "→" means "can efficiently simulate on".
Bridging Models

[Diagram: hardware and software bridged through Multi-BSP (p1, L1, g1, m1, …).]

Key: "→" means "can efficiently simulate on".
Reward

Portable Parallel Algorithms: an algorithm efficient for all combinations of machine parameters, run in a parameter-aware way.

It needs to be written just once (an "immortal" algorithm).
Multi-BSP

[Diagram: a level j component consists of pj level (j−1) components plus level j memory mj; data rate gj−1 among them, synchronization cost Lj, and data rate gj to the level above.]
Multi-BSP

[Diagram: a level 1 component consists of p1 level 0 components (raw processors) plus level 1 memory m1; data rate g0 = 1, synchronization cost L1 = 0, and data rate g1 to the level above.]
                 Multi-BSP
Like BSP except:
1. not 1 level, but a d-level tree;
2. memory (cache) size m is a further parameter at each level.

i.e. machine H has 4d + 1 parameters:
e.g. d = 3, and
(p1, g1, L1, m1) (p2, g2, L2, m2) (p3, g3, L3, m3)
System of Niagara UltraSparc T1s
• Level 1: 1 core has 1 processor with 4
  threads plus L1 cache:
  (p1 = 4, g1 = 1, L1 = 3, m1 = 8kB).
• Level 2: 1 chip has 8 cores plus L2 cache:
 (p2 = 8, g2 = 3, L2 = 23, m2 = 3MB).
• Level 3: p multi-cores with external
  memory m3 via a network with rate g3:
 (p3 = p, g3 = ∞, L3 = 108, m3 ≤ 128GB).
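The parameter lists above are easy to write down as data. A minimal sketch, assuming a made-up `Level` record type (not from any library); the level 3 entries the slide leaves open (p3 = p, m3 ≤ 128GB) are filled with example values and marked as such.

```python
# Hypothetical encoding of a Multi-BSP machine: a list of
# (p, g, L, m) records, one per level i = 1..d.
from dataclasses import dataclass

@dataclass
class Level:
    p: int      # number of level-(i-1) components per level-i component
    g: float    # data rate to the level above
    L: float    # synchronization cost
    m: float    # memory (cache) size in bytes

# The Niagara UltraSparc T1 description from the slide (d = 3).
# p = 2 machines and m3 = 128 GB are example values; the slide leaves
# p3 open and only bounds m3.
niagara = [
    Level(p=4, g=1, L=3, m=8 * 1024),                    # level 1: core, 8kB L1
    Level(p=8, g=3, L=23, m=3 * 1024**2),                # level 2: chip, 3MB L2
    Level(p=2, g=float("inf"), L=108, m=128 * 1024**3),  # level 3: network
]

d = len(niagara)
num_params = 4 * d + 1   # "machine H has 4d + 1 parameters" (the +1 is d)
```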
               Multi-BSP
Special instances are:

1. Von Neumann, (d = 1, p1 = 1)
2. PRAM, (d = 1)
3. BSPRAM: (p1 = 1, g1 = g, L1 = 0, m1 = m) (p2 = p, g2 = ∞, L2 = L, m2) ≈ BSP(p, g, L, m)
4. Cache hierarchy models (p1 = … = pd = 1)
              Multi-BSP
Numerous Related Models:
BSPRAM (Tiskin, 1998)
BSP with memory parameter (McColl & Tiskin,
 1999)
D-BSP (de la Torre & Kruskal, 1996)
D-BSP: Network-Oblivious Algorithms (Bilardi, Pietracaprina, Pucci & Silvestri, 2007)
Multicore-cache: Blelloch, Chowdhury, Gibbons, Ramachandran, Chen, Kozuch (SODA 2008)
              Bottom Line
Question: How will a good sorting algorithm
 get on to a 4-core chip?

My Answer: Ideally someone will publish an
 algorithm for sorting that is optimal for all
 values of d and (p1, g1, L1, m1) (p2, g2, L2,
 m2) … (pd, gd, Ld, md).

Is this possible for important problems?
           Some Problems
Matrix Multiplication.
FFT.
Sorting.
Associative Composition:
 given x1, …, xn ∈ S (a set with an associative operation) and specifications of disjoint subsequences of 1, 2, …, n, find the products corresponding to the subsequences.
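To pin the problem down, here is a sequential reference version; an illustrative sketch only (the function name is invented), with integer addition standing in for the associative operation.

```python
# Associative composition: given x1, ..., xn from a set S with an
# associative operation, and disjoint subsequences of indices,
# return the "product" of the x's along each subsequence.
# Illustrative sketch; names are invented for this example.
from functools import reduce

def associative_composition(xs, subsequences, op):
    return [reduce(op, (xs[i] for i in seq)) for seq in subsequences]

# Integer addition as the associative operation (0-based indices):
xs = [3, 1, 4, 1, 5, 9, 2, 6]
subs = [[0, 2, 4], [1, 3], [5, 6, 7]]   # disjoint subsequences of indices
result = associative_composition(xs, subs, lambda a, b: a + b)
# result == [12, 2, 17]
```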
Approximation

F1 ≼ F2
if for all ε > 0, F1 < (1 + ε) F2 for all large enough n and m = min {mi | 1 ≤ i ≤ d}.

F1 ≼_d F2
if F1 < c_d F2 for all large enough n and m = min {mi | 1 ≤ i ≤ d}, where c_d can depend only on d (not on pi, gi, Li, mi).
Optimality

A Multi-BSP algorithm A* is optimal with respect to algorithm A if

 (i) Comp(A*) ≼ Comp(A),
 (ii) Comm(A*) ≼_d Comm(A), and
 (iii) Synch(A*) ≼_d Synch(A),

where Comm(A), Synch(A) are optimal among Multi-BSP implementations, and Comp is total computational cost.
Associative Composition Lower Bounds

Theorem Where Qi = tot. no. of level i components:

AC-Comm(n, d) ≽_d Σ_{i=1..d−1} n gi / Qi

AC-Synch(n, d) ≽_d Σ_{i=1..d−1} n Li+1 / (Qi Mi)

Proof Via Hong–Kung, Irony–Toledo–Tiskin.
Associative Composition Upper Bounds

Theorem Where Qi = tot. no. of level i components:

AC-Comm(n, d) ≼_d Σ_{i=1..d−1} n gi / Qi

AC-Synch(n, d) ≼_d Σ_{i=1..d−1} n Li+1 / (Qi mi)

Proof Via Hong–Kung, Irony–Toledo–Tiskin.
Matrix Multiplication Lower Bounds

Theorem For the standard n^3 algorithm:

MM-Comm(n × n, d) ≽_d Σ_{i=1..d−1} n^3 gi / (Qi Mi^{1/2})

MM-Synch(n × n, d) ≽_d Σ_{i=1..d−1} n^3 Li+1 / (Qi Mi^{3/2})
Matrix Multiplication Upper Bounds

Theorem For the standard n^3 algorithm:

MM-Comm(n × n, d) ≼_d Σ_{i=1..d−1} n^3 gi / (Qi mi^{1/2})

MM-Synch(n × n, d) ≼_d Σ_{i=1..d−1} n^3 Li+1 / (Qi mi^{3/2})

Proof Recursive blocking, with care.
Parallel Block Matrix Multiplication

[Diagram: NOT computing each output block as a full block-row times a full block-column, BUT accumulating partial results from individual block products.]
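The "recursive blocking" behind the upper bounds, and the partial-results picture above, can be sketched sequentially: split into quadrants, recurse, and accumulate each block product into the output as a partial sum. A minimal sketch for power-of-two n, in plain Python; a real Multi-BSP version would choose block sizes from the mi and distribute the blocks across components.

```python
# Recursive blocked matrix multiplication: accumulate into the n x n
# block of C at (rc, cc) the product of the blocks of A at (ra, ca)
# and B at (rb, cb). Partial block products are summed into C, as in
# the diagram above. Sketch only: assumes n is a power of two.

def mat_mul_blocked(A, B, C, ra=0, ca=0, rb=0, cb=0, rc=0, cc=0,
                    n=None, threshold=2):
    if n is None:
        n = len(A)
    if n <= threshold:                      # base case: naive multiply
        for i in range(n):
            for k in range(n):
                for j in range(n):
                    C[rc + i][cc + j] += A[ra + i][ca + k] * B[rb + k][cb + j]
        return
    h = n // 2
    for di in (0, h):           # which quadrant row of C
        for dj in (0, h):       # which quadrant column of C
            for dk in (0, h):   # the two partial products for that quadrant
                mat_mul_blocked(A, B, C,
                                ra + di, ca + dk, rb + dk, cb + dj,
                                rc + di, cc + dj, h, threshold)
```

Each recursive call touches only h × h blocks, which is what lets a level-i component work within its mi-sized memory.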
FFT Lower Bounds

Theorem
FFT-Comm(n, d) ≽_d Σ_{i=1..d−1} n log(n) gi / (Qi log(Mi))

FFT-Synch(n, d) ≽_d Σ_{i=1..d−1} n log(n) Li+1 / (Qi log(Mi))
FFT Upper Bounds

Theorem
FFT-Comm(n, d) ≼_d Σ_{i=1..d−1} n log(n) gi / (Qi log(mi))

FFT-Synch(n, d) ≼_d Σ_{i=1..d−1} n log(n) Li+1 / (Qi log(mi))
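The sums in these bounds are easy to evaluate for a concrete machine. A sketch under one assumption spelled out in the comment: Qi, the total number of level-i components, is taken to be the product of the pj for j > i; the function name is invented.

```python
# Evaluate the FFT upper-bound sums
#   Comm  = sum_{i=1..d-1} n log(n) g_i / (Q_i log(m_i))
#   Synch = sum_{i=1..d-1} n log(n) L_{i+1} / (Q_i log(m_i))
# for a machine given as a list of (p, g, L, m) tuples, one per level.
# Assumption (not stated on the slide): Q_i = p_{i+1} * ... * p_d.
import math

def fft_upper_bounds(n, levels):
    d = len(levels)
    nlogn = n * math.log2(n)
    comm = synch = 0.0
    for i in range(1, d):                    # i = 1 .. d-1
        _, g, _, m = levels[i - 1]           # level i parameters
        Q = math.prod(lev[0] for lev in levels[i:])   # Q_i
        L_next = levels[i][2]                # L_{i+1}
        comm += nlogn * g / (Q * math.log2(m))
        synch += nlogn * L_next / (Q * math.log2(m))
    return comm, synch
```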
Sorting Lower Bounds

Theorem For any comparison algorithm:
Sort-Comm(n, d) ≽_d Σ_{i=1..d−1} n log(n) gi / (Qi log(Mi))

Sort-Synch(n, d) ≽_d Σ_{i=1..d−1} n log(n) Li+1 / (Qi log(Mi))
Sorting Upper Bounds

Theorem
Sort-Comm(n, d) ≼_d Σ_{i=1..d−1} n log(n) gi / (Qi log(mi))

Sort-Synch(n, d) ≼_d Σ_{i=1..d−1} n log(n) Li+1 / (Qi log(mi))

Proof Deterministic oversampling.
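The oversampling idea can be sketched as a sequential sample sort: take an oversample, pick p − 1 splitters from it, partition into p buckets (one per component, in Multi-BSP terms), and sort the buckets independently. Illustrative only; it uses random rather than deterministic oversampling, and all names are invented.

```python
# Sample sort with oversampling: draw a sample of size p * c
# (oversampling factor c), take every c-th element as a splitter,
# bucket the input by splitter, and sort buckets independently.
# In Multi-BSP each bucket would go to its own component; here the
# whole thing runs sequentially. Random, not deterministic, sampling.
import bisect
import random

def sample_sort(xs, p=4, c=8):
    if len(xs) <= p * c:                 # small input: sort directly
        return sorted(xs)
    sample = sorted(random.sample(xs, p * c))
    splitters = [sample[i * c] for i in range(1, p)]   # p - 1 splitters
    buckets = [[] for _ in range(p)]
    for x in xs:
        buckets[bisect.bisect_left(splitters, x)].append(x)
    out = []
    for b in buckets:                    # independently sortable buckets
        out.extend(sorted(b))
    return out
```

Oversampling (c > 1) makes the buckets likelier to be of similar size, which is what bounds the per-component memory in the parallel setting.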
Terrifying and ugly many-parameter models for multi-core can sometimes be tamed.

Portable algorithms in this broad parameter space are possible, at least
(i) for some important divide-and-conquer algorithms,
(ii) at this level of "O" analysis.

More detailed analysis is nontrivial, but maybe not rocket science.
Dilemma in Choice of Bridging Model

To express "realities of current MC designs", or to express "the inevitable", as minimally dictated by physics.

"the inevitable": e.g. more memory needs more time to access.

(See also: Blelloch, Chowdhury, Gibbons,
    Ramachandran, Chen, Kozuch (SODA 2008)
    and Chowdhury, Ramachandran, (SPAA 2008):
    a cache model more directly oriented to
    existing architectures.)
              Some Choices
Multi-BSP assumes:
(i) global synchronization across the cores in a
     component, and
(ii) a cache protocol: data changed prior to last
     synch. is swapped out in preference to that
     changed since.

N.B. (i) can be implemented efficiently in existing
    MC designs (Sampson et al. 2005)
Thesis

We will need to agree on some multi-parameter bridging model for parallel algorithm development and use, for multi-core to prosper.
Bridging Models

[Diagram: hardware on the left; applications software, algorithms, and emulation software on the right; all bridged through (p1, L1, g1, m1, …).]

Key: "→" means "can efficiently simulate on".

				