Multithreaded Algorithms for Graph Coloring

Alex Pothen
Purdue University
CSCAPES Institute
www.cs.purdue.edu/homes/apothen/

Assefaw Gebremedhin, Mahantesh Halappanavar (PNNL),
John Feo (PNNL), Umit Catalyurek (Ohio State)


CSC’11 Workshop
May 2011
References
• Multithreaded algorithms for graph coloring. Catalyurek, Feo, Gebremedhin,
  Halappanavar and Pothen, 40 pp., submitted to Parallel Computing.
• New multithreaded ordering and coloring algorithms for multicore
  architectures. Patwary, Gebremedhin, Pothen, 12 pp., EuroPar 2011.
• Graph coloring for derivative computation and beyond: Algorithms, software
  and analysis. Gebremedhin, Nguyen, Pothen, and Patwary, 32 pp., submitted
  to TOMS.
• Distributed Memory Parallel Algorithms for Matching and Coloring.
  Catalyurek, Dobrian, Gebremedhin, Halappanavar, and Pothen, 10 pp.,
  IPDPS Workshop PCO, 2011.
[Figure: diagram with the labels Algorithm, Architecture, Graph, Performance, and Latency Tolerance]
Outline
   The many-core and multi-threaded world
    ◦ Intel Nehalem
    ◦ Sun Niagara
    ◦ Cray XMT
   A case study on multithreaded graph coloring
    ◦ An Iterative and Speculative Coloring Algorithm
    ◦ A Dataflow algorithm
   RMAT graphs: ER, G, and B
   Experimental results
   Conclusions


Architectural Features

Proc.           Threads/Core   Cores/Socket   Threads   Cache       Clock     Multithreading, Other Detail
Intel Nehalem   2              4              16        Shared L3   2.5 GHz   Simultaneous; cache coherence protocol
Sun Niagara 2   8              8              128       Shared L2   1.2 GHz   Simultaneous
Cray XMT        128            128 procs.     16,384    None        500 MHz   Interleaved; fine-grained synchronization

• B = max back degree over entire seq.
• B+1 colors suffice to color G.
• O(|E|)-time implementations possible for all four
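
The bound quoted above (B + 1 colors suffice when B is the maximum back degree) can be checked on a small case. The sketch below assumes that the back degree of a vertex is the number of its neighbors that appear earlier in the ordering; the slide defining it is not part of this text, so treat that reading, the helper names, and the example graph as illustrative only.

```python
def max_back_degree(adj, order):
    """Return B, the largest number of earlier-ordered neighbors of any vertex."""
    position = {v: i for i, v in enumerate(order)}
    return max(
        sum(1 for w in adj[v] if position[w] < position[v])
        for v in order
    )


def greedy_in_order(adj, order):
    """Color vertices in the given order with the smallest permissible color."""
    color = {}
    for v in order:
        # Only already-colored (earlier) neighbors can forbid a color.
        forbidden = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in forbidden:
            c += 1
        color[v] = c
    return color


# Hypothetical example: a 5-cycle 0-1-2-3-4 with one chord (0-2).
adj = {
    0: [1, 4, 2],
    1: [0, 2],
    2: [1, 3, 0],
    3: [2, 4],
    4: [3, 0],
}
order = [0, 1, 2, 3, 4]
B = max_back_degree(adj, order)
coloring = greedy_in_order(adj, order)
num_colors = len(set(coloring.values()))
```

When a vertex is colored, at most B of its neighbors already hold a color, so one of the first B + 1 colors is always free; that is why the greedy pass never exceeds the bound.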
Multithreaded: Iterative Greedy Algorithm

[Figure: iterative greedy algorithm; labels: Forbidden Colors, ci, V, v]
Multi-threaded: Data Flow Algorithm

[Figure: data flow algorithm; labels: Forbidden Colors, ci, V, v]
RMAT Graphs


   R-MAT: Recursive MATrix method
   Experiments
    ◦ RMAT-ER (0.25, 0.25, 0.25, 0.25)
    ◦ RMAT-G (0.45, 0.15, 0.15, 0.25)
    ◦ RMAT-B (0.55, 0.15, 0.15, 0.15)
   Chakrabarti, D. and Faloutsos, C. 2006. Graph mining:
    Laws, generators, and algorithms. ACM Comput. Surv. 38,
    1.
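
As a rough illustration of the recursive matrix method, the sketch below places each edge by repeatedly choosing one of the four quadrants of the adjacency matrix with probabilities (a, b, c, d), with a as the top-left and d as the bottom-right quadrant. The function names, the seed handling, and the choice not to remove duplicate edges or self-loops are assumptions made for this example; the generator actually used in the experiments may differ.

```python
import random


def rmat_edge(scale, probs, rng):
    """Pick one (row, col) entry of a 2**scale by 2**scale adjacency matrix."""
    a, b, c, d = probs
    row = col = 0
    for _ in range(scale):
        r = rng.random()
        row <<= 1
        col <<= 1
        if r < a:
            pass                # top-left quadrant
        elif r < a + b:
            col |= 1            # top-right quadrant
        elif r < a + b + c:
            row |= 1            # bottom-left quadrant
        else:
            row |= 1            # bottom-right quadrant
            col |= 1
    return row, col


def rmat_edges(scale, num_edges, probs, seed=0):
    """Generate num_edges edges; duplicates and self-loops are kept."""
    rng = random.Random(seed)
    return [rmat_edge(scale, probs, rng) for _ in range(num_edges)]


# Parameter sets from the experiments listed above.
RMAT_ER = (0.25, 0.25, 0.25, 0.25)
RMAT_G = (0.45, 0.15, 0.15, 0.25)
RMAT_B = (0.55, 0.15, 0.15, 0.15)

edges = rmat_edges(scale=10, num_edges=8192, probs=RMAT_B)
```

With equal probabilities (RMAT-ER) every matrix entry is equally likely; the more skewed the probabilities, the more edges concentrate on a few vertices.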
RMAT Graphs

[Figure: adjacency matrix divided into quadrants a, b (top row) and c, d (bottom row)]
Nehalem: Strong Scaling (Niagara)

[Figure: strong scaling plots; panels: RMAT-ER, RMAT-G, RMAT-B]
Cray XMT: Strong and Weak Scaling

[Figure: scaling plots; panels: Iter-G, Iter-B, DF-G, DF-B]
Comparing Three Platforms

[Figure: platform comparison plots; panels: a) ER, c) Good, e) Bad]
No. Colors in Parallel Algorithms

[Figure: number of colors used; panels: a) ER, b) G, c) B]
Computing SL Orderings in Parallel: RMAT-G graphs (Nehalem)

[Figure: two panels: SL Ordering, Relaxed SL Ordering]
  Our contributions: Multithreaded Coloring
 Massive multithreading
    ◦ Can tolerate memory latency for graphs/sparse matrices
    ◦ Dataflow algorithms easier to implement than distributed memory
      versions
    ◦ Thread concurrency ameliorates lack of caches, and lower clock speeds
    ◦ Thread parallelism can be exploited at fine grain if supported by
      lightweight synchronization
    ◦ Graph structure critically influences performance

   Many-core machines
    ◦ Developed an iterative algorithm for greedy coloring (distance-1 and -2)
      and ordering algorithms that port to different machines
    ◦ Simultaneous multithreading can hide latency (X threads on 1 core vs. 1
      thread on X cores)
    ◦ Decompose into tasks at a finer grain than the distributed-memory version,
      and relax synchronization to enhance concurrency
    ◦ Will form nodes of Peta- and Exa-scale machines, so single node
      performance studies are needed
Multi-threaded Parallelism

[Figure: thread activity plotted against Time. Figure from Robert Golla, Sun.]

• Memory access times determine performance
• Issuing multiple threads masks memory latency, provided a ready
  thread is available when a functional unit becomes free
• Interleaved vs. simultaneous multithreading (IMT or SMT)
Multi-core: Sun Niagara 2




• Two 8-core sockets
• 8 hw threads per core
• 1.2 GHz processors linked by 8 x 9 crossbar to L2 cache banks
• Simultaneous multithreading
• Two threads from a core can be issued in a cycle
• Shallow pipeline
Multicore: Intel Nehalem




• Two quad-core sockets, 2.5 GHz
• Two hyperthreads per core support SMT
• Off-chip data latency 106 cycles
• Advanced architectural features: cache coherence protocol to reduce
  traffic, loop-stream detection, improved branch prediction,
  out-of-order execution
     Massive Multithreading: Cray XMT
   Latency tolerance via massive multi-threading
    ◦ Context switch between threads in a single clock cycle
    ◦ Global address space, hashed to memory banks to reduce hot-spots
    ◦ No cache or local memory, average latency 600 cycles
   Memory request doesn’t stall processor
    ◦ Other threads work while the request is fulfilled
   Light-weight, word-level synchronization (full/empty bits)
   Notes:
    ◦ 500 MHz clock
    ◦ 128 hardware thread streams per processor
    ◦ Interleaved multithreading
Multithreaded Algorithms for Graph
Coloring
◦ We developed two kinds of multithreaded
  algorithms for graph coloring:
  An iterative, coarse-grained method for generic shared-memory
   architectures
  A dataflow algorithm designed for massively multithreaded
   architectures with hardware support for fine-grain synchronization,
   such as the Cray XMT
◦ Benchmarked the algorithms on three systems:
  Cray XMT, Sun Niagara 2 and Intel Nehalem
◦ Excellent speedup observed on all three platforms
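
The dataflow idea can be illustrated in ordinary Python. This sketch assumes, since the slides do not spell it out, that each vertex waits for the colors of its lower-numbered neighbors before choosing its own. Here `threading.Event` stands in for the XMT's full/empty bits, which on the real machine are handled in hardware per memory word; the function name and the example graph are hypothetical.

```python
import threading


def dataflow_coloring(adj):
    """Color vertices 0..n-1; each waits on its lower-numbered neighbors."""
    n = len(adj)
    color = [None] * n
    ready = [threading.Event() for _ in range(n)]   # emulates a full/empty bit

    def color_vertex(v):
        forbidden = set()
        for w in adj[v]:
            if w < v:
                ready[w].wait()          # block until color[w] is "full"
                forbidden.add(color[w])
        c = 0
        while c in forbidden:
            c += 1
        color[v] = c
        ready[v].set()                   # mark color[v] as "full"

    # One thread per vertex keeps the sketch simple; it only suits small graphs.
    threads = [threading.Thread(target=color_vertex, args=(v,)) for v in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return color


# Hypothetical 6-vertex example graph as adjacency lists.
adj = [[1, 2], [0, 2, 3], [0, 1, 4], [1, 4, 5], [2, 3, 5], [3, 4]]
colors = dataflow_coloring(adj)
```

Because every vertex waits only on lower-numbered neighbors, the dependencies cannot form a cycle, no conflicts arise, and the result matches serial greedy coloring in the same vertex order.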



Coloring Algorithms




Greedy coloring algorithms
  Distance-k, star, and acyclic coloring are NP-hard
  Approximating coloring to within O(n^{1-ε}) is NP-hard for any ε > 0
     GREEDY(G=(V,E))
        Order the vertices in V
        for i = 1 to |V| do
           Determine colors forbidden to vi
           Assign vi the smallest permissible color
        end-for


    A greedy heuristic usually gives a near-optimal solution
    The key is to find good orderings for coloring, and many have
     been developed
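
A minimal Python sketch of the GREEDY pseudocode above, assuming the graph is given as adjacency lists for vertices 0..n-1. The function name and the "stamped" forbidden array, which avoids clearing the array for every vertex, are choices made for this example and not necessarily those of the authors' implementation.

```python
def greedy_coloring(adj, order=None):
    """GREEDY: visit vertices in `order`, give each the smallest permissible color."""
    n = len(adj)
    if order is None:
        order = range(n)        # natural order; any vertex ordering may be passed
    color = [-1] * n            # -1 means "not yet colored"
    # forbidden[c] == v means color c is already used by a neighbor of v.
    # Stamping entries with v means the array never has to be reset.
    forbidden = [-1] * (n + 1)
    for v in order:
        for w in adj[v]:
            if color[w] != -1:
                forbidden[color[w]] = v
        c = 0
        while forbidden[c] == v:
            c += 1
        color[v] = c
    return color


# Hypothetical example: a 5-cycle, which needs three colors.
cycle = [[1, 4], [0, 2], [1, 3], [2, 4], [3, 0]]
cycle_colors = greedy_coloring(cycle)
```

Each vertex scans its adjacency list once and tries at most degree + 1 colors, so the whole pass does work proportional to the number of edges plus the number of vertices.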


Ref: Gebremedhin, Tarafdar, Manne, Pothen, SIAM J. Sci. Comput. 29:1042-1072, 2007.
Distance-1 Coloring, Greedy Alg.

[Figure: greedy distance-1 coloring example; labels: a, v]
    Many-core greedy coloring
• Given a graph, parallelize greedy coloring on many-core machines
  such that speedup is attained and the number of colors is roughly the same as in the serial algorithm
• A difficult task, since greedy coloring is inherently sequential, computation is small
  relative to communication, and data accesses are irregular
• D1 coloring: approaches based on Luby's parallel algorithm for maximal
  independent set had limited success
• Gebremedhin and Manne (2000) developed a parallel greedy coloring
  algorithm on shared memory machines
    ◦ Uses speculative coloring to enhance concurrency, randomized
      partitioning to reduce conflicts, and serial conflict resolution
    ◦ Number of conflicts is bounded, so this approach yields an effective
      algorithm
    ◦ Extended to distance-2 coloring by Gebremedhin, Manne and Pothen (2002)
• We adapt this approach to implement the greedy algorithm for many-core
  computing
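
The speculate-then-repair pattern described above can be sketched as a deterministic, single-threaded simulation. Vertices in the same "step" are colored as if by simultaneous threads, each seeing only colors committed in earlier steps. The step model, the rule that the higher-numbered endpoint of a conflicting edge is recolored, and all names are assumptions for illustration; randomized partitioning is omitted.

```python
def iterative_coloring(adj, num_threads=4):
    """Simulate rounds of speculative coloring followed by conflict detection."""
    n = len(adj)
    color = [-1] * n                # -1 means "not colored"
    work = list(range(n))           # vertices still to be colored
    rounds = 0
    while work:
        rounds += 1
        # Phase 1: speculative coloring, num_threads vertices per step.
        for start in range(0, len(work), num_threads):
            step = work[start:start + num_threads]
            chosen = {}
            for v in step:
                forbidden = {color[w] for w in adj[v] if color[w] != -1}
                c = 0
                while c in forbidden:
                    c += 1
                chosen[v] = c
            for v, c in chosen.items():    # commit the whole step at once
                color[v] = c
        # Phase 2: conflict detection. The higher-numbered endpoint of each
        # conflicting edge is queued for recoloring in the next round.
        conflicts = [v for v in work
                     if any(w < v and color[w] == color[v] for w in adj[v])]
        for v in conflicts:
            color[v] = -1
        work = conflicts
    return color, rounds


# Hypothetical 6-vertex example graph as adjacency lists.
adj = [[1, 2], [0, 2, 3], [0, 1, 4], [1, 4, 5], [2, 3, 5], [3, 4]]
colors, rounds = iterative_coloring(adj, num_threads=2)
serial_colors, serial_rounds = iterative_coloring(adj, num_threads=1)
```

The lowest-numbered vertex in each round never loses a conflict, so the work list shrinks every round and the loop terminates. With `num_threads=1` the simulation reduces to serial greedy coloring and finishes in one round.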


Parallel Coloring




Parallel Coloring: Speculation

[Figure: speculative coloring example; labels: a, v, w]
Experimental results

[Figure: two panels, Iterative and Dataflow. Cray XMT: RMAT-G with 2^24, …, 2^27 vertices and 134M, …, 1B edges]
Experimental results

[Figure: Iterative algorithm on Niagara 2]

Performance gain from doubling the threads on a core = gain from doubling the cores!
Experimental results

[Figure: All Platforms. Panels: RMAT-G with 2^24 = 16M vertices and 134M edges; RMAT-B with 2^24 vertices and 134M edges]
Iterative Greedy Coloring: Multithreaded Algorithm

[Figure: algorithm listing, annotated with memory accesses per vertex v]
• Adj(v), color(w), forbidden(v): d(v) reads each; forbidden(v): d(v) writes
• Adj(v), color(w): d(v) reads each
Experimental results

[Figure: All Platforms. RMAT-G with 2^24 = 16M vertices and 134M edges]
Tentative Conclusions, Future Work
  Future Plans: Multithreaded Coloring
 Massive multithreading
    ◦ Microbenchmarking to understand where the cycles go: thread management,
      data accesses, synchronization, instruction scheduling, functional unit
      limitations…
    ◦ Develop a performance model of the computation
    ◦ Experiment with other graph classes
    ◦ Consider new algorithmic paradigms

   Many-core machines
    ◦ Four items as above
    ◦ Ordering for coloring: Archetype of a problem for computing a sequential
      ordering in a parallel environment (Mostofa Patwary and Assefaw
      Gebremedhin)
    ◦ Extend to nodes of Peta-scale machines, so single node performance is
      enhanced, and complete our work on the Blue Gene and the Cray XT5




Thanks


    Rob Bisseling, Erik Boman, Ümit Çatalyürek,
    Karen Devine, Florin Dobrian, John Feo,
    Assefaw Gebremedhin, Mahantesh Halappanavar,
    Bruce Hendrickson, Paul Hovland, Gary
    Kumfert, Fredrik Manne, Ali Pınar, Sivan Toledo,
    Jean Utke




Further reading
www.cscapes.org
 Gebremedhin and Manne, Scalable parallel graph
  coloring algorithms, Concurrency: Practice and
  Experience, 12: 1131-1146, 2000.
 Gebremedhin, Manne and Pothen, Parallel distance-k
  coloring algorithms for numerical optimization,
  Lecture Notes in Computer Science, 2400: 912-921,
  2002.
 Bozdag, Gebremedhin, Manne, Boman and Catalyurek.
  A framework for scalable greedy coloring on
  distributed-memory parallel computers. J. Parallel
  Distrib. Comput. 68(4):515-535, 2008.
 Catalyurek, Feo, Gebremedhin, Halappanavar and
  Pothen, Multi-threaded algorithms for graph coloring,
  Preprint, Aug. 2010.