Engineering Distributed Graph Algorithms in PGAS languages

Guojing Cong, IBM Research
Joint work with George Almasi and Vijay Saraswat
Programming language from the perspective
of a not-so-distant admirer
Mapping graph algorithms onto distributed-memory
machines has been a challenge
• Efficiently mapping PRAM algorithms onto SMPs is hard
• Mapping onto a cluster of SMPs is even harder
• Optimizations are available and shown to improve performance
• Can these be somehow automated with help from the language
  design, compiler and runtime development?
• Expectations of the languages
    – Expressiveness
        • SPMD, task parallelism (spawn/async), pipeline, future, virtual shared-memory
          abstraction, work-stealing, data distribution, …
        • Ease of programming
    – Efficiency
        • Mapping high level constructs to run fast on the target machine
             –   SMP
             –   Multi-core, multi-threaded
             –   MPP
             –   Heterogeneous with accelerators
        • Leverage for tuning
A case study with connected components on
        a cluster of SMPs with UPC

• A connected component of an undirected graph G=(V,E), |V|=n,
  |E|=m, is a maximal connected subgraph
   – A connected components algorithm finds all such components in G
• Sequential algorithms
   – Breadth-first traversal (BFS)
   – Depth-first traversal (DFS)
• One parallel algorithm -- Shiloach-Vishkin algorithm (SV82)
   – Edge list as input
   – Adopts the graft and shortcut approach
       • Start with n isolated vertices.
       • Graft vertex v to a neighbor u with (u < v)
       • Shortcut the connected components into super-vertices and continue on the
         reduced graph
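
The graft and shortcut steps above can be sketched sequentially in C. This is only an illustration under stated assumptions, not the SV82 parallel algorithm itself or the code behind these slides: vertices are numbered from 0, the edge list is given as two arrays `eu` and `ev`, and the function name `cc_graft_shortcut` and the label array `D` are chosen here for illustration.

```c
#include <stddef.h>

/* Sequential sketch of graft-and-shortcut connected components.
 * D[v] is the current label (parent) of vertex v. On return, D[v] is the
 * smallest vertex id in v's component. Edge i joins eu[i] and ev[i]. */
void cc_graft_shortcut(size_t n, size_t m,
                       const size_t *eu, const size_t *ev, size_t *D)
{
    /* Start with n isolated vertices: each vertex is its own root. */
    for (size_t v = 0; v < n; v++)
        D[v] = v;

    int changed = 1;
    while (changed) {
        changed = 0;

        /* Graft: hook a root with a larger label onto a smaller label. */
        for (size_t i = 0; i < m; i++) {
            size_t du = D[eu[i]], dv = D[ev[i]];
            if (du < dv && D[dv] == dv) {
                D[dv] = du;
                changed = 1;
            } else if (dv < du && D[du] == du) {
                D[du] = dv;
                changed = 1;
            }
        }

        /* Shortcut: pointer-jump until every vertex points at its root. */
        for (size_t v = 0; v < n; v++)
            while (D[v] != D[D[v]])
                D[v] = D[D[v]];
    }
}
```

Because a root is only ever hooked onto a smaller label, the pointers cannot form a cycle, and the loop stops once a full pass over the edge list grafts nothing.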
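Example: SV
[Figure: input graph on vertices 1, 2, 3, 4. 1st iteration: graft, then shortcut into super-vertices {1,4} and {2,3}. 2nd iteration: the reduced graph on super-vertices 1 and 2.]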
Simple? Yes, performs poorly
[Figure: two plots measured on a Sun Enterprise E4500. Left: Random Graph (1M vertices, 20M edges), execution time vs. number of processors (2 to 12). Right: Random Graph, 1M vertices, 10M edges, time (seconds) vs. number of processors (2 to 10); legend entry: Prim.]

                                     • Memory-intensive, irregular accesses, poor temporal
Typical behavior of graph algorithms

 • CPI construction             • LRU stack distance plot

 • BC – betweenness centrality
 • BiCC – biconnected components
 • MST – minimum spanning tree
 On distributed-memory machines
• Random access and indirection make it hard to
   – Implement, e.g., there is no fast MPI implementation
   – Optimize, i.e., random access creates problems for both
     communication and cache performance

• The partitioned global address space (PGAS) paradigm
   – presents a shared-memory abstraction to the programmer for
     distributed-memory machines, and has received a fair amount of attention
   – allows the programmer to control the data layout and work
   – improves ease of programming, and also gives the programmer
     leverage to tune for high performance
    Implementation in UPC is

UPC implementation   Pthread implementation
Performance is miserable
Communication efficient algorithms
•   Proposed to address the “bottleneck of processor-to-processor
     – Goodrich [96] presented a communication-efficient sorting algorithm on
       weak-CREW BSP that runs in O(log n / log(h + 1)) communication rounds and
       O((n log n)/p) local computation time, for h = Θ(n/p)
     – Adler et al. [98] presented a communication-optimal MST algorithm
     – Dehne et al. [02] designed an efficient list ranking algorithm for coarse-grained
       multicomputers (CGM) and BSP that takes O(log p) communication rounds with
       O(n/p) local computation
•   Common approach
     – simulating several (e.g., O(log p) or O(log log p) ) steps of the PRAM algorithms
       to reduce the input size so that it fits in the memory of a single node
     – A “sequential” algorithm is then invoked to process the reduced input of size
     – finally the result is broadcast to all processors for computing the final solution
•   Question
     – How well do communication-efficient algorithms work in practice?
     – How fast can optimized shared-memory based algorithms run? Cache
       performance vs. communication performance
     – Can these optimizations be automated through necessary language/compiler
   Locality-central optimization
• Improve locality behavior of the algorithm
  – The key performance issues are communication and
    cache performance
  – Determined by locality
• Many prior cache-friendly results, but no tangible
  practical evidence
  – Fine-grain parallelism makes it hard to optimize for
    temporal locality
  – Focus on spatial locality
     • To take advantage of large cache lines, hardware
       prefetching, software prefetching
Scheduling of the memory accesses in a
              parallel loop
Typical loop in CC

Generic loop
An example
       Mapping to the distributed
• All remote accesses are
  consecutive in our
• If the runtime provides
  remote prefetching or
  coalescing, then
  communication efficiency
  can be improved
• If not, coalescing can be
  easily done at the
  program level as shown
  on right
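
Since the program-level code from the slide is not reproduced in this text, here is a minimal C sketch of one way such coalescing could look: group the requested indices by owning thread so that each owner can be served with one bulk transfer. The block-cyclic layout, the names `owner_of` and `coalesce_requests`, and the parameters `block` and `nthreads` are assumptions made for illustration, not the authors' code.

```c
#include <stddef.h>

/* Assumed layout (hypothetical): element i of the distributed array
 * lives on thread (i / block) % nthreads. */
static size_t owner_of(size_t i, size_t block, size_t nthreads)
{
    return (i / block) % nthreads;
}

/* Group the k requested indices by owning thread with a counting sort.
 * On return, out[start[t]] .. out[start[t + 1] - 1] are the indices owned
 * by thread t, in their original relative order.
 * start must have room for nthreads + 1 entries. */
void coalesce_requests(const size_t *idx, size_t k,
                       size_t block, size_t nthreads,
                       size_t *out, size_t *start)
{
    for (size_t t = 0; t <= nthreads; t++)
        start[t] = 0;

    /* Count requests per owner, shifted by one slot. */
    for (size_t j = 0; j < k; j++)
        start[owner_of(idx[j], block, nthreads) + 1]++;

    /* Prefix sum turns the counts into batch start offsets. */
    for (size_t t = 0; t < nthreads; t++)
        start[t + 1] += start[t];

    /* Scatter, using start[t] as the write cursor for owner t. */
    for (size_t j = 0; j < k; j++)
        out[start[owner_of(idx[j], block, nthreads)]++] = idx[j];

    /* Each cursor now sits at the next batch's start; shift back. */
    for (size_t t = nthreads; t > 0; t--)
        start[t] = start[t - 1];
    start[0] = 0;
}
```

After the call, each batch could be fetched with a single bulk operation per owner instead of one remote access per index.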
Performance improvement due to
    communication efficiency
      Applying the approach to single-node
            for cache-friendly design

• Apply as many levels of recursions as necessary
• Simulate the recursions with virtual threads
• Assuming a large-enough, one level, fully associative cache

[Figure: CC, normalized execution time (y-axis, about 0.40 to 0.75) against an x-axis running from 1 to 20; series: 100M, 400M; 100M, 1G; 200M, 800M. Labels: Original execution time, Optimized execution time.]
   Graph-specific optimization
• Compact edge list
  – the size of the list determines the number of elements
    to request from remote nodes
  – edges within components no longer contribute to the
    merging of connected components, and can be
    filtered out
• Avoid communication hotspot
  – Grafting in CC shoots a pointer from a vertex with
    larger numbering to one with smaller numbering.
  – Thread thr0 owns vertex 0, and may quickly become
    a communication hotspot
  – Avoid querying thr0 about D[0]
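
The edge-list compaction described above can be sketched in C as an in-place filter. The function name `compact_edges` and the array names are illustrative assumptions; the sketch also assumes `D` has already been shortcut, so that two vertices in the same component carry the same label.

```c
#include <stddef.h>

/* Keep only edges whose endpoints carry different component labels in D.
 * Compacts eu/ev in place and returns the new edge count. */
size_t compact_edges(size_t m, size_t *eu, size_t *ev, const size_t *D)
{
    size_t kept = 0;
    for (size_t i = 0; i < m; i++) {
        if (D[eu[i]] != D[ev[i]]) {
            /* This edge still joins two different components. */
            eu[kept] = eu[i];
            ev[kept] = ev[i];
            kept++;
        }
    }
    return kept;
}
```

If `D` is not fully shortcut, the filter is merely conservative: it may keep an edge inside a component, but it never drops an edge that joins two components.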
      UPC specific optimization
• Avoid runtime cost on local data
   – After optimization, all direct accesses to the shared arrays are local
   – Yet the compiler is not able to recognize
   – With UPC, we use private pointer arithmetics for
• Avoid intrinsics
   – It is costly to invoke compiler intrinsics to determine the target
     thread id
   – Computing target thread ids is done for every iteration.
   – We compute these ids directly instead of invoking the intrinsics.
   – Noticing that the target ids do not change across iterations, we
     compute them once and store them in a global buffer.
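
As a rough illustration of computing target thread ids directly and caching them, the C sketch below assumes a block-cyclic layout in which element i has affinity to thread (i / block) % nthreads, which is how a UPC array declared with block size `block` is distributed. The slides do not state the layout actually used, and the names `target_thread` and `cache_target_threads` are hypothetical.

```c
#include <stddef.h>

/* Owner of element i under the assumed block-cyclic layout.
 * Plain integer arithmetic, used in place of a runtime query such as
 * upc_threadof. */
static inline size_t target_thread(size_t i, size_t block, size_t nthreads)
{
    return (i / block) % nthreads;
}

/* Compute the target thread of every request once and store it in buf,
 * so later iterations can read buf[j] instead of recomputing. */
void cache_target_threads(const size_t *idx, size_t k,
                          size_t block, size_t nthreads, size_t *buf)
{
    for (size_t j = 0; j < k; j++)
        buf[j] = target_thread(idx[j], block, nthreads);
}
```

The buffer is filled before the first iteration and reused afterwards, which is valid only as long as the requested indices stay the same across iterations.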
Performance Results
[Figure: four plots. Top row: time (seconds, log scale from 10 to 1000) vs. # threads (16, 32, 64, 128, 256) for Random Graph, 100M vertices, 400M edges and for Random Graph, 100M vertices, 1G edges; series: Optimized, SMP, BFS. Bottom row: time (seconds) by implementation (base, compact, offload, circular, localcpy, id) for Hybrid Graph, 100M vertices, 400M edges and for Random Graph, 100M vertices, 400M edges; components: Comm, Sort, Copy, Irregular, Work, Setup.]
       So, how helpful is UPC
• Straightforward mapping of shared-memory
  algorithms is easy
  – quick prototyping
  – Quick profiling
  – Incremental optimization (10 versions for CC)
• All other optimizations are manual
• Many of them can be automated, though
• UPC is not flexible enough to expose the
  hierarchy of nodes and processors to the
    Conclusion and future work
• We show that with appropriate optimizations, shared-memory graph
  algorithms can be mapped to the PGAS environment with high
• On inputs that fit in the main memory on one node, our
  implementation achieves good speedups over the best SMP
  implementation and the best sequential implementation.
• Our results suggest that effective use of processors and caches can
  bring better performance than simply reducing the communication
• Automating these optimizations is our future work
