					Principles of Parallel Algorithm Design
           Carl Tropper
   Department of Computer Science
        What has to be done
• Identify concurrency in the program
• Map concurrent pieces to parallel processes
• Distribute input, output and intermediate data
• Manage accesses to data shared by multiple processors
• Synchronize processors as the program executes
• Tasks
• Task Dependency graph
Matrix vector multiplication
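A minimal serial sketch (not from the original slides) of the decomposition behind this example: in y = A·b, task i reads row i of A and all of b but writes only y[i], so all n output entries are independent tasks.

```c
#include <stdio.h>

#define N 4

/* Dense matrix-vector multiply y = A*b.  Each iteration of the
 * outer loop is an independent task: task i reads row i of A and
 * all of b, and writes only y[i]. */
void matvec(const double A[N][N], const double b[N], double y[N]) {
    for (int i = 0; i < N; i++) {          /* task i */
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * b[j];
    }
}

int main(void) {
    double A[N][N] = {{1,0,0,0},{0,2,0,0},{0,0,3,0},{0,0,0,4}};
    double b[N] = {1, 1, 1, 1}, y[N];
    matvec(A, b, y);
    for (int i = 0; i < N; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```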
           Database Query
• Model = civic and year = 2001 and
  (color = green or color = white)
Data Dependencies
Another graph
                       Task talk
• Task Granularity
      • Fine grained, coarse grained
• Degree of concurrency
      • Average degree-average number of tasks which can run in parallel
      • Maximum degree
• Critical path
      • Length-sum of the weights of the nodes on the path
      • Average degree of concurrency=total work/length
      Task interaction graph
• Nodes are tasks
• Edges indicate interaction of tasks
• Task dependency graph is a subset of the task
  interaction graph
   Sparse matrix-vector multiplication
• Tasks compute entries of output vector
• Task i owns row i and b(i)
• Task i sends non-zero elements of row i to
  other tasks which need them (see the sketch below)
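A serial sketch of this task view, assuming a CSR (compressed sparse row) layout, which the slides do not specify: the entries of b that task i reads are exactly those indexed by the nonzero columns of its row, i.e. the data it must obtain from the tasks that own those entries.

```c
#include <stdio.h>

/* Sparse matrix-vector multiply y = A*b with A in CSR form.
 * Task i computes y[i]; the entries of b it reads are exactly
 * b[col[k]] for k in [rowptr[i], rowptr[i+1]) -- this index set
 * determines which tasks it must communicate with. */
void spmv(int n, const int *rowptr, const int *col,
          const double *val, const double *b, double *y) {
    for (int i = 0; i < n; i++) {            /* task i owns row i */
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            sum += val[k] * b[col[k]];
        y[i] = sum;
    }
}

int main(void) {
    /* 3x3 example: nonzeros (0,0)=1 (0,2)=2 (1,1)=3 (2,0)=4 (2,2)=5 */
    int rowptr[] = {0, 2, 3, 5};
    int col[]    = {0, 2, 1, 0, 2};
    double val[] = {1, 2, 3, 4, 5};
    double b[] = {1, 1, 1}, y[3];
    spmv(3, rowptr, col, val, b, y);
    for (int i = 0; i < 3; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```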
Sparse matrix task interaction graph
               Process Mapping
               Goals and illusions
• Goals
     • Maximize concurrency by mapping independent
       tasks to different processors
     • Minimize completion time by having a process
       ready on the critical path when a task is ready
     • Map processes which communicate a lot to the same processor
• Illusions
     • Can’t do all of the above-they conflict
           Task Decomposition
• Big idea
     • First decompose for message passing
     • Then decompose for the shared memory on each node
• Decomposition Techniques
     •   Recursive
     •   Data
     •   Exploratory
     •   Speculative
    Recursive Decomposition
• Good for problems which are amenable to
  a divide and conquer strategy
• Quicksort - a natural fit
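For reference, a plain serial quicksort (a minimal illustrative version, not the course's code); the comments mark where the recursive decomposition falls out: the two recursive calls touch disjoint parts of the array, giving the task dependency graph on the next slide.

```c
#include <stdio.h>

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* After partitioning around the pivot, the two recursive calls
 * operate on disjoint halves of the array, so each call can be
 * handed to a separate task or process. */
void quicksort(int *a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; j++)
        if (a[j] < pivot) swap(&a[i++], &a[j]);
    swap(&a[i], &a[hi]);
    quicksort(a, lo, i - 1);   /* independent task */
    quicksort(a, i + 1, hi);   /* independent task */
}

int main(void) {
    int a[] = {5, 2, 9, 1, 7, 3};
    quicksort(a, 0, 5);
    for (int i = 0; i < 6; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```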
Quicksort Task Dependency Graph
 Sometimes we force the issue
We re-cast the problem into the divide and
 conquer paradigm
          Data Decomposition
• Idea-partitioning of data leads to tasks
• Can partition
     •   Output data
     •   Input data
     •   Intermediate data
     •   Whatever…
         Partitioning Output Data
Each element of the output is computed
 independently as a function of the input
Other decompositions
  Output data again
Frequency of itemsets
           Partition Input Data
• Sometimes the more natural thing to do
       • Sum of n numbers-there is only one output
•   Divide input into groups
•   One task per group
•   Get intermediate results
•   Create one task to combine the intermediate results (see the sketch below)
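A serial sketch of this recipe with illustrative sizes: P tasks each sum a group of N/P inputs into an intermediate result, and one final task combines them.

```c
#include <stdio.h>

#define N 16
#define P 4   /* number of input partitions / tasks */

int main(void) {
    int x[N], i;
    for (i = 0; i < N; i++) x[i] = i + 1;    /* 1..16, sum = 136 */

    /* One task per group of N/P inputs, each producing an
     * intermediate partial sum ... */
    long partial[P] = {0};
    for (int t = 0; t < P; t++)              /* task t */
        for (i = t * (N/P); i < (t+1) * (N/P); i++)
            partial[t] += x[i];

    /* ... plus one final task that combines the intermediates. */
    long total = 0;
    for (int t = 0; t < P; t++) total += partial[t];
    printf("sum = %ld\n", total);
    return 0;
}
```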
Top: partition input. Bottom: partition input and output.
Partitioning of Intermediate Data

• Good for multi-stage algorithms

• May improve concurrency over a strictly
  input or strictly output partition
Matrix Multiply Again
         Concurrency Picture
• Max concurrency of 8 vs
• Max concurrency of 4 for the output partition
• Price is storage for the intermediate matrix D (see the sketch below)
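A serial sketch of the intermediate decomposition, with each 2×2 block shrunk to a scalar for brevity: stage 1 has eight independent product tasks and stage 2 has four addition tasks, hence the maximum concurrency of 8 vs 4, at the price of storing D.

```c
#include <stdio.h>

int main(void) {
    /* Stage 1: eight independent tasks each compute one
     * intermediate product D[k][i][j] = A[i][k]*B[k][j].
     * Stage 2: four tasks each sum two intermediates into
     * C[i][j].  Max concurrency 8, at the cost of storing D. */
    double A[2][2] = {{1,2},{3,4}}, B[2][2] = {{5,6},{7,8}};
    double D[2][2][2], C[2][2];

    for (int k = 0; k < 2; k++)            /* stage 1: 8 tasks */
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                D[k][i][j] = A[i][k] * B[k][j];

    for (int i = 0; i < 2; i++)            /* stage 2: 4 tasks */
        for (int j = 0; j < 2; j++)
            C[i][j] = D[0][i][j] + D[1][i][j];

    for (int i = 0; i < 2; i++)
        printf("%g %g\n", C[i][0], C[i][1]);
    return 0;
}
```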
   Exploratory Decomposition

• For search space type problems
• Partition search space into small parts
• Look for solution in each part
Search Space Problem
    The 15 puzzle
Parallel vs serial-Is it worth it?
  It depends on where you find the answer
   Speculative Decomposition
• Computation gambles at a branch point in
  the program
• Takes path before it knows result
• Win big or waste
           Speculative Example
            Parallel discrete event simulation
• Idea: Compute results at c,d,e before output from a is known
            Hybrid Decomposition
• Sometimes better to put two ideas together
• Quicksort - Recursion results in O(n)
  tasks, but little concurrency early on.
• First decompose, then recurse (a poem)
Tasks and their interactions influence the choice
 of mapping scheme
          Task Characteristics

Task generation
 Static- know all tasks before the algorithm executes
   • Data decomposition leads to static generation
 Dynamic- tasks created as the algorithm executes
   • Recursive decomposition leads to dynamic generation
   • Quicksort
            Task Characteristics

• Task sizes
     • Uniform, non-uniform
     • Knowledge of task sizes
        – 15 puzzle: don’t know task sizes
        – Matrix multiplication: do know task sizes

• Size of data associated with tasks
     • Big data can cause big communication
            Task interactions
• Tasks share data, synchronization
  information, work
• Static vs dynamic
    • Static-know task interaction graph and when
      interactions happen before execution
       – Parallel matrix multiply
    • Dynamic
       – 15 puzzle problem
              More interactions
• Regular versus irregular
     •   Interaction may have structure which can be used
      •   Regular: image dithering
      •   Irregular: sparse matrix-vector multiplication
      •   Access pattern for b depends on the structure of A
Image dithering
                  Data sharing
• Read only- parallel matrix multiply

• Read-write
  – 15 puzzle
      • Heuristic search: estimate the number of moves to a solution from
        each state
     • Use priority queue to store states to be expanded
     • Priority queue contains shared data
          Task interactions
• One way
  – Read only
• Two way
  – Producer consumer style
  – Read-write (15 puzzle)
     Mapping tasks to processes
Reduce overhead caused by parallel execution

• Reduce communication between processes
• Minimize task idling
   – Need to balance the load
• But these goals can conflict
Balancing load is not always enough to avoid idling
Task dependencies get in the way

Processes 9-12 can’t proceed until 1-8 finish

MORAL: Include task dependency information in mapping
            Mappings can be
• Static-distribute tasks before the algorithm executes
     • Depends on task sizes, size of data, task interactions
     • NP-complete for non-uniform tasks
• Dynamic-distribute tasks while the algorithm executes
     • Easier with shared memory
              Static Mapping

• Data partitioning
  • Results in task decomposition
  • Arrays, graphs common ways to represent data

• Task partitioning
  • Task dependency graph is static
  • Know task sizes
            Array Distribution
• Block distribution
     • Each process gets contiguous entries
     • Good if computation of an array element requires
       nearby elements
     • Load imbalance if different blocks do different
       amounts of work
• Block-cyclic and cyclic distributions are used
  to redress load imbalances (see the sketch below)
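A sketch of the three ownership rules (the helper names are illustrative; the block formula assumes p divides n):

```c
#include <stdio.h>

/* Owner of array index i among p processes, n elements total. */
int block_owner(int i, int n, int p)  { return i / (n / p); }
int cyclic_owner(int i, int p)        { return i % p; }
/* Block-cyclic with block size b: deal blocks out round-robin. */
int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }

int main(void) {
    int n = 16, p = 4, b = 2;
    printf(" i  block cyclic block-cyclic(b=%d)\n", b);
    for (int i = 0; i < n; i++)
        printf("%2d    %d      %d        %d\n", i,
               block_owner(i, n, p), cyclic_owner(i, p),
               block_cyclic_owner(i, b, p));
    return 0;
}
```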
Block distribution of a matrix
Block decomposition of matrix C = A × B
Block Decomposition
  Higher dimension partitions

• More concurrency
    • Up to n² processes for a 2D mapping vs n
      processes for a 1D mapping
• Reduces the amount of interaction between processes
    • 1D: computing a block of C requires all of B for each product
    • 2D: computing a block of C requires only part of B
            Graph Partitioning
• Array algorithms good for dense matrices,
  structured interaction patterns
• Many algorithms
     • operate on sparse data structures
     • Interaction of data elements is irregular and data dependent
• Numerical simulations of physical phenomena
     • Are important
     • Have these characteristics
     • Use mesh-each point represents something
Lake Superior Mesh
Random distribution
            Balance the Load
Equalize number of edges crossing partitions
         Task Partitioning

• Map the task dependency graph onto processes
• The optimal mapping problem is NP-complete
• Different choices for mapping
  Binary Tree task dependency graph
• Happens for recursive algorithms-e.g., compute the minimum
  of a list of numbers
• Map onto a hypercube of processes (see the sketch below)
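A sketch of this mapping in MPI: in step d each surviving process pairs with the rank differing only in bit d, which is exactly a hypercube edge, so the binary reduction tree embeds in the hypercube. (In production, MPI_Reduce with MPI_MIN does the same job.)

```c
#include <stdio.h>
#include <mpi.h>

/* Minimum of one value per process via a hypercube reduction:
 * in step d, each process whose bit d is set sends its value to
 * the partner rank^d (a hypercube neighbor) and drops out; the
 * partner folds the value in and continues.  The survivor is
 * rank 0, the root of the binary task tree. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int v = 100 - rank;                      /* toy local value */
    for (int d = 1; d < p; d <<= 1) {
        if (rank & d) {                      /* send and drop out */
            MPI_Send(&v, 1, MPI_INT, rank ^ d, 0, MPI_COMM_WORLD);
            break;
        } else if ((rank | d) < p) {         /* receive and fold in */
            int w;
            MPI_Recv(&w, 1, MPI_INT, rank ^ d, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (w < v) v = w;
        }
    }
    if (rank == 0) printf("min = %d\n", v);
    MPI_Finalize();
    return 0;
}
```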
Naïve task mapping
    Better mapping
C’s contain fewer elements
                    Hierarchical Mapping

• Load imbalance can occur when mapping purely by task dependency
  graph (binary tree); Quicksort benefits from a hierarchical mapping
       Hierarchical Mapping
• Sparse matrix factorization
• High levels guided by task dependency
  graph, called the elimination graph
• Low level tasks use data decomposition
  because computation happens later
          Dynamic Mapping
• Why? Dynamic task dependency graph
• Two flavors
  – Centralized: tasks are kept in a central data
    structure or are looked after by one process
  – Distributed: processes exchange tasks at run time
• Example: sort each row of an array with quicksort
• Problem: each row can take a different amount of time to sort
• Solution: self-scheduling-maintain a list of unsorted rows;
  an idle process picks the next row from the list
• Problem: the work queue becomes a bottleneck
• Solution: chunk scheduling-assign multiple rows to a
  process at a time (see the sketch below)
• Problem: if the chunk size is too large, load imbalance returns
• Solution: decrease the chunk size as the computation proceeds
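A pthreads sketch of chunk self-scheduling for the row-sorting example; the shared atomic counter stands in for the work queue, and the constants are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>

#define ROWS 100
#define COLS 64
#define THREADS 4
#define CHUNK 4   /* rows claimed per visit to the queue */

static int data[ROWS][COLS];
static atomic_int next_row = 0;   /* the shared "work queue" */

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

/* Self-scheduling: each idle thread grabs the next CHUNK unsorted
 * rows.  A larger CHUNK cuts contention on next_row but risks load
 * imbalance near the end; shrinking the chunk as work runs out is
 * the usual refinement. */
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        int row = atomic_fetch_add(&next_row, CHUNK);
        if (row >= ROWS) break;
        int end = row + CHUNK < ROWS ? row + CHUNK : ROWS;
        for (int r = row; r < end; r++)
            qsort(data[r], COLS, sizeof(int), cmp);
    }
    return NULL;
}

int main(void) {
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            data[r][c] = rand() % 1000;

    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    printf("row 0: first %d, last %d\n", data[0][0], data[0][COLS-1]);
    return 0;
}
```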
          Distributed Schemes
• The Four Questions
  –   How do I measure the load on a task?
  –   To whom do I send?
  –   How much do I send?
  –   When do I send?
• Can tolerate smaller granularity on shared
  memory than on distributed memory machines.
    Tricks to reduce overhead of
         process interaction
• Maximize data locality
     • Minimize use of nonlocal data, minimize frequency
       of access, maximize reuse of recently accessed data

• Minimize volume of data exchange
     • Use mapping scheme, e.g. 2 dimensional mapping
       vs 1 dimensional mapping
     • Use local data to store intermediate results, and
       access shared data once, e.g. break the dot product of
       two vectors into p partial sums (see the sketch below)
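A sketch of the dot-product trick in MPI, assuming each process already owns its slice of both vectors: each process accumulates locally and touches the shared result exactly once, in the reduction.

```c
#include <stdio.h>
#include <mpi.h>

#define N_LOCAL 1000   /* elements owned by each process */

/* Each process computes a partial dot product over its own slice,
 * then the p partial sums are combined in one reduction -- rather
 * than every process accumulating element-by-element into a single
 * shared sum. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double x[N_LOCAL], y[N_LOCAL], local = 0.0, global;
    for (int i = 0; i < N_LOCAL; i++) { x[i] = 1.0; y[i] = 2.0; }
    for (int i = 0; i < N_LOCAL; i++) local += x[i] * y[i];

    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0) printf("dot = %g\n", global);
    MPI_Finalize();
    return 0;
}
```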
  Minimize frequency of interactions
• High startup cost associated with each interaction, so try to
  access large amounts of data per interaction
• Try for spatial locality-keep memory which is accessed
  consecutively close together
• Pack lots of data into each message in a message-passing paradigm.
• Reduce number of cache lines fetched from shared memory.
• Example-repeated sparse matrix-vector multiplication: same
  matrix, but different data. Each process gets the b entries it
  needs from other processes prior to each multiplication.
              Hot spots
• Hot spots happen-processes transmit
  over the same link, access the same data
• Sometimes we can re-arrange the
  computation to avoid the hot spot
              Example: C = A × B
• C_{i,j} = ∑_{k=0}^{n-1} A_{i,k} B_{k,j}
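A serial sketch of the re-arrangement for a single output entry (sizes are illustrative): rather than all tasks adding their terms of the k-sum into the one shared location C[i][j], each task accumulates a private partial sum, and the shared location is updated once per task.

```c
#include <stdio.h>

#define N 8
#define TASKS 4

int main(void) {
    double A[N][N], B[N][N];
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) { A[r][c] = 1.0; B[r][c] = 1.0; }

    int i = 0, j = 0;   /* one output entry C[i][j] as an example */

    /* Hot-spot version (not shown): all TASKS tasks add their terms
     * of the k-sum directly into the single shared C[i][j], so every
     * update contends for the same location or link. */

    /* Re-arranged version: each task accumulates a private partial
     * sum over its share of k; the shared result is touched only
     * once per task, in the final combine. */
    double partial[TASKS] = {0};
    for (int t = 0; t < TASKS; t++)          /* task t */
        for (int k = t * (N/TASKS); k < (t+1) * (N/TASKS); k++)
            partial[t] += A[i][k] * B[k][j];

    double Cij = 0;
    for (int t = 0; t < TASKS; t++) Cij += partial[t];
    printf("C[%d][%d] = %g\n", i, j, Cij);
    return 0;
}
```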
Overlapping Computations with Interactions
• Try to do interaction before computation
  (static interaction pattern helps)
• Multiple tasks on same process. If one
  blocks, another can execute
• Need support from OS, hardware,
  programming paradigm
     • Disjoint memory, message passing architectures
     • Shared address space-prefetching hardware
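A two-process MPI sketch of the overlap using nonblocking receives (the chunk count and sizes are illustrative): rank 1 posts the receive for the next chunk before computing on the current one, so communication and computation proceed together.

```c
#include <stdio.h>
#include <mpi.h>

#define N 1024
#define STEPS 4

/* Rank 0 streams STEPS chunks to rank 1.  Rank 1 posts the receive
 * for chunk s+1 *before* computing on chunk s, so the transfer is
 * in flight while the computation runs.  Run with 2 processes. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double bufA[N], bufB[N];
    double *cur = bufA, *nxt = bufB;

    if (rank == 0) {
        for (int i = 0; i < N; i++) cur[i] = i;
        for (int s = 0; s < STEPS; s++)
            MPI_Send(cur, N, MPI_DOUBLE, 1, s, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double total = 0;
        MPI_Recv(cur, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        for (int s = 1; s < STEPS; s++) {
            MPI_Request req;
            MPI_Irecv(nxt, N, MPI_DOUBLE, 0, s, MPI_COMM_WORLD, &req);
            for (int i = 0; i < N; i++) total += cur[i];  /* compute */
            MPI_Wait(&req, MPI_STATUS_IGNORE);  /* nxt has arrived */
            double *tmp = cur; cur = nxt; nxt = tmp;
        }
        for (int i = 0; i < N; i++) total += cur[i];  /* last chunk */
        printf("rank 1 total = %g\n", total);
    }
    MPI_Finalize();
    return 0;
}
```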
     Tricks-Replicating Data and/or Computation
• Replicate in each process
    • Frequent read-only operations can make replication worthwhile
    • Mostly for distributed memory machines-shared
      memory machines have caches
   Optimize Heavy Duty Operations

• Operations
     • Access data
     • Communications intensive computations
     • Synchronization
• Algorithms and libraries exist
      • Algorithms-discussed soon
      • Libraries-MPI
Tricks-overlapping interactions
         Parallel Algorithm Models
   Recipes for decomposing, mapping and minimizing interactions
• Data parallel
     •   Static mapping of tasks to processes
     •   Each task does the same thing to different data
     •   Phases - computation followed by synchronization
      •   Message passing architecture more amenable to
          this style than shared memory architecture
• Task Graph Model
    • Used when the amount of data is large relative to the
      computation on the data
    • Used with
       – divide and conquer algorithms
       – Parallel quicksort
       – Sparse matrix factorization
• Work pool model
  – Any task can be executed by any process
  – Dynamic mapping of tasks to processes
  – Examples
    • Parallelization of loops by chunk scheduling
    • Parallel tree search
• Master-slave model
     • Dictator gives work to students
     • Hierarchical master-slave model
• Pipeline
     •   Stream of data passed through processes
     •   Producers followed by consumers
     •   General graph, not just a linear array
     •   Example-Parallel LU factorization (later)
• Hybrid-combine several of the above models, hierarchically or sequentially