# Principles of Parallel Algorithm Design - PowerPoint

Carl Tropper
Department of Computer Science
## What Has to Be Done
• Identify concurrency in the program
• Map concurrent pieces to parallel processes
• Distribute input, output, and intermediate data
• Manage accesses to data shared by multiple processors
• Synchronize processors as the program executes
## Vocabulary

## Matrix-Vector Multiplication

## Database Query
• Example query: Model = civic AND Year = 2001 AND (Color = green OR Color = white)
## Data Dependencies

## Another Graph
• Fine grained vs. coarse grained decomposition
• Degree of concurrency
  – Average degree: the average number of tasks that can run in parallel
  – Maximum degree: the largest number of tasks that can run in parallel at any point
• Critical path
  – Length: the sum of the weights of the nodes on the path
  – Average degree of concurrency = total work / critical-path length
• In the task-interaction graph, edges indicate interactions between tasks
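These definitions can be checked on a small example. The sketch below is illustrative code, not from the slides; the four-task graph and its weights are made up:

```python
# Sketch: critical-path length and average degree of concurrency for a small,
# hypothetical task-dependency graph.  Nodes are tasks, deps[v] lists the tasks
# v depends on, and weight[v] is the cost of task v.

def critical_path_length(deps, weight):
    """Longest weighted path ending at each node; the graph must be acyclic."""
    memo = {}
    def longest(v):
        if v not in memo:
            memo[v] = weight[v] + max((longest(u) for u in deps[v]), default=0)
        return memo[v]
    return max(longest(v) for v in deps)

# Hypothetical 4-task graph: t1 and t2 are independent, t3 needs t1, t4 needs t2 and t3.
deps   = {"t1": [], "t2": [], "t3": ["t1"], "t4": ["t2", "t3"]}
weight = {"t1": 10, "t2": 5, "t3": 10, "t4": 5}

total_work = sum(weight.values())                # 30 units of work in total
cp         = critical_path_length(deps, weight)  # t1 -> t3 -> t4 = 25
avg_degree = total_work / cp                     # 30 / 25 = 1.2
print(cp, avg_degree)
```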
## Sparse Matrix-Vector Multiplication
• Tasks compute entries of the output vector
• Task i owns row i of the matrix and element b(i)
• Task i needs b(j) from task j for every non-zero element A(i, j) in its row
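A minimal sketch of the row-wise task decomposition, assuming a made-up 4x4 sparse matrix stored as one {column: value} dict per row (illustrative code, not from the slides):

```python
# Sketch: row-wise tasks for y = A*b with a sparse A.  Task i owns row i and
# b[i]; in a message-passing version it would first fetch b[j] from task j for
# every non-zero A[i][j] with j != i, then compute its output entry.

rows = [                 # hypothetical 4x4 sparse matrix, one dict per row
    {0: 2.0, 2: 1.0},
    {1: 3.0},
    {0: 1.0, 3: 4.0},
    {2: 5.0},
]
b = [1.0, 2.0, 3.0, 4.0]

def task(i):
    # the b[j] values used here model data received from the other tasks
    return sum(a_ij * b[j] for j, a_ij in rows[i].items())

y = [task(i) for i in range(4)]
print(y)  # [5.0, 6.0, 17.0, 15.0]
```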
## Process Mapping: Goals and Illusions
• Goals
  – Maximize concurrency by mapping independent tasks to different processes
  – Minimize completion time by keeping processes on the critical path busy
  – Map processes that communicate a lot to the same processor
• Illusions
  – Can't do all of the above; the goals conflict
• Big idea
  – First decompose for message passing across nodes
  – Then decompose for the shared memory on each node
## Decomposition Techniques
• Recursive
• Data
• Exploratory
• Speculative
## Recursive Decomposition
• Good for problems amenable to a divide-and-conquer strategy
• Quicksort is a natural fit
• Sometimes we force the issue and re-cast the problem into divide-and-conquer form
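A sketch of recursive decomposition for quicksort, using a Python thread pool as a stand-in for parallel processes (illustrative only; the slides do not prescribe an implementation). Each partition step yields two independent sub-sorts:

```python
# Sketch: after each partition step, the two sub-sorts are independent tasks
# and can run in parallel.  Here each call submits them to its own small pool.

from concurrent.futures import ThreadPoolExecutor

def quicksort(a):
    if len(a) <= 1:
        return a
    pivot = a[0]
    lo = [x for x in a[1:] if x < pivot]
    hi = [x for x in a[1:] if x >= pivot]
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_lo = pool.submit(quicksort, lo)   # independent task
        f_hi = pool.submit(quicksort, hi)   # independent task
        return f_lo.result() + [pivot] + f_hi.result()

print(quicksort([5, 2, 8, 1, 9, 2]))  # [1, 2, 2, 5, 8, 9]
```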
## Data Decomposition
• Can partition:
  – Output data
  – Input data
  – Intermediate data
  – Any combination of the above
## Partitioning Output Data
Each element of the output is computed independently as a function of the input.

## Other Decompositions

## Output Data Again: Frequency of Itemsets
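A sketch of the itemset-frequency example under an output partition, with made-up transactions: each count in the output depends only on the read-only input, so each task can compute its share of the counts independently:

```python
# Sketch: the output is one count per itemset, so each task takes a disjoint
# subset of the itemsets and scans the whole transaction database.

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
itemsets = [{"a"}, {"b"}, {"c"}, {"a", "b"}]

def count_task(my_itemsets):
    # each task computes its share of the output independently
    return {frozenset(s): sum(s <= t for t in transactions) for s in my_itemsets}

# two tasks, each owning half of the output:
part1 = count_task(itemsets[:2])
part2 = count_task(itemsets[2:])
counts = {**part1, **part2}
print(counts[frozenset({"a", "b"})])  # 2
```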
## Partitioning Input Data
• Sometimes the more natural thing to do
• Sum of n numbers: there is only one output, so partition the input
  – Divide the input into groups
  – Compute an intermediate result for each group
  – Create one task to combine the intermediate results
• Top: partition of the input; bottom: partition of both input and output
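The sum-of-n-numbers recipe above can be sketched as follows (plain sequential Python standing in for parallel tasks; data and task count are illustrative):

```python
# Sketch: input partition for summing n numbers.  Each chunk is an independent
# task that produces an intermediate partial sum; one final task combines them.

data = list(range(100))   # the input
p = 4                     # number of tasks

# divide the input into p contiguous groups
chunks = [data[q * len(data) // p : (q + 1) * len(data) // p] for q in range(p)]

partials = [sum(c) for c in chunks]   # p independent tasks, one partial sum each
total = sum(partials)                 # one task combines the intermediate results
print(total)  # 4950
```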
## Partitioning Intermediate Data
• Good for multi-stage algorithms
• May improve concurrency over a strictly input or strictly output partition

## Matrix Multiplication Again

## Concurrency Picture
• Maximum concurrency of 8, vs. a maximum concurrency of 4 for the output partition
• The price is storage for the intermediate result D
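A sketch of the intermediate-data partition for a 2x2 matrix multiply (illustrative code, not from the slides): stage one has 8 independent multiply tasks producing D, stage two has 4 add tasks, so the maximum concurrency is 8 at the cost of storing D:

```python
# Sketch: intermediate-data partition of C = A x B for 2x2 matrices.
# Stage 1: eight independent multiply tasks, D[k][i][j] = A[i][k] * B[k][j].
# Stage 2: four independent add tasks, C[i][j] = D[0][i][j] + D[1][i][j].

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# stage 1: eight independent multiply tasks (max concurrency 8)
D = [[[A[i][k] * B[k][j] for j in range(2)] for i in range(2)] for k in range(2)]

# stage 2: four independent add tasks produce C
C = [[D[0][i][j] + D[1][i][j] for j in range(2)] for i in range(2)]
print(C)  # [[19, 22], [43, 50]]
```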
## Exploratory Decomposition
• For search-space problems
• Partition the search space into smaller parts
• Look for a solution in each part

## Search-Space Problem: The 15-Puzzle

## Decomposition
• Parallel vs. serial: is it worth it? It depends on where you find the answer; the parallel search may do less total work than the serial one, or much more
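A toy sketch of exploratory decomposition (a linear search standing in for the 15-puzzle; names and data are made up): the search space is partitioned among tasks, and the part containing the answer yields the solution:

```python
# Sketch: partition a search space among tasks and report the solution found.

from concurrent.futures import ThreadPoolExecutor

def search(part, goal):
    for state in part:
        if state == goal:
            return state
    return None            # this part of the space has no solution

space = list(range(1000))              # toy "search space"
goal = 737
parts = [space[i::4] for i in range(4)]  # partition the search space

with ThreadPoolExecutor(max_workers=4) as pool:
    results = pool.map(search, parts, [goal] * 4)
found = next(r for r in results if r is not None)
print(found)  # 737
```

How much total work the parallel version does relative to the serial one depends on which part the answer falls in, which is the slides' point.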
## Speculative Decomposition
• The computation gambles at a branch point in the program
• It takes a path before the branch result is known
• Win big, or waste the work

## Speculative Example: Parallel Discrete-Event Simulation
• Idea: compute results at c, d, and e before the output from a is known
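A sketch of speculation at a branch point (illustrative names, not the slides' simulation example): both outcomes are computed before the condition is known, and the result on the untaken path is wasted:

```python
# Sketch: speculative decomposition.  Both branch bodies run concurrently with
# the (slow) branch condition; whichever path is not taken is wasted work.

from concurrent.futures import ThreadPoolExecutor

def slow_condition():        # stand-in for "the output from a"
    return sum(range(10**6)) % 2 == 1

def heavy_then():
    return "then-path result"

def heavy_else():
    return "else-path result"

with ThreadPoolExecutor(max_workers=3) as pool:
    f_then = pool.submit(heavy_then)   # speculate on both paths...
    f_else = pool.submit(heavy_else)
    cond = pool.submit(slow_condition).result()

result = f_then.result() if cond else f_else.result()  # one result is discarded
print(result)
```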
## Hybrid Decomposition
• Sometimes it is better to put two ideas together
• Quicksort: recursive decomposition alone leaves the initial O(n) split serial
• First decompose the data, then recurse (a poem)
## Mapping
Tasks and their interactions influence the choice of mapping scheme.

## Task Generation
• Static: all tasks are known before the algorithm executes
  – Data decomposition leads to static generation
• Dynamic: tasks are generated at runtime
  – Recursive decomposition leads to dynamic generation (e.g., quicksort)

## Task Sizes
• Uniform vs. non-uniform
  – 15-puzzle: we don't know the task sizes in advance
  – Matrix multiplication: we do know the task sizes
• Size of the data associated with tasks
  – Big data can mean big costs to communicate information or move work

## Task Interactions
• Static: the task-interaction graph, and when interactions happen, are known before execution
  – Parallel matrix multiply
• Dynamic: interactions are not known in advance
  – The 15-puzzle
## More Interactions
• Regular vs. irregular interactions
  – Regular interactions have structure that can be exploited (example: image dithering)
  – Irregular: sparse matrix-vector multiplication, where the access pattern for b depends on the structure of A

## Image Dithering
## Data Sharing
• Read-only: parallel matrix multiply
• Read-write: the 15-puzzle
  – Heuristic search: estimate the number of moves to a solution from each state
  – Use a priority queue to store the states to be expanded; the priority queue is shared data
• One-way interactions: only one of the tasks initiates and completes the exchange
• Two-way interactions
  – Producer-consumer style
## Goal
Reduce the overhead caused by parallel execution.

So:
• Reduce communication between processes
• Balance the load
• But these goals can conflict

Balancing the load is not always enough to avoid idling; task dependencies get in the way. In the example, processes 9-12 can't proceed until 1-8 finish.

MORAL: include task-dependency information in the mapping.
## Mappings Can Be
• Static: determined before the program executes, using knowledge of task sizes and interactions
  – Finding an optimal mapping is NP-complete for non-uniform tasks
• Dynamic: determined during execution
  – Easier with shared memory
## Static Mapping
• Mappings based on data partitioning
  – Arrays and graphs are common ways to represent the data
• Mappings based on the task-dependency graph, when it is static
## Array Distribution
• Block distribution: each process gets contiguous entries
  – Good if computing an array element requires nearby elements
  – Load imbalance if different blocks do different amounts of work
  – Block-cyclic and cyclic distributions are used to counter this
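The three distributions can be sketched as index sets, assuming n = 16 array entries, p = 4 processes, and a block size of 2 for the block-cyclic case (parameters are illustrative):

```python
# Sketch: which array indices each process q owns under block, cyclic, and
# block-cyclic distributions of n entries over p processes.

n, p, blk = 16, 4, 2

block  = {q: list(range(q * n // p, (q + 1) * n // p)) for q in range(p)}
cyclic = {q: list(range(q, n, p)) for q in range(p)}
block_cyclic = {q: [i for i in range(n) if (i // blk) % p == q] for q in range(p)}

print(block[1])         # [4, 5, 6, 7]        contiguous entries
print(cyclic[1])        # [1, 5, 9, 13]       entries spread out singly
print(block_cyclic[1])  # [2, 3, 10, 11]      small blocks spread out
```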
## Block Distribution of a Matrix

## Block Decomposition of C = A × B
## Higher-Dimensional Partitions
• More concurrency: up to n² processes for a 2D mapping, vs. n processes for a 1D mapping
• Reduces the amount of interaction between processes
  – With a 1D partition of C, each process requires all of B for its products
  – With a 2D partition of C, each process requires only part of B
## Graph Partitioning
• Array-based schemes are good for dense matrices and structured interaction patterns
• Many algorithms operate on sparse data structures, where the interaction between data elements is irregular and data-dependent
• Numerical simulations of physical phenomena are important, have these characteristics, and use a mesh in which each point represents something physical

## Lake Superior Mesh
• A random distribution of the mesh points causes heavy interaction; instead, partition so as to equalize the number of edges crossing partitions

## Mapping the Task-Dependency Graph
• Map the task-dependency graph onto processes
• The optimal mapping problem is NP-complete
• Different choices of mapping arise in recursive algorithms, e.g., computing the minimum of a list of numbers
• Map onto a hypercube of processes

## Better Mapping
The C's contain fewer elements.
## Hierarchical Mapping
• Combine task mapping at the top of the task tree with data mapping within each task; quicksort benefits

## Hierarchical Mapping
• Sparse matrix factorization
  – High levels are guided by the task-dependency graph, called the elimination graph
  – Low-level tasks use data decomposition, because the computation happens later
## Dynamic Mapping
• Why? The task-dependency graph is dynamic
• Two flavors
  – Centralized: tasks are kept in a central data structure, or are looked after by one process
  – Distributed: processes exchange tasks at run time
## Centralized Schemes
• Example: sort each row of an array by quicksort
• Problem: each row can take a different amount of time to sort
  – Solution: self-scheduling. Maintain a list of unsorted rows; an idle process picks the next row from the list
• Problem: the work queue becomes a bottleneck
  – Solution: chunk scheduling. Assign multiple tasks to a process at a time
• Problem: if the chunk size is too large, load imbalance returns
  – Solution: decrease the chunk size as the computation proceeds
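The problem/solution chain above can be sketched as a shared work list with shrinking chunks, in the style of guided self-scheduling (illustrative Python; `process` squares a number as a stand-in for sorting a row):

```python
# Sketch: centralized dynamic mapping with shrinking chunks.  An idle process
# grabs ceil(remaining / P) tasks from the shared list, so chunks start large
# (few queue accesses) and shrink toward the end (better load balance).

import math
import threading

tasks = list(range(100))     # e.g. indices of rows still to be sorted
done = []
lock = threading.Lock()
P = 4

def process(t):
    return t * t             # stand-in for "sort row t"

def worker():
    while True:
        with lock:
            if not tasks:
                return
            k = math.ceil(len(tasks) / P)          # chunk size shrinks over time
            chunk, tasks[:] = tasks[:k], tasks[k:]
        results = [process(t) for t in chunk]      # do the work outside the lock
        with lock:
            done.extend(results)

threads = [threading.Thread(target=worker) for _ in range(P)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(len(done))  # 100
```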
## Distributed Schemes
• The four questions:
  – To whom do I send work?
  – How much do I send?
  – When do I send?
  – Who initiates the transfer, sender or receiver?
• Can tolerate smaller task granularity on shared-memory than on distributed-memory machines
## Minimizing Process Interaction
• Maximize data locality
  – Minimize the use of non-local data, minimize the frequency of access, and maximize reuse of recently accessed data
• Minimize the volume of data exchanged
  – Use a good mapping scheme, e.g., a 2-dimensional mapping vs. a 1-dimensional mapping
  – Use local data to store intermediate results, and access shared data once, e.g., break the dot product of two vectors into p partial sums
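The dot-product suggestion can be sketched directly (illustrative code, not from the slides): each of p tasks accumulates into a local variable and touches the shared partial-sum list only once:

```python
# Sketch: break a dot product into p partial sums.  Each task keeps its
# intermediate result in local data and performs one shared access at the end.

def dot(x, y, p=4):
    n = len(x)
    partials = []                              # the shared data
    for q in range(p):                         # one loop body per task
        local = 0.0                            # local intermediate result
        for i in range(q * n // p, (q + 1) * n // p):
            local += x[i] * y[i]
        partials.append(local)                 # one shared access per task
    return sum(partials)

x = [1.0] * 8
y = [2.0] * 8
print(dot(x, y))  # 16.0
```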
## Minimize the Frequency of Interactions
• There is a high startup cost for each interaction, so try to access large amounts of data per interaction
• Aim for spatial locality: keep memory that is accessed consecutively close together
• In a message-passing setting, pack lots of data into each message
• In shared memory, reduce the number of cache lines fetched
• Example: repeated sparse matrix-vector multiplication with the same matrix but different data; each process gets what it needs from the other processes once, prior to the multiplications
## Hot Spots
• Hot spots happen: processes transmit over the same link, or access the same data
• Sometimes the computation can be re-arranged to avoid the hot spot
• Example: C = A × B with C(i, j) = Σ_{k=0..n-1} A(i, k) · B(k, j); all tasks sweep k in the same order, so they hit the same elements of A and B at the same time
## Overlapping Computations with Interactions
• Try to initiate an interaction ahead of the computation that needs it (a static interaction pattern helps)
• Run multiple tasks on the same process: if one blocks, another can execute
• Needs support from the OS and hardware, especially on disjoint-memory, message-passing architectures
## Tricks: Replicating Data and/or Computation
• Replicate the data in each process
• Frequent read-only access can make it worthwhile
• Mostly for distributed-memory machines; shared-memory machines already have caches
## Optimize Heavy-Duty Operations
• Operations include accessing data, communication-intensive computations, and synchronization
• Optimized algorithms and libraries exist
  – Algorithms: discussed soon
  – Libraries: MPI

## Tricks: Overlapping Interactions
## Parallel Algorithm Models
Recipes for decomposing, mapping, and minimizing interaction.
• Data-parallel model
  – Static mapping of tasks to processes
  – Each task does the same thing to different data
  – Phases: computation followed by synchronization
  – A message-passing architecture is more amenable to this style than a shared-memory architecture
## Recipes
• Task-graph model
  – Used when the amount of data is large relative to the computation on the data
  – Used with divide-and-conquer algorithms, parallel quicksort, and sparse matrix factorization
## Recipes
• Work-pool model
  – Any task can be executed by any process
  – Dynamic mapping of tasks to processes
  – Examples: parallelization of loops by chunk scheduling; parallel tree search
## Recipes
• Master-slave model
  – A master process hands out work ("dictator gives work to students")
  – Hierarchical master-slave model
• Pipeline model
  – A stream of data is passed through a succession of processes
  – Producers followed by consumers
  – Can be a general graph, not just a linear array
  – Example: parallel LU factorization (later)
• Hybrid: combinations of the above models
