Principles of Parallel Algorithm Design
Carl Tropper, Department of Computer Science

What has to be done
• Identify concurrency in the program
• Map concurrent pieces to parallel processes
• Distribute input, output and intermediate data
• Manage accesses to shared data by processors
• Synchronize processors as the program executes

Vocabulary
• Tasks
• Task dependency graph
• Examples: matrix-vector multiplication; a database query such as
  model = civic and year = 2001 and (color = green or color = white)
• Data dependencies between tasks form another graph

Task talk
• Task granularity: fine grained vs coarse grained
• Degree of concurrency
  – Average degree: the average number of tasks which can run in parallel
  – Maximum degree
• Critical path
  – Length: sum of the weights of the nodes on the path
  – Average degree of concurrency = total work / critical path length

Task interaction graph
• Nodes are tasks
• Edges indicate interactions between tasks
• The task dependency graph is a subset of the task interaction graph

Sparse matrix-vector multiplication
• Tasks compute entries of the output vector
• Task i owns row i and b(i)
• Task i sends the non-zero elements of row i to the other tasks which need them
• The non-zero structure of the matrix determines the task interaction graph

Process mapping: goals and illusions
• Goals
  – Maximize concurrency by mapping independent tasks to different processors
  – Minimize completion time by having a process ready to run each critical-path task as soon as it becomes available
  – Map processes which communicate a lot to the same processor
• Illusions
  – Can't do all of the above: they conflict

Task Decomposition
• Big idea
  – First decompose for message passing
  – Then decompose for the shared memory on each node
• Decomposition techniques
  – Recursive
  – Data
  – Exploratory
  – Speculative

Recursive Decomposition
• Good for problems which are amenable to a divide-and-conquer strategy
• Quicksort is a natural fit: its task dependency graph is the recursion tree
• Sometimes we force the issue and re-cast a problem into the divide-and-conquer paradigm

Data Decomposition
• Idea: partitioning the data leads to the tasks
• Can partition
  – Output data
  – Input data
  – Intermediate data
  – Whatever works

Partitioning output data
• Each element of the output is computed independently as a function of the input
• Other output-data decompositions: frequency of itemsets

Partitioning input data
• Sometimes the more natural thing to do
• Sum of n numbers: there is only one output
  – Divide the input into groups, one task per group
  – Each task produces an intermediate result
  – Create one task to combine the intermediate results
• Top: partition the input only; bottom: partition input and output

Partitioning intermediate data
• Good for multi-stage algorithms
• May improve concurrency over a strictly input or strictly output partition
• Matrix multiply again: maximum concurrency of 8 vs 4 for the output partition
• The price is storage for the intermediate matrix D

Exploratory Decomposition
• For search-space problems: partition the search space into small parts and look for a solution in each part
• Example: the 15-puzzle
• Parallel vs serial: is it worth it? It depends on where you find the answer

Speculative Decomposition
• The computation gambles at a branch point in the program
• It takes a path before it knows the result: win big or waste the work
• Example: parallel discrete event simulation
  – Idea: compute the results at c, d, e before the output from a is known

Hybrid Decomposition
• Sometimes it is better to put two ideas together
• Quicksort: recursion alone results in O(n) tasks but little concurrency
• First decompose the data, then recurse (a poem); a task-based sketch of the recursive step follows below
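To make the recursive-decomposition picture concrete, here is a minimal sketch, not from the slides, of task-parallel quicksort using OpenMP tasks: each recursive call on a sub-array becomes a node in the task dependency graph. The CUTOFF threshold and the Lomuto-style partition are illustrative choices.

    /* Recursive decomposition of quicksort with OpenMP tasks. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define CUTOFF 1000            /* below this size, recurse serially */

    static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

    /* Lomuto partition: returns the final position of the pivot. */
    static int partition(int *a, int lo, int hi) {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) swap(&a[i++], &a[j]);
        swap(&a[i], &a[hi]);
        return i;
    }

    static void quicksort(int *a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        /* Each half is an independent node in the task dependency
         * graph; halves below CUTOFF are sorted serially. */
        #pragma omp task if (hi - lo > CUTOFF)
        quicksort(a, lo, p - 1);
        #pragma omp task if (hi - lo > CUTOFF)
        quicksort(a, p + 1, hi);
        #pragma omp taskwait
    }

    int main(void) {
        int n = 1 << 20;
        int *a = malloc((size_t)n * sizeof *a);
        for (int i = 0; i < n; i++) a[i] = rand();
        #pragma omp parallel    /* create the team of threads        */
        #pragma omp single      /* one thread starts the recursion   */
        quicksort(a, 0, n - 1);
        printf("a[0]=%d  a[n-1]=%d\n", a[0], a[n - 1]);
        free(a);
        return 0;
    }

The if clause keeps small sub-arrays inside the current task, which is one simple way to control task granularity.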
Mapping
• The tasks and their interactions influence the choice of mapping scheme

Task Characteristics
• Task generation
  – Static: all tasks are known before the algorithm executes
    • Data decomposition leads to static generation
  – Dynamic: tasks are generated at runtime
    • Recursive decomposition leads to dynamic generation (e.g. quicksort)
• Task sizes
  – Uniform or non-uniform
  – Knowledge of task sizes: unknown for the 15-puzzle, known for matrix multiplication
• Size of data associated with tasks
  – Big data can cause big communication

Task interactions
• Tasks share data, synchronization information, and work
• Static vs dynamic
  – Static: the task interaction graph and the times of the interactions are known before execution (parallel matrix multiply)
  – Dynamic: the 15-puzzle problem
• Regular versus irregular
  – A regular interaction pattern has structure which can be exploited (image dithering)
  – Irregular: sparse matrix-vector multiplication, where the access pattern for b depends on the structure of A
• Data sharing
  – Read only: parallel matrix multiply
  – Read-write: the 15-puzzle
    • Heuristic search estimates the number of moves to a solution from each state
    • A priority queue stores the states to be expanded; the priority queue is the shared data
• One way: read only
• Two way: producer-consumer style, read-write (15-puzzle)

Mapping tasks to processes
• Goal: reduce the overhead caused by parallel execution, so
  – Reduce communication between processes
  – Minimize task idling: need to balance the load
  – But these goals can conflict
• Balancing the load is not always enough to avoid idling
  – Task dependencies get in the way: processes 9-12 can't proceed until 1-8 finish
  – Moral: include task dependency information in the mapping

Mappings can be
• Static: distribute tasks before the algorithm executes
  – Depends on task size, size of data, task interactions
  – NP-complete for non-uniform tasks
• Dynamic: distribute tasks during algorithm execution
  – Easier with shared memory

Static Mapping
• Data partitioning
  – Results in a task decomposition
  – Arrays and graphs are common ways to represent the data
• Task partitioning
  – The task dependency graph is static and the task sizes are known

Array Distribution
• Block distribution
  – Each process gets contiguous entries
  – Good if computing an array element requires nearby elements
  – Load imbalance if different blocks do different amounts of work
• Block-cyclic and cyclic distributions are used to redress load imbalances (ownership functions for all three are sketched after this section)
• Block distribution of a matrix; block decomposition for C = A x B

Higher dimension partitions
• More concurrency: up to n^2 processes for a 2D mapping vs n processes for a 1D mapping
• Reduces the amount of interaction between processes
  – 1D: each block of C requires all of B
  – 2D: each block of C requires only part of B

Graph Partitioning
• Array algorithms are good for dense matrices and structured interaction patterns
• Many algorithms operate on sparse data structures, where the interaction of data elements is irregular and data dependent
• Numerical simulations of physical phenomena are important, have these characteristics, and use a mesh in which each point represents something physical
• Lake Superior mesh: a random distribution vs balancing the load by equalizing the number of edges crossing partitions

Task Partitioning
• Map the task dependency graph onto processes
• The optimal mapping problem is NP-complete
• There are different choices for the mapping
• Binary tree task dependency graph
  – Arises for recursive algorithms, e.g. computing the minimum of a list of numbers
  – Map onto a hypercube of processes
  – A naïve task mapping vs a better mapping in which the C's contain fewer elements
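As an illustration of the array distributions above, here is a small, hypothetical sketch (the function names and example sizes are not from the slides) of how element i of an n-element array is assigned to one of p processes under block, cyclic and block-cyclic distributions:

    /* Ownership functions for the three array distributions. */
    #include <stdio.h>

    int owner_block(int i, int n, int p) {
        int b = (n + p - 1) / p;        /* block size, rounded up */
        return i / b;
    }

    int owner_cyclic(int i, int p) {
        return i % p;                   /* deal out one element at a time */
    }

    int owner_block_cyclic(int i, int p, int blk) {
        return (i / blk) % p;           /* deal out blocks of size blk */
    }

    int main(void) {
        int n = 16, p = 4;
        for (int i = 0; i < n; i++)
            printf("i=%2d  block:%d  cyclic:%d  block-cyclic(2):%d\n",
                   i, owner_block(i, n, p), owner_cyclic(i, p),
                   owner_block_cyclic(i, p, 2));
        return 0;
    }

With n = 16 and p = 4, the block distribution gives each process 4 contiguous entries, the cyclic distribution deals elements out one at a time, and the block-cyclic distribution deals out blocks of 2, trading some locality for better load balance.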
Hierarchical Mapping
• Load imbalance can occur when mapping from the task dependency graph alone (e.g. a binary tree); quicksort benefits from a hierarchical approach
• Sparse matrix factorization
  – The high levels are guided by the task dependency graph, called the elimination graph
  – The low-level tasks use data decomposition, because that is where most of the computation happens

Dynamic Mapping
• Why? The task dependency graph is dynamic
• Two flavors
  – Centralized: tasks are kept in a central data structure or are looked after by one process
  – Distributed: processes exchange tasks at run time

Centralized schemes
• Example: sort each row of an array by quicksort
• Problem: each row can take a different amount of time to sort
• Solution: self-scheduling. Maintain a list of unsorted rows; an idle process picks a row from the list (a sketch follows after this section)
• Problem: the work queue becomes a bottleneck
• Solution: chunk scheduling. Assign multiple tasks to a process at a time
• Problem: if the chunk size is too large, load imbalance returns
• Solution: decrease the chunk size as the computation proceeds

Distributed Schemes
• The four questions
  – How do I measure the load on a task?
  – To whom do I send?
  – How much do I send?
  – When do I send?
• Can tolerate smaller granularity on shared-memory than on distributed-memory machines

Tricks to reduce the overhead of process interaction
• Maximize data locality
  – Minimize the use of non-local data, minimize the frequency of access, maximize reuse of recently accessed data
• Minimize the volume of data exchanged
  – Use the mapping scheme, e.g. a 2-dimensional mapping vs a 1-dimensional mapping
  – Use local data to store intermediate results and access shared data once, e.g. break the dot product of two vectors into p partial sums
• Minimize the frequency of interactions
  – There is a high startup cost associated with each interaction, so try to access large amounts of data per interaction
  – Aim for spatial locality: keep memory which is accessed consecutively close together
  – Pack lots of data into a single message when message passing
  – Reduce the number of cache lines fetched from shared memory
  – Example: repeated sparse matrix-vector multiplication with the same matrix but different data; each process gets what it needs from the other processes before the multiplication

Hot spots
• Hot spots happen: processes transmit over the same link or access the same data
• Sometimes the computation can be re-arranged to avoid the hot spot
• Example C = A x B, where C(i,j) = Σ_k A(i,k) B(k,j)

Overlapping Computations with Interactions
• Try to do the interaction before the computation needs it (a static interaction pattern helps)
• Run multiple tasks on the same process: if one blocks, another can execute
• Needs support from the OS, hardware, and programming paradigm
  – Disjoint memory, message passing architectures
  – Shared address space: prefetching hardware
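A minimal sketch of the centralized self-scheduling idea, assuming a shared-memory machine and OpenMP; the shared counter stands in for the list of unsorted rows, and sort_row is a hypothetical placeholder for the per-row work:

    /* Centralized self-scheduling: idle threads claim the next
     * row from a shared counter that acts as the work queue. */
    #include <stdio.h>
    #include <omp.h>

    #define N_ROWS 1000

    static void sort_row(int row) {
        (void)row;   /* placeholder: sort row 'row' with quicksort */
    }

    int main(void) {
        int next = 0;                     /* the central work queue */
        #pragma omp parallel
        {
            for (;;) {
                int row;
                /* atomically claim one task; this shared counter is
                 * the point that can become a bottleneck */
                #pragma omp atomic capture
                row = next++;
                if (row >= N_ROWS) break;
                sort_row(row);
            }
        }
        printf("dispatched %d rows\n", N_ROWS);
        return 0;
    }

OpenMP's loop schedules package the same ideas: schedule(dynamic, chunk) is chunk scheduling with a fixed chunk size, and schedule(guided) shrinks the chunk size as the computation proceeds.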
Tricks: replicating data and/or computations
• Replicate data in each process
• Frequent read-only operations can make it worthwhile
• Mostly for distributed-memory machines; shared-memory machines have caches

Optimize heavy-duty operations
• The operations: accessing data, communication-intensive computations, synchronization
• Algorithms and libraries exist
  – Algorithms: discussed soon
  – Libraries: MPI

Tricks: overlapping interactions

Parallel Algorithm Models
Recipes for decomposing, mapping and minimizing overhead
• Data-parallel model
  – Static mapping of tasks to processes
  – Each task does the same thing to different data (a sketch follows at the end of this list)
  – Phases: computation followed by synchronization
  – Message-passing architectures are more amenable to this style than shared-memory architectures
• Task graph model
  – Used when the amount of data is large relative to the computation on the data
  – Used with divide-and-conquer algorithms: parallel quicksort, sparse matrix factorization
• Work pool model
  – Any task can be executed by any process
  – Dynamic mapping of tasks to processes
  – Examples: parallelization of loops by chunk scheduling, parallel tree search
• Master-slave model
  – The dictator gives work to the students
  – Hierarchical master-slave model
• Pipeline model
  – A stream of data is passed through the processes
  – Producers followed by consumers
  – A general graph, not just a linear array
  – Example: parallel LU factorization (later)
• Hybrid model
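A minimal sketch of the data-parallel recipe, assuming MPI; the block-distributed sum over [0, N) and the constant N are illustrative choices. Every process runs the same loop on its own block of the data, and the partial results are combined in a single synchronization phase.

    /* Data-parallel model: same operation, different data, then combine.
     * Compile with mpicc and run with mpirun. */
    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Block distribution of the indices [0, N). */
        int chunk = (N + size - 1) / size;
        int lo = rank * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;

        /* Computation phase: a local partial sum. */
        double local = 0.0;
        for (int i = lo; i < hi; i++) local += (double)i;

        /* Synchronization phase: combine the partial results on rank 0. */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %.0f\n", total);

        MPI_Finalize();
        return 0;
    }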
