Winter 08, UC Irvine
EECS221 Final Project: Parallel Radix Sort Using OpenMP
Nie Weiran, 93243903, wnie@uci.edu

1. Introduction:

In this final project I chose to parallelize a sorting algorithm, radix sort, using OpenMP. Three versions of the program are submitted: the sequential version, the parallel version with basic optimizations, and the parallel version further optimized using buckets. In the following sections, I first describe the general sequential and parallel algorithms and my basic implementation. I then focus on how I improved performance using buckets, report the results, and note points that need further improvement. Detailed comments are also given in the C source code.

2. Radix Sorting Algorithms and Basic Implementation:

In sequential radix sort, each element to be sorted is represented by b (binary) bits. One pass of radix sorting sorts on a block of r bits, called a digit. Starting from the least significant digit, sequential radix sort finishes in b/r passes. Clearly, if we increase r, the number of passes decreases, but the amount of computation in each pass increases. One important characteristic of radix sorting is that the sort used within each pass must be stable. In my implementation, I use counting sort. The basic counting sort algorithm is as follows:

1) Count the number of elements of each digit value in the input array A; the counts are stored in a count array R[0 .. 2^r - 1], where 2^r is the radix;
2) Compute the prefix sum of the count array R, which gives the position of each element in the output array;
3) Copy the elements into the output array.

Parallelizing radix sort is in fact a matter of parallelizing counting sort.
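The three counting-sort steps above can be sketched in C as one pass of LSD radix sort. This is an illustrative sketch, not the submitted source: the digit width R_BITS and the helper name counting_sort_pass are my own here.

```c
#include <stddef.h>

#define R_BITS 8               /* r = 8 bits per digit */
#define RADIX  (1u << R_BITS)  /* 2^r = 256 */

/* One pass of LSD radix sort via counting sort; "shift" selects
   which digit of each key this pass sorts on. */
void counting_sort_pass(const unsigned *in, unsigned *out,
                        size_t n, unsigned shift)
{
    size_t count[RADIX] = {0};

    /* Step 1: count occurrences of each digit value. */
    for (size_t i = 0; i < n; i++)
        count[(in[i] >> shift) & (RADIX - 1)]++;

    /* Step 2: exclusive prefix sum; count[d] becomes the first
       output position for digit value d. */
    size_t sum = 0;
    for (unsigned d = 0; d < RADIX; d++) {
        size_t c = count[d];
        count[d] = sum;
        sum += c;
    }

    /* Step 3: stable copy into the output array. */
    for (size_t i = 0; i < n; i++)
        out[count[(in[i] >> shift) & (RADIX - 1)]++] = in[i];
}
```

Running this pass four times (shift = 0, 8, 16, 24) with two ping-pong buffers fully sorts 32-bit keys, which is the b/r = 32/8 pass count discussed above.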
According to [1], basic parallel counting sort follows these steps, analogous to the sequential version:

1) In parallel, each of the p processors counts its assigned n/p elements independently;
2) All processors cooperate to compute the global prefix sum;
3) Each processor copies its assigned values to the shared output array independently.

The most challenging part of parallel counting sort is performing the computation in step 2 efficiently. Figure 1 illustrates how I arrange the local counts to compute the global prefix sum.

[Figure 1: Arrangement of local counts for the global prefix sum, shown for 2 threads. The global count array is indexed as GlobalCountArray[digitValue][tid] and laid out as [0][0], [0][1], [1][0], [1][1], ..., [2^r - 1][0], [2^r - 1][1].]

We can view the global count array in Figure 1 as a one-dimensional array obtained by flattening a two-dimensional array. Each of the p threads (Figure 1 shows only 2) writes its local counts in an interleaved fashion. For example, thread 0 puts its local count of 0s into GlobalCountArray[0][0], thread 1 puts its local count of 0s into GlobalCountArray[0][1], and so on. Having arranged the local counts this way, I can treat the global count array as an ordinary one-dimensional array and apply one of the parallel prefix sum algorithms taught in class (in practice I used the odd/even parallel prefix sum algorithm). After the global prefix sum is computed, each thread uses the prefix sums belonging to it to determine the positions of its assigned elements in the output array.

It turns out that where the local counting is carried out does affect performance, which surprised me. Initially, I performed the local counting directly in the global count array, reasoning that since each thread has its own dedicated positions for its local counts, there is no interference and performance should not suffer.
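The interleaved layout of Figure 1 can be sketched as follows. This is an illustration only: the names gidx and global_positions are mine, P and the tiny radix are fixed small for clarity, the per-thread counting loops are written serially, and the odd/even parallel prefix sum is replaced by a sequential stand-in with the same result.

```c
#include <stddef.h>

#define P     2   /* number of threads (illustration only) */
#define RADIX 4   /* tiny radix: 2-bit digits, for a small example */

/* Flattened index of entry (digit value d, thread id t):
   the interleaved layout of Figure 1. */
static size_t gidx(unsigned d, unsigned t) { return (size_t)d * P + t; }

/* Fill counts[RADIX * P] from each thread's block of the input, then
   turn it into an exclusive prefix sum. Afterwards counts[gidx(d, t)]
   holds the first output slot for thread t's elements with digit d. */
void global_positions(const unsigned *a, size_t n, size_t *counts)
{
    for (size_t i = 0; i < (size_t)RADIX * P; i++) counts[i] = 0;

    /* Step 1 (serialized here): thread t counts its block
       [t*n/P, (t+1)*n/P) into its own interleaved slots. */
    for (unsigned t = 0; t < P; t++)
        for (size_t i = (size_t)t * n / P; i < (size_t)(t + 1) * n / P; i++)
            counts[gidx(a[i] & (RADIX - 1), t)]++;

    /* Step 2: sequential stand-in for the odd/even parallel prefix
       sum over the flattened one-dimensional array. */
    size_t sum = 0;
    for (size_t i = 0; i < (size_t)RADIX * P; i++) {
        size_t c = counts[i];
        counts[i] = sum;
        sum += c;
    }
}
```

Because the counts are interleaved by digit value first and thread id second, a single prefix sum over the flattened array simultaneously gives every thread a non-overlapping, stable range of output positions for every digit value.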
But after I got the basic code running, there was little performance gain. I tried several ways to improve it, and it proved effective to perform the local count in a thread-local array and then copy the results back into the global count array. The idea is illustrated in Figure 2.

[Figure 2: Performing the local count locally and then copying the results back into the global count array (laid out as [digitValue][tid]: L0[0], L1[0], ..., L0[2^r - 1], L1[2^r - 1]) before computing the prefix sum improves performance. L0[0 .. 2^r - 1] is the local count array owned by thread 0, where 2^r is the radix (r is the number of bits processed in each pass of radix sort).]

Other effective optimizations include using an interleaved (cyclic) partition instead of a block partition when assigning the elements of the input array to threads.

3. Improving Performance Through Pre-sorting Using Buckets:

The input elements I use are random unsigned integers in the range 0 to 2^32 - 1. When I run the sequential version with a radix of 256, i.e., considering 8 bits in each pass, it takes significantly less CPU time than with a radix of 65536 (detailed timings are reported later). For the basic parallel version, however, a radix of 65536 performs better than 256. This led me to think: given the random access pattern during the local count (the input elements are generated randomly), the radix should be chosen as small as possible to reduce cache misses. For example, a radix of 256 yields a local count array of size 256 * 4 bytes = 1 KB, while a radix of 65536 yields a 256 KB local count array, which is unlikely to fit entirely in cache. On the other hand, the fewer the passes in radix sort, the better for the parallel program (perhaps because of parallel overhead).
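The local-count-then-copy-back idea of Figure 2 can be sketched with OpenMP as below. This is a sketch under assumptions: the function name local_then_copy is mine, and a serial fallback is provided so the code also compiles without OpenMP support.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without -fopenmp. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define RADIX 256

/* Each thread counts digit occurrences into a private local array
   first, then copies it into the interleaved global array of Figure 2.
   global_count must have room for RADIX * nthreads entries. */
void local_then_copy(const unsigned *digits, size_t n, size_t *global_count)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int p   = omp_get_num_threads();
        size_t local[RADIX] = {0};  /* stays hot in this core's cache */

        /* Cyclic (interleaved) partition of the input, as in the
           report's other optimization. */
        for (size_t i = (size_t)tid; i < n; i += (size_t)p)
            local[digits[i] & (RADIX - 1)]++;

        /* One write per digit value back to the shared array; each
           thread owns column tid, so no synchronization is needed. */
        for (unsigned d = 0; d < RADIX; d++)
            global_count[(size_t)d * p + tid] = local[d];
    }
}
```

The point of the detour through local[] is that the hot counting loop touches only a small private array instead of the larger shared one, trading one extra RADIX-sized copy per thread for better cache behavior during the scan.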
What if I could combine the merits of the two: keep the radix large to reduce the number of passes, while reducing the cache misses caused by the large local count array? The method I used is to perform a bucket sort before each counting sort. The idea is demonstrated in Figure 3.

[Figure 3: Increasing access locality by pre-sorting into buckets. The unsorted, randomly generated array is scattered into presorted[0..255] / digitValue[0..255]: bucket #0 holds digit values 0-255, bucket #1 holds digit values 256-511, ..., bucket #255 holds digit values 65280-65535.]

In Figure 3, the unsorted array consists of randomly generated unsigned integers in the range 0 to 2^32 - 1. During each pass, the unsorted array is scanned linearly and a mask is used to extract one digit from each element; the extracted digits form a digitValue array that serves as the input to the parallel counting sort. Without the bucket pre-sort, each extracted digit value is put into the digitValue array at the same index as in the unsorted array, so the corresponding element of the unsorted array can easily be located later. With the bucket pre-sort, four problems arise immediately:

1) How big should each bucket's pre-allocated size be to ensure it is no smaller than the actual size?
2) How can each element in the unsorted array be assigned to its corresponding bucket efficiently?
3) After the bucket pre-sort, the elements in the unsorted array can no longer be located using the same index as in the digitValue array, so we must keep track of which element of the unsorted array goes where in the digitValue array in order to retrieve it later.
4) Whenever a digit is added to a bucket, a pointer recording the bucket's actual size is incremented. With multiple threads running, several threads may access and increment the same pointer simultaneously.
How can this synchronization problem be solved without degrading performance?

For the first problem, I assume the actual number of elements in each bucket will be roughly the same because the input elements are randomly generated. Therefore I heuristically assign each bucket a size somewhat larger than the total number of elements divided by the number of buckets. For the second problem, with a radix of 65536 and 256 buckets, I simply divide each digit value by 256 (essentially an 8-bit right shift) to get the bucket number it belongs to. For example, 65535 / 256 = 255, so the digit value 65535 goes to bucket #255. For the third problem, I use an auxiliary array e[] of the same size as digitValue[]. Whenever a digit is extracted from unsorted[] and put into a bucket in digitValue[], the original element of unsorted[] is put into e[] at exactly the same index. This facilitates the later retrieval of the original element, but requires more memory, especially when the input array is huge. For the fourth problem, I tried two OpenMP synchronization constructs: omp critical and omp atomic. omp critical is stronger than necessary, because it blocks all other threads from accessing the whole pointer array, when only the threads trying to increment the pointer of the same bucket need to be excluded. omp atomic is right for this purpose, as it provides a mini critical section. However, due to the restrictions of the omp atomic syntax, only the single update operation is protected by the construct. This leads to a problem: when two threads read the pointer value and then write their digits to that position, a race condition may occur. I believe this is why the correctness test of my program with bucket pre-sorting fails. One feasible solution is to create a lock array corresponding to the pointer array.
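The solutions to problems 2-4, including the proposed per-bucket lock array, can be sketched as below. This is not the submitted source: the names bucket_scatter, bucket_base, and ptr are hypothetical, the locks are initialized inside the function only for brevity, and the OpenMP parts are guarded so the sketch also builds serially.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define RADIX    65536u  /* 16-bit digits */
#define NBUCKETS 256u

/* Scatter one pass's digits into buckets. bucket_base[b] is the
   pre-allocated start offset of bucket b; ptr[b] is its fill count.
   e[] receives the original element at the same index as its digit
   (problem 3); a lock per bucket guards the read-then-increment of
   ptr[b] (the fix proposed for problem 4). */
void bucket_scatter(const unsigned *unsorted, size_t n, unsigned shift,
                    unsigned *digitValue, unsigned *e,
                    const size_t *bucket_base, size_t *ptr)
{
#ifdef _OPENMP
    static omp_lock_t lock[NBUCKETS];
    for (unsigned b = 0; b < NBUCKETS; b++) omp_init_lock(&lock[b]);
#endif

    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        unsigned digit  = (unsorted[i] >> shift) & (RADIX - 1);
        unsigned bucket = digit >> 8;   /* digit / 256 picks the bucket */

#ifdef _OPENMP
        omp_set_lock(&lock[bucket]);    /* guard read + increment */
#endif
        size_t pos = bucket_base[bucket] + ptr[bucket]++;
#ifdef _OPENMP
        omp_unset_lock(&lock[bucket]);
#endif
        digitValue[pos] = digit;        /* pre-sorted digit */
        e[pos] = unsorted[i];           /* original element, same index */
    }

#ifdef _OPENMP
    for (unsigned b = 0; b < NBUCKETS; b++) omp_destroy_lock(&lock[b]);
#endif
}
```

Unlike omp atomic, the lock covers both the read of ptr[bucket] and its increment, so two threads can no longer claim the same slot; threads writing to different buckets still proceed in parallel.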
Before reading and incrementing a bucket pointer, a thread would first acquire the lock corresponding to that pointer. Even so, the current program can reasonably be viewed as an approximation of how much performance can be gained by bucket pre-sorting. In summary, the bucket pre-sort scheme adds extra CPU computation, synchronization, and space requirements. As a result, the performance gain is not as large as I expected. I report the timings in the following section.

4. Timing Report and Instructions to Run the Program:

Program Version              Radix   CPU Time (sec)   Speedup
Sequential                   65536   7.4571           -
Sequential                   256     4.9610           base time
Parallel_basic               65536   5.4770           0.9057
Parallel_basic               256     6.4970           0.7635
Parallel_local_count_array   65536   3.9780           1.2471
Parallel_local_count_array   256     4.8990           1.0126
Parallel_bucket_pre_sort     65536   3.4630           1.4320

Table 1: Final project timing report. Input array size: 25,000,000; test platform: AMD dual-core laptop; number of threads in the parallel programs: 2.

The speedup of parallel radix sort with bucket pre-sort over sequential radix sort with radix 256 is 1.432.

To run the program in Visual Studio, first create a project with one of the source files and compile it. Supposing the executable is named radix_sort.exe, run the program by specifying the number of elements to sort on the command line, e.g.

C:\eecs221 final\radix_sort 25000000

References:
[1] N. M. Amato. A Comparison of Parallel Sorting Algorithms on Different Architectures. TAMU Technical Report 98-029, Jan. 1996.
[2] M. Zagha and G. E. Blelloch. Radix Sort for Vector Multiprocessors. In Proceedings of Supercomputing '91, pages 712-721, Nov. 1991.
