					                                                                                  Winter 08, UC Irvine


             EECS221 Final Project - Parallel Radix Sort Using OpenMP
                         Nie Weiran 93243903 (wnie@uci.edu)

1. Introduction:
     In this final project I chose to parallelize a sorting algorithm, radix sort, using OpenMP.
  Three versions of the program are submitted: the sequential version, the parallel version with
  basic optimization, and the parallel version optimized using buckets. In the following sections,
  I first discuss the general sequential and parallel algorithms and my basic implementation.
  Then I focus on how I improve the performance using buckets, report the results, and point out
  what needs further improvement. Detailed comments are also given in the C source code.

2. Radix Sorting Algorithms and Basic Implementation:
     In sequential radix sort, each element to be sorted can be represented by b (binary) bits.
  Typically, one pass of radix sorting sorts blocks of r bits, called a digit. Starting from the
  least significant digit, sequential radix sorting finishes in ⌈b/r⌉ passes. Clearly, if we
  increase r, the number of passes decreases, but the amount of computation in each pass increases.
  One important characteristic of radix sorting is that the per-digit sorting algorithm used must
  be stable. In my implementation, I use counting sort. The basic counting sort algorithm is as
  follows (a sketch in C is given after the list):

      -  Count the number of elements of each value in the input array A; the counts are stored
         in a count array R[0 … 2^r − 1], where 2^r is the radix;
      -  Compute the prefix sum of the count array R, which gives the position of each
         corresponding element in the output array;
      -  Copy the elements into the output array at those positions.
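     As a concrete illustration, here is a minimal sketch in C of one such counting-sort pass over
  a single digit. The function name, parameter names, and the use of calloc are my own illustrative
  choices rather than the submitted code; the digit is extracted with a shift and a mask, where
  mask = radix − 1.

    #include <stdlib.h>

    /* Sketch: one stable counting-sort pass over the digit selected by
       `shift` and `mask` (mask = radix - 1). Illustrative names only. */
    static void counting_sort_pass(const unsigned *in, unsigned *out,
                                   size_t n, unsigned shift, unsigned mask)
    {
        size_t radix = (size_t)mask + 1;
        size_t *count = calloc(radix, sizeof *count);

        /* 1. Count how many elements have each digit value. */
        for (size_t i = 0; i < n; i++)
            count[(in[i] >> shift) & mask]++;

        /* 2. Exclusive prefix sum: count[d] becomes the first output
              position for digit value d. */
        size_t sum = 0;
        for (size_t d = 0; d < radix; d++) {
            size_t c = count[d];
            count[d] = sum;
            sum += c;
        }

        /* 3. Stable copy into the output array. */
        for (size_t i = 0; i < n; i++)
            out[count[(in[i] >> shift) & mask]++] = in[i];

        free(count);
    }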

     Parallelizing radix sort is in fact a matter of parallelizing counting sort. According to [1],
  the basic parallel counting sort follows these steps, analogous to its sequential version:

      -  In parallel, each one of the p processors counts its assigned n/p elements
         independently;
      -  All processors work cooperatively to compute the global prefix sum;
      -  Each processor then copies its assigned values to the shared output array
         independently.

     The most challenging part of parallel counting sort is to figure out how to perform the
  computation in step 2 efficiently. To illustrate how I arrange the local counts to compute the
  global prefix sum, I will use Figure 1:



  [Figure: the global count array, indexed as [digitValue][tid] and stored flattened as
   [0][0] [0][1]  [1][0] [1][1]  …  [2^r − 1][0] [2^r − 1][1] for 2 threads]

 Figure 1 Arrangement of local counts to compute the global prefix sum, demonstrated using 2 threads.

    We can view the global count array in Fig 1 as a one-dimensional array resulting from
 flattening a two-dimensional array. Each of the p threads (Fig 1 only shows 2 threads) puts its
 local counts into the array in an interleaved fashion. For example, thread 0 puts its local count
 of 0s into GlobalCountArray[0][0], thread 1 puts its local count of 0s into
 GlobalCountArray[0][1], and so on. Having arranged the local counts from the various threads in
 this way, I can treat the global count array as an ordinary one-dimensional array and apply one
 of the parallel prefix sum algorithms taught in class (in my implementation I used the odd/even
 parallel prefix sum algorithm). After the global prefix sum is computed, each thread uses the
 prefix sums belonging to it to determine the positions of its assigned elements in the output
 array.
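
    To make the arrangement concrete, the following is a minimal sketch (with illustrative names,
 not the submitted code) of one parallel counting-sort pass using this interleaved layout. A block
 partition and a serial prefix sum are used here only to keep the sketch short; the submitted
 program uses the odd/even parallel prefix sum and the cyclic partition mentioned below.

    #include <omp.h>
    #include <stddef.h>

    /* Sketch of one parallel counting-sort pass with the interleaved global
       count array of Figure 1: global_count[d * p + tid] belongs to thread tid.
       Assumes global_count[0 .. radix*p - 1] is zeroed on entry. */
    void parallel_counting_pass(const unsigned *in, unsigned *out, size_t n,
                                size_t *global_count, unsigned shift,
                                size_t radix, int p)
    {
        #pragma omp parallel num_threads(p)
        {
            int tid = omp_get_thread_num();
            size_t lo = n * tid / p, hi = n * (tid + 1) / p;   /* block partition */

            /* Step 1: each thread counts its assigned elements independently. */
            for (size_t i = lo; i < hi; i++)
                global_count[((in[i] >> shift) & (radix - 1)) * p + tid]++;
            #pragma omp barrier

            /* Step 2: prefix sum over the flattened array (serial here; the
               submitted code uses an odd/even parallel scan). */
            #pragma omp single
            {
                size_t sum = 0;
                for (size_t i = 0; i < radix * p; i++) {
                    size_t c = global_count[i];
                    global_count[i] = sum;
                    sum += c;
                }
            }   /* implicit barrier at the end of single */

            /* Step 3: each thread scatters its elements using the prefix sums
               that belong to it; the resulting output positions are disjoint. */
            for (size_t i = lo; i < hi; i++) {
                size_t d = (in[i] >> shift) & (radix - 1);
                out[global_count[d * p + tid]++] = in[i];
            }
        }
    }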

    It turns out that where the local counting is carried out does affect performance, which was a
 surprise to me. Initially, I used the global count array directly to perform the local counting.
 My reasoning was that each thread has its own dedicated positions in which to record its local
 counts, so there is no interference and performance should not be affected. But after I got the
 basic code running, there was little performance gain. I tried several ways to improve it, and it
 proved effective to perform the counting in a thread-local array and then copy the counts back
 into the global count array. The idea is illustrated in Figure 2.

  [Figure: thread-local count arrays C0 and C1 being copied back into the interleaved
   global count array [digitValue][tid]]

 Figure 2 Performing the local count locally, and then copying the counts back into the global
 count array to compute the prefix sum, does improve performance. C0[0] … C0[2^r − 1] represents
 the local count array owned by thread 0, where 2^r is the radix (r is the number of bits
 processed in each pass of radix sort).
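
    A minimal sketch of this Figure 2 variant, replacing step 1 of the sketch above (again with
 illustrative names): each thread counts into a private array, which is small enough to have a good
 chance of staying in cache, and only afterwards copies its radix counts into the interleaved
 global array.

    /* Figure 2 variant (sketch): thread-private counting, then copy-back.
       Runs inside the parallel region above; lo, hi, tid, radix, p as before. */
    size_t *local_count = calloc(radix, sizeof *local_count);
    for (size_t i = lo; i < hi; i++)
        local_count[(in[i] >> shift) & (radix - 1)]++;
    for (size_t d = 0; d < radix; d++)                 /* copy back (Figure 2) */
        global_count[d * p + tid] = local_count[d];
    free(local_count);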


     Another effective optimization is to use an interleaved (cyclic) partition instead of a block
  partition when assigning the elements of the input array to threads; a short sketch of the
  difference follows.
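
     For illustration only, the two partitions differ in how each thread's counting (and scatter)
  loop walks the input; with a cyclic partition, thread tid touches every p-th element:

    /* Block partition: thread tid handles one contiguous chunk [lo, hi). */
    for (size_t i = lo; i < hi; i++)
        local_count[(in[i] >> shift) & (radix - 1)]++;

    /* Cyclic (interleaved) partition: thread tid handles elements
       tid, tid + p, tid + 2p, ... */
    for (size_t i = (size_t)tid; i < n; i += (size_t)p)
        local_count[(in[i] >> shift) & (radix - 1)]++;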

3. Improving Performance Through Pre-sorting Using Buckets:
      The input elements I use are random unsigned integers ranging from 0 to 2^32 − 1. When I run
  the sequential version with a radix of 256, that is, considering 8 bits in one pass of radix
  sort, it takes significantly less CPU time than with a radix of 65536 (detailed timing is
  reported later). For the basic parallel version, however, a radix of 65536 performs better than
  256. This leads me to think:

      -  Considering the random access pattern during the local count (the input elements are
         generated randomly), the radix should be chosen as small as possible to reduce cache
         misses. For example, a radix of 256 results in a local count array of size
         256 × 4 bytes = 1 KB, while a radix of 65536 results in a 256 KB local count array,
         which is unlikely to fit entirely in cache.
      -  On the other hand, the fewer the passes in radix sort, the better the parallel program
         seems to perform (perhaps because of parallel overhead).

  What if I could combine the merits of the two: keep the radix large to reduce the number of
  passes while reducing the cache misses caused by the large local count array? The method I use
  is to perform a bucket sort before each counting sort. The idea is demonstrated in Figure 3.

  [Figure: the unsorted, randomly generated input is scattered into 256 buckets, each with a
   presorted[b] region and a corresponding digitValue[b] region. Bucket #0 contains digit values
   0~255, bucket #1 contains 256~511, …, bucket #255 contains 65281~65535.]

   Figure 3 Increasing access locality by pre-sorting using buckets.


   In Figure 3, the unsorted array consists of randomly generated unsigned integers ranging from
 0 to 2^31 − 1. During each pass, the unsorted array is scanned linearly and a mask is used to
 extract one digit from each element; the extracted digits form a digitValue array that serves as
 the input to the parallel counting sort. Without the bucket pre-sort, each extracted digit value
 is put directly into the digitValue array at the same index it has in the unsorted array, so that
 the corresponding element in the unsorted array can easily be located later. With the bucket
 pre-sort, four problems arise immediately:

       -  How much space should be pre-allocated to each bucket to ensure the pre-allocated size
          is no smaller than the actual size;
       -  How to assign each element in the unsorted array to its corresponding bucket
          efficiently;
       -  After the bucket pre-sort, the elements in the unsorted array can no longer be located
          using the same index as in the digitValue array, so we must keep track of which element
          in the unsorted array goes where in the digitValue array in order to retrieve it later;
       -  Whenever a digit is added to a bucket, a pointer recording the actual size of that
          bucket is incremented. With multiple threads running, several threads are likely to
          access and increment the same pointer simultaneously. How can this synchronization
          problem be solved without degrading performance?

   As for the first problem, I assume the actual number of elements in each bucket will be roughly
 the same because the input elements are randomly generated. Therefore, I heuristically assign a
 bucket size somewhat larger than the total number of elements divided by the number of buckets.

    For the second problem, using a radix of 65536 and 256 buckets, I simply divide each digit
 value by 256 (essentially an 8-bit right shift) to get the bucket number it belongs to. For
 example, 65535/256 = 255, so the digit value 65535 goes to bucket #255.

    For the third problem, I use an auxiliary array e[], which has the same size as digitValue[].
 Whenever a digit is extracted from unsorted[] and put into a bucket in digitValue[], the original
 element from unsorted[] is put into e[] at exactly the same index. This facilitates the later
 retrieval of the original element, but does require more memory, especially when the input array
 is huge.
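
    The following is a minimal sketch of this bookkeeping, shown single-threaded for clarity.
 bucket_size, bucket_ptr, and shift are illustrative names of my own, and digitValue[] and e[] are
 assumed to be laid out as 256 contiguous regions of bucket_size entries each; the digit and its
 original element land at the same index, which is what makes the later retrieval possible.

    /* Sketch of the bucket pre-sort bookkeeping (single-threaded for clarity).
       bucket_ptr[b] records the next free slot inside bucket b. */
    for (size_t i = 0; i < n; i++) {
        unsigned d = (unsorted[i] >> shift) & 0xFFFFu;    /* 16-bit digit  */
        unsigned b = d >> 8;                              /* bucket number */
        size_t   j = (size_t)b * bucket_size + bucket_ptr[b]++;
        digitValue[j] = d;             /* digit goes into its bucket         */
        e[j] = unsorted[i];            /* original element at the same index */
    }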

    For the fourth problem, I tried two OpenMP synchronization constructs: omp critical and
 omp atomic. omp critical is really not necessary, because it prevents all other threads from
 accessing the whole pointer array, when in fact only the threads trying to increment the pointer
 of the same bucket need to be held back. omp atomic is the right construct for this purpose, as
 it effectively provides a mini critical section. However, due to the restrictions on the syntax
 of omp atomic, only a single update operation can be placed under the construct. This leads to a
 problem: when two threads try to read the pointer value and then put their digits into that
 position, a race condition may occur. I think that is why the correctness test of my program
 with bucket pre-sorting fails. One feasible solution I can think of is to create a lock array
 corresponding to the pointer array: before reading and incrementing a pointer, a thread must
 first acquire the lock corresponding to that pointer. Even so, the current program can
 reasonably be viewed as an approximation of how much performance gain can be gleaned from
 bucket pre-sorting.
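
    A minimal sketch of that proposed lock-array fix (not part of the submitted code) pairs each
 bucket pointer with an OpenMP lock, so that only threads targeting the same bucket serialize:

    #include <omp.h>

    omp_lock_t lock[256];                     /* one lock per bucket */
    for (int b = 0; b < 256; b++)
        omp_init_lock(&lock[b]);

    /* ... inside the parallel scatter loop, with d, b, i as in the sketch above ... */
    omp_set_lock(&lock[b]);
    size_t j = (size_t)b * bucket_size + bucket_ptr[b]++;   /* protected read-and-increment */
    omp_unset_lock(&lock[b]);
    digitValue[j] = d;
    e[j] = unsorted[i];

    /* ... after the parallel region ... */
    for (int b = 0; b < 256; b++)
        omp_destroy_lock(&lock[b]);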

     In summary, adding the bucket pre-sort scheme introduces extra computation, synchronization,
  and memory requirements. As a result, the performance gain is not as large as I expected. The
  timing is reported in the following section.

4. Timing Report and Instructions for Running the Program:
   Program Version                Radix     CPU Time (sec)    Speedup
   Sequential                     65536     7.4571            -
   Sequential                       256     4.9610            base time
   Parallel_basic                 65536     5.4770            0.9057
   Parallel_basic                   256     6.4970            0.7635
   Parallel_local_count_array     65536     3.9780            1.2471
   Parallel_local_count_array       256     4.8990            1.0126
   Parallel_bucket_pre_sort       65536     3.4630            1.4320

 Table 1 Final project timing report. Input array size: 25,000,000; test platform: AMD dual-core
 laptop; number of threads running in the parallel programs: 2.

     The speedup of the parallel radix sort with bucket pre-sorting over the sequential radix sort
  with radix 256 is 1.432.

     To run the program in Visual Studio, first create a project with one of the source files and
  compile it. Supposing the executable is named radix_sort.exe, run the program by specifying the
  number of elements to be sorted on the command line, e.g.

      C:\eecs221 final\radix_sort 25000000




References:
[1] N. M. Amato. A Comparison of Parallel Sorting Algorithms on Different Architectures.
TAMU technical report 98-029, Jan. 1996.
[2] M. Zagha and G. E. Blelloch. Radix sort for vector multiprocessors. In Proceedings
Supercomputing’91, pages 712–721, Nov. 1991.