Winter 08, UC Irvine
EECS221 Final Project-Parallel Radix Sort Using OpenMP
email@example.com Nie Weiran 93243903
In this final project I chose to parallelize a sorting algorithm, radix sort, using OpenMP.
Three versions of the program are submitted: the sequential version, the parallel version with
basic optimization, and the parallel version optimized using buckets. In the following sections,
I first describe the general sequential and parallel algorithms and my basic implementation.
I then focus on how I improved the performance using buckets, and report the results and the
points that need further improvement. Detailed comments are also given in the C source code.
2. Radix Sorting Algorithms and Basic Implementation:
In sequential radix sort, each element to be sorted is represented by b (binary) bits.
Typically, one pass of radix sort sorts on a block of r bits, called a digit. Starting from the least
significant digit, sequential radix sort finishes in b/r passes. Clearly, if we increase r,
the number of passes decreases, but the amount of computation in each pass increases.
One important characteristic of radix sort is that the sorting algorithm used in each pass must
be stable. In my implementation, I use counting sort. The basic counting sort algorithm is as follows:
(1) Count the number of elements of each value in the input array A; the counts are stored
in a count array R[0 … 2^r − 1], where 2^r is the radix;
(2) Compute the prefix sum over the count array R, which gives the position of the
corresponding element in the output array;
(3) Copy the elements into the output array.
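The three counting sort steps above, repeated once per digit, can be sketched in C as follows. This is a minimal illustration rather than the submitted source: the function name radix_sort, the use of 32-bit keys (b = 32) and the restriction that r divides 32 are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sorts n 32-bit keys with least-significant-digit radix sort, r bits per pass.
 * Assumption for this sketch: r divides 32 and r <= 16. */
static void radix_sort(uint32_t *a, size_t n, unsigned r)
{
    assert(r > 0 && r <= 16 && 32 % r == 0);
    if (n == 0)
        return;

    size_t radix = (size_t)1 << r;
    size_t *count = malloc(radix * sizeof *count);
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!count || !tmp)
        abort();

    for (unsigned shift = 0; shift < 32; shift += r) {   /* b/r passes */
        memset(count, 0, radix * sizeof *count);

        /* Step 1: count the occurrences of each digit value. */
        for (size_t i = 0; i < n; i++)
            count[(a[i] >> shift) & (radix - 1)]++;

        /* Step 2: exclusive prefix sum, giving the first output
         * position for each digit value. */
        size_t sum = 0;
        for (size_t d = 0; d < radix; d++) {
            size_t c = count[d];
            count[d] = sum;
            sum += c;
        }

        /* Step 3: copy in input order, which keeps the pass stable. */
        for (size_t i = 0; i < n; i++)
            tmp[count[(a[i] >> shift) & (radix - 1)]++] = a[i];

        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
    free(count);
}
```

Calling radix_sort(a, n, 8) makes four passes with a radix of 256, while radix_sort(a, n, 16) makes two passes with a radix of 65536.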
Parallelizing radix sort is in fact to parallelize counting sort. According to , the basic
parallel counting sort follows these steps, analogous to its sequential version:
(1) In parallel, each of the p processors counts its assigned n/p elements, where n is the
number of input elements;
(2) All processors work cooperatively to compute the global prefix sum;
(3) Each processor then copies its assigned values to the shared output array.
The most challenging part of parallel counting sort is to figure out how to perform the
computation in step 2 efficiently. To illustrate how I arrange the local counts to compute the
global prefix sum, I will use Figure 1:
Figure 1 Arrangement of local counts to compute the global prefix sum, demonstrated using 2 threads.
We can view the global count array in Figure 1 as a one-dimensional array obtained by
flattening a two-dimensional array. Each of the p threads (Figure 1 shows only 2 threads) stores
its local counts in an interleaved fashion. For example, thread 0 puts its local count of 0s into
GlobalCountArray[0], thread 1 puts its local count of 0s into GlobalCountArray[1], and
so on. Having arranged the local counts from the various threads in this way, I can
treat the global count array as an ordinary one-dimensional array and apply one of the
parallel prefix sum algorithms taught in class (in practice I used the odd/even parallel prefix
sum algorithm). After the global prefix sum is computed, each thread uses the prefix sums
belonging to it to determine the position of each of its assigned elements in the output array.
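The interleaved arrangement can be made concrete with a small sketch. The helper names slot and exclusive_scan are hypothetical, and a plain serial scan stands in for the odd/even parallel prefix sum, so this only illustrates the layout and the meaning of the scanned values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: position of thread `tid`'s count for digit value `d`
 * in the flattened global count array, with p threads in total. */
static size_t slot(size_t d, size_t tid, size_t p)
{
    assert(tid < p);
    return d * p + tid;
}

/* In-place exclusive prefix sum over the flattened array of length len.
 * A serial scan is used here in place of a parallel prefix sum. */
static void exclusive_scan(size_t *g, size_t len)
{
    size_t sum = 0;
    for (size_t k = 0; k < len; k++) {
        size_t c = g[k];
        g[k] = sum;      /* number of elements placed before this slot */
        sum += c;
    }
}
```

As a worked case with a radix of 2 and 2 threads: if thread 0 counts three 0s and one 1, and thread 1 counts two 0s and two 1s, the flattened array is {3, 2, 1, 2} and the scan turns it into {0, 3, 5, 6}. Thread 0 then writes its 0s starting at output position 0, thread 1 writes its 0s starting at 3, thread 0 writes its 1s starting at 5, and thread 1 writes its 1s starting at 6.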
It turns out that where the local count is carried out does affect performance, which
surprised me. Initially, I used the global count array directly to perform the local counting.
My reasoning was that each thread has its own dedicated positions for recording its local
counts, so there is no interference and performance should not be affected. But after I got the
basic code running, there was little performance gain. I tried several ways to improve it, and
what proved effective was to perform the local count in a thread-local array and then copy the
counts back into the global count array. The idea is illustrated in Figure 2.
[Figure 2: each thread's local count array is copied into the global count array in interleaved
order: thread 0's count of digit 0, thread 1's count of digit 0, thread 0's count of digit 1,
thread 1's count of digit 1, …, thread 0's count of digit 2^r − 1, thread 1's count of digit 2^r − 1.]
Figure 2 Performing the local count locally, and then copying the counts back to the global count
array to compute the prefix sum, does improve performance. The local count array owned by
thread 0 has entries 0 … 2^r − 1, where 2^r is the radix (r is the number of bits processed in each
pass of radix sort).
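One pass of the scheme in Figure 2 might look like the sketch below. It is an illustration under stated assumptions, not the submitted code: the function name parallel_count_pass and the constants R_BITS and RADIX are hypothetical, each thread takes one contiguous block of the input, and the global prefix sum is computed serially by a single thread rather than with the odd/even parallel prefix sum. The _OPENMP guards only allow the sketch to compile without OpenMP as well.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define R_BITS 8                 /* bits per digit (assumed for the sketch) */
#define RADIX  (1u << R_BITS)

/* One stable counting-sort pass on the digit at bit offset `shift`. */
static void parallel_count_pass(const uint32_t *in, uint32_t *out,
                                size_t n, unsigned shift)
{
    int max_threads = 1;
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif
    /* Flattened global count array: entry d * p + tid. */
    size_t *global = calloc((size_t)RADIX * (size_t)max_threads, sizeof *global);
    if (!global)
        abort();

    #pragma omp parallel
    {
        int tid = 0, p = 1;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        p = omp_get_num_threads();
#endif
        assert(p <= max_threads);
        size_t lo = n * (size_t)tid / (size_t)p;          /* this thread's block */
        size_t hi = n * (size_t)(tid + 1) / (size_t)p;
        size_t local[RADIX] = {0};                        /* thread-local counts */

        for (size_t i = lo; i < hi; i++)
            local[(in[i] >> shift) & (RADIX - 1)]++;

        /* Copy the local counts back in interleaved order. */
        for (size_t d = 0; d < RADIX; d++)
            global[d * (size_t)p + (size_t)tid] = local[d];

        #pragma omp barrier
        #pragma omp single
        {
            size_t sum = 0;                               /* serial exclusive scan */
            for (size_t k = 0; k < (size_t)RADIX * (size_t)p; k++) {
                size_t c = global[k];
                global[k] = sum;
                sum += c;
            }
        }   /* implicit barrier at the end of single */

        /* Reuse the local array as this thread's write positions. */
        for (size_t d = 0; d < RADIX; d++)
            local[d] = global[d * (size_t)p + (size_t)tid];
        for (size_t i = lo; i < hi; i++)
            out[local[(in[i] >> shift) & (RADIX - 1)]++] = in[i];
    }
    free(global);
}
```

Calling the function once per digit, with shift = 0, 8, 16 and 24 and the input and output buffers swapped after each call, sorts 32-bit keys.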
Other effective optimizations include using an interleaved (cyclic) partition instead of a block
partition when assigning the elements of the input array to threads.
3. Improve Performance Through Pre-sorting Using Buckets:
The input elements I use are random unsigned integers in the range 0 to 2^32 − 1. When I run
the sequential version with a radix of 256, that is, with 8 bits considered in each pass of radix
sort, it takes significantly less CPU time than with a radix of 65536 (detailed timing is
reported later). For the basic parallel version, however, a radix of 65536 performs better than
256. This leads me to think:
- Considering the random access pattern of the local count (the input elements are
generated randomly), a smaller radix is better because it reduces cache misses. For
example, a radix of 256 results in a local count array of 256 × 4 bytes = 1 KB, while a
radix of 65536 results in a 256 KB local count array, which is unlikely to fit entirely
in cache.
- On the other hand, the fewer passes radix sort makes, the better the parallel program
seems to perform (perhaps because of parallel overhead).
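The trade-off between the two radices can be written out as a short calculation, assuming 32-bit keys (b = 32) and 4-byte count entries as in the example above:

```latex
\[
\text{passes} = \frac{b}{r}: \qquad
\frac{32}{8} = 4 \;\; (\text{radix } 2^{8} = 256), \qquad
\frac{32}{16} = 2 \;\; (\text{radix } 2^{16} = 65536)
\]
\[
\text{count array size} = 2^{r} \times 4 \text{ bytes}: \qquad
256 \times 4 = 1024 \text{ B} = 1 \text{ KB}, \qquad
65536 \times 4 = 262144 \text{ B} = 256 \text{ KB}
\]
```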
What if I combine the merits of the two: keep the radix large to reduce the number of passes,
while reducing the cache misses caused by the large local count array? The method I used is
to perform a bucket sort before each counting sort. The idea is demonstrated in Figure 3.
[Figure 3: the digit values extracted from the randomly generated unsorted array are
distributed into 256 buckets: bucket #0 contains digit values from 0 to 255, bucket #1 contains
digit values from 256 to 511, …, bucket #255 contains digit values from 65280 to 65535.]
Figure 3 Increasing access locality by pre-sorting with buckets.
In Figure 3, the unsorted array consists of randomly generated unsigned integers in the range
0 to 2^31 − 1. During each pass, the unsorted array is scanned linearly and a mask is used to
extract one digit from each element; the digits form a digitValue array, which serves as the
input of the parallel counting sort. Without bucket pre-sort, each extracted digit value is put
into the digitValue array at the same index as in the unsorted array, so that the corresponding
element in the unsorted array can easily be located later. With bucket pre-sort, four problems
arise:
(1) How much space should be pre-allocated to each bucket to ensure that the pre-allocated
size is no smaller than the actual size;
(2) How to assign each element in the unsorted array to its corresponding bucket;
(3) After bucket pre-sort, an element in the unsorted array can no longer be located using
the same index as in the digitValue array, so we must keep track of which element of the
unsorted array goes where in the digitValue array in order to retrieve it later;
(4) Whenever a digit is added to a bucket, a pointer recording the actual size of the
bucket is incremented. When multiple threads are running, several threads are likely to
access and increment the pointer simultaneously. How can this synchronization
problem be solved without degrading performance?
As for the first problem, I assume the actual number of elements in each bucket will be
roughly the same because the input elements are randomly generated. Therefore I heuristically
assign a bucket size that is larger than the total number of elements divided by the number of
buckets.
For the second problem, using a radix of 65536 and 256 buckets, I simply divide each digit
value by 256 (essentially a 1-byte, that is 8-bit, right shift) to get the number of the bucket it
belongs to. For example, 65535/256 = 255, so the digit value 65535 goes to bucket #255.
For the third problem, I used an auxiliary array e, which has the same size as digitValue.
Whenever a digit is extracted from unsorted and put into a bucket in digitValue, the
original element in unsorted is put into e at exactly the same index as in digitValue. This
facilitates the later retrieval of the original element, but does require more memory space,
especially when the input array is huge.
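The handling of the first three problems can be sketched sequentially as follows. The function name bucket_presort, its parameters and the back-to-back bucket layout inside digitValue are assumptions made for illustration; only the array names unsorted, digitValue and e come from the text. The caller chooses the per-bucket capacity cap, and digitValue and e must each hold NUM_BUCKETS * cap entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 256u

/* Scatters the 16-bit digit at bit offset `shift` of every element into
 * NUM_BUCKETS fixed-capacity buckets laid out back to back in digitValue.
 * e[k] receives the original element whose digit is stored in digitValue[k].
 * On success fill[b] holds the number of digits placed in bucket b.
 * Returns 0 on success, -1 if a bucket would exceed its capacity. */
static int bucket_presort(const uint32_t *unsorted, size_t n, unsigned shift,
                          size_t cap, uint16_t *digitValue, uint32_t *e,
                          size_t fill[NUM_BUCKETS])
{
    assert(shift <= 16);
    for (size_t b = 0; b < NUM_BUCKETS; b++)
        fill[b] = 0;

    for (size_t i = 0; i < n; i++) {
        uint16_t digit = (uint16_t)((unsorted[i] >> shift) & 0xFFFFu);
        size_t b = (size_t)(digit >> 8);      /* same as digit / 256 */
        if (fill[b] == cap)
            return -1;                        /* the size heuristic was too small */
        size_t k = b * cap + fill[b]++;
        digitValue[k] = digit;
        e[k] = unsorted[i];                   /* same index as in digitValue */
    }
    return 0;
}
```

Returning an error when a bucket fills up is one way to make a failure of the size heuristic visible instead of writing past the end of a bucket.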
For the fourth problem, I tried two OpenMP synchronization constructs: omp critical and
omp atomic. omp critical is more restrictive than necessary, because it prevents all other
threads from accessing the whole pointer array, whereas only threads trying to increment the
pointer of the same bucket need to be excluded. omp atomic suits this purpose, since it
effectively provides a mini critical section. However, due to the restricted syntax of omp
atomic, only a single update of the pointer (a write operation) can be protected by the
construct. This leads to a problem: when two threads read the pointer value and put their
digits into that position, a race condition may occur. I think that is why the correctness test
of my program with bucket pre-sorting fails. One feasible solution I can think of is to create a
lock array corresponding to the pointer array. Before reading and incrementing a pointer, a
thread must first acquire the lock corresponding to that pointer. However, the current program
can reasonably be viewed as an approximation of how much performance gain can be obtained
by using bucket pre-sort.
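As an alternative to the lock array, the read and the increment of a bucket pointer can be fused into one atomic operation with the atomic capture construct. This is a sketch of a swapped-in technique, not the approach used in the project: atomic capture was introduced in OpenMP 3.1, which postdates this report, and the function name and parameters below are hypothetical. The capacity cap is assumed to be large enough for every bucket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 256u

/* Parallel bucket pre-sort: each insertion claims its slot with a single
 * atomic read-and-increment of the bucket's fill pointer, so no two threads
 * can obtain the same position. */
static void bucket_presort_parallel(const uint32_t *unsorted, size_t n,
                                    unsigned shift, size_t cap,
                                    uint16_t *digitValue, uint32_t *e,
                                    size_t fill[NUM_BUCKETS])
{
    assert(shift <= 16);
    for (size_t b = 0; b < NUM_BUCKETS; b++)
        fill[b] = 0;

    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        uint16_t digit = (uint16_t)((unsorted[i] >> shift) & 0xFFFFu);
        size_t b = (size_t)(digit >> 8);
        size_t k;

        #pragma omp atomic capture
        k = fill[b]++;               /* old value is captured atomically */

        assert(k < cap);             /* assumption: buckets never overflow */
        digitValue[b * cap + k] = digit;
        e[b * cap + k] = unsorted[i];
    }
}
```

This removes the lost-update race on the pointer, but the order in which threads claim slots inside a bucket is still not deterministic, so the order of elements within a bucket can differ from run to run.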
In summary, adding the bucket pre-sort scheme adds extra CPU computation, synchronization
and space requirements. As a result, the performance gain is not as large as I expected. I
report the timing in the following section.
4. Timing Report and Instruction to Run the Program:
Program Version               Radix    CPU Time    Speedup
Sequential                    65536    7.4571      -
Sequential                    256      4.9610      base time
Parallel_basic                65536    5.4770      0.9057
Parallel_basic                256      6.4970      0.7635
Parallel_local_count_array    65536    3.9780      1.2471
Parallel_local_count_array    256      4.8990      1.0126
Parallel_bucket_pre_sort      65536    3.4630      1.4320
Table 1 Final project timing report. Input array size: 25,000,000; test platform: AMD dual-core
laptop; number of threads running in the parallel program: 2.
The speedup of parallel radix sort with bucket pre-sort over sequential radix sort with a
radix of 256 is 1.432.
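The Speedup column of Table 1 is the base time divided by the CPU time of the version in question. Written out for the bucket pre-sort version (the symbols T_seq,256 and T_bucket are introduced here only for illustration):

```latex
\[
\text{speedup} = \frac{T_{\text{seq},256}}{T_{\text{bucket}}}
               = \frac{4.9610}{3.4630} \approx 1.43
\]
```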
To run the program in Visual Studio, first create a project with one of the source files and
compile it. Supposing the executable is named radix_sort.exe, run the program by specifying
the number of elements to be sorted on the command line, e.g.
C:\eecs221 final\radix_sort 25000000
[1] N. M. Amato. A Comparison of Parallel Sorting Algorithms on Different Architectures.
TAMU technical report 98-029, Jan. 1996.
[2] M. Zagha and G. E. Blelloch. Radix sort for vector multiprocessors. In Proceedings
Supercomputing '91, pages 712–721, Nov. 1991.