Parallel Implementation of Compressed Sensing
Algorithm on CUDA-GPU
Kuldeep Yadav1, Ankush Mittal2, M. A. Ansar3, Avi Srivastava4
1, 2, 4 Computer Science and Engineering, College of Engineering Roorkee, Roorkee-247667, INDIA
3 Galgotia College of Engineering, Gr. Noida, INDIA
firstname.lastname@example.org
Abstract - In the field of biomedical imaging, compressed sensing (CS) plays an important role because compressed sensing, a joint compression and sensing process, is an emerging field of activity in which the signal is sampled and simultaneously compressed at a greatly reduced rate. Compressed sensing is a new paradigm for signal, image, and function acquisition. In this paper we have worked on the Basis Pursuit algorithm for compressed sensing. We computed the running time of this algorithm on a CPU, an Intel® Core™2 Duo T8100 @ 2.1 GHz with 3.0 GB of main memory, running Windows XP. The next step was to port this code to the GPU, an NVIDIA GeForce 8400M GS with 256 MB of DDR2 video memory and a 64-bit bus width. The graphics driver used is the 197.15 series from NVIDIA. Both the CPU and GPU versions of the algorithm are implemented in Matlab R2008b. The CPU version of the algorithm is analyzed in plain Matlab, while the GPU version is implemented with the help of the intermediate software Jacket v1.3. To use Jacket, we had to make some changes in our source code so that the CPU and GPU work simultaneously, reducing the overall computation time of the algorithm. Graphics processing units (GPUs) are emerging as powerful parallel systems at a cheap cost of a few thousand rupees. We obtained a speedup of around 8X for most of the biomedical images; six of them are included in this paper and can be analyzed via the profiler.

Keywords - Compressive sensing, Basis Pursuit algorithms, Jacket v1.3, GPU, medical image processing, high performance

I. INTRODUCTION

Recent papers [1, 2, 3, 4] have introduced the concept known as compressed sensing (among other related terms). The basic principle is that sparse or compressible signals can be reconstructed from a surprisingly small number of linear measurements, provided that the measurements satisfy an incoherence property. Such measurements can then be regarded as a compression of the original signal, which can be recovered if it is sufficiently compressible. A few of the many potential applications are medical image reconstruction [5], image acquisition [6], and sensor networks [7]. The first algorithm presented in this context is known as basis pursuit [8]. Let Φ be an M × N measurement matrix, and Φx = b the vector of M measurements of an N-dimensional signal x. The reconstructed signal u* is the minimizer of the L1 norm subject to the data: min ||u||_1 subject to Φu = b. A remarkable result of Candes and Tao [9] is that if, for example, the rows of Φ are randomly chosen, Gaussian distributed vectors, there is a constant C such that if the support of x has size K and M ≥ CK log(N/K), then the solution will be exactly u* = x with overwhelming probability. The required C depends on the desired probability of success, which in any case tends to one as N → ∞. Donoho and Tanner [10] have computed sharp reconstruction thresholds for Gaussian measurements, so that for any choice of sparsity K and signal size N, the required number of measurements M to recover x can be determined precisely. In this study, we implemented the Basis Pursuit algorithm on NVIDIA's GeForce 8400 GPU with the Compute Unified Device Architecture (CUDA) programming environment. Hence, we have chosen to implement it, and we hope that other GPGPU researchers in the field will also make the same choice to standardize the
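The basis pursuit problem min ||u||_1 subject to Φu = b can be recast as a linear program, which makes the recovery guarantee above easy to check numerically. The following is a small illustrative sketch (not the implementation used in this paper), using randomly chosen Gaussian rows for Φ as in the Candes-Tao setting; the LP variables are [u; t] with -t ≤ u ≤ t, minimizing sum(t):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, M, K = 64, 32, 4  # signal size, measurements, sparsity

# K-sparse signal x and Gaussian measurement matrix Phi
x = np.zeros(N)
support = rng.choice(N, K, replace=False)
x[support] = rng.standard_normal(K)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
b = Phi @ x

# Basis pursuit as an LP over [u; t]: min sum(t)
# subject to u - t <= 0, -u - t <= 0, Phi u = b
c = np.concatenate([np.zeros(N), np.ones(N)])
I = np.eye(N)
A_ub = np.block([[I, -I], [-I, -I]])
b_ub = np.zeros(2 * N)
A_eq = np.hstack([Phi, np.zeros((M, N))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
              bounds=[(None, None)] * (2 * N), method="highs")
u_star = res.x[:N]  # reconstructed signal
print(np.max(np.abs(u_star - x)))
```

With M well above CK log(N/K), the minimizer u* matches the original sparse x to solver precision, as the theory predicts.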
II. BACKGROUND

In this section, we discuss Jacket v1.3, which uses graphics processors for general-purpose computing. Over the past few years, specialized coprocessors, from floating-point hardware to field-programmable gate arrays, have enjoyed a widening performance gap with traditional x86-based processors. Of these, graphics processing units (GPUs) have advanced at an astonishing rate, currently capable of delivering over 1 TFLOPS of single-precision performance and over 300 GFLOPS of double precision while executing up to 240 simultaneous threads in one low-cost package. As such, GPUs have gained significant popularity as powerful tools for high-performance computing (HPC), achieving 20-100 times the speed of their x86 counterparts in applications such as physics simulation, computer vision, options pricing, sorting, and search. As in previous research, compressed sensing studies based on GPUs provide fast implementations; however, only a small number of these GPU-based studies concentrate on compressed sensing. Since the GPU we have used (NVIDIA 8400M GS) is the most basic model, has high portability, and is easily available in present-day laptops and desktops, the implementation can be deployed directly. However, synchronizing host and device with a suitable parallel implementation is the most challenging part, and this is what we have parallelized. Basis Pursuit algorithms cannot be parallelized straightforwardly because of the distribution part, so our solution provides a balanced, data-distributed parallelization framework for Basis Pursuit algorithms on CUDA without compromising numerical precision. We have broken the process into blocks of threads and managed those threads inside the special thread-managing hardware, the GPU, with the help of the environment and set of libraries provided by CUDA.

A. Jacket Overview

Jacket connects Matlab to the GPU. Matlab is a technical computing language that integrates computation, visualization, and programming in an easy-to-use environment, and it has found wide popularity in both industry and academia. It is used across the breadth of technical computing applications, including mathematical computation, algorithm development, data analysis, data visualization, and application development. With the GPU as a backend computation engine, Jacket brings together the best of three important computational worlds: computational speed, visualization, and the user friendliness of M programming. Jacket enables developers to write and run code on the GPU in the native M language used in Matlab. Jacket accomplishes this by automatically wrapping the M language into a GPU-compatible form. By simply casting input data to Jacket's GPU data structure, Matlab functions are transformed into GPU functions. Jacket also preserves the interpretive nature of the M language by providing real-time, transparent access to the GPU compiler.

B. Integration with Matlab

Once Jacket is installed, it is transparently integrated with Matlab's user interface, and the user can start working interactively through the Matlab desktop and command window, as well as write M-functions using the Matlab editor and debugger. All Jacket data is visible in the Matlab workspace, along with any other Matlab matrices.

C. GPU Data Types

Jacket provides GPU counterparts to Matlab's CPU data types, such as real and complex double, single, uint32, int32, logical, etc. Any variable residing in host (CPU) memory can be cast to Jacket's GPU data types. Jacket's memory management system allocates and manages memory for these variables on the GPU automatically, behind the scenes. Any function called on GPU data will execute on the GPU automatically, without any extra programming. Jacket provides a large set of GPU functions, ranging from functions like sum, sine, cosine, and complex arithmetic to more sophisticated functions like matrix inverse, singular value decomposition, Bessel functions, and Fast Fourier Transforms; the supported set of functions continues to grow with every release of Jacket (see the Function Reference Guide). The Jacket runtime provides automated memory management, on-the-fly compilation, and execution optimizations for Jacket-enabled code. Jacket's Graphics Toolbox enables a merger of GPU visualization with computation: a simple graphics command can be added at the end of a simulation loop to visualize data as it is being computed, while maintaining performance. The Developer SDK makes integration of custom CUDA code into Jacket's runtime very easy; with a few simple SDK functions, custom CUDA code can benefit from the optimized Jacket platform. When Jacket applications have completed the development, test, and optimization stages and are ready for deployment, the Jacket Matlab Compiler allows users to generate license-free executables for distribution to larger user bases (see the SDK and JMC Wiki pages). Interactive help for any Jacket function is available using Jacket's ghelp function.

III. IMPLEMENTATION

In this section, first, we introduce the general scheme of the Basis Pursuit algorithm. Then, we introduce our GPU implementation environment by first discussing why GPUs are a good fit for medical imaging applications and then presenting NVIDIA's CUDA platform and the GeForce 8400M architecture. Next, we describe the CPU implementation environment. This is followed by a description of the test data used in the experiments. Finally, we provide the list of CUDA kernels used in our GPU implementation.

A. Basis Pursuit Compression Algorithm

This algorithm takes an image as input and reads it as a matrix. It then decomposes the image into blocks, and the blocks into many columns. Each of the columns is then processed and compressed. Each of the compressed columns is reconstructed into compressed blocks. Each compressed block is then reconstructed and sampled to get the final compressed image. Six functions are used in this algorithm: bp_basis, bp_decompose, bp_construct, bp_block_decompose, bp_block_construct, and imagproc. These functions are collectively used to decompose the image into blocks, compress the image, and reconstruct it by sampling it again. The general scheme of the algorithm is shown below.

/* Pseudo code of the Basis Pursuit algorithm */
1. Load a real-time RGB image.
2. Assign it to a double-precision matrix.
3. For each row of blocks:
4.    Using CUDA threads, decompose column-wise by the basis pursuit algorithm; find the decomposed matrix.
5. Repeat until all three colour blocks are decomposed.
6. For each row of blocks:
7.    In parallel, using CUDA threads, reconstruct column-wise by the basis pursuit algorithm (considering complex matrices also); find the reconstructed matrix. End for.
8. Repeat until all three colour blocks are reconstructed.
9. Combine all three colour blocks to give a matrix.
10. Convert the double-precision compressed image data to unsigned int values.
11. Scale the image data.
B. GPU Implementation Environment

We have implemented a GPU version of Basis Pursuit (BP) with NVIDIA's GPU programming environment, CUDA v0.9. The era of single-threaded processor performance increases has come to an end; programs will only increase in performance if they utilize parallelism. However, there are different kinds of parallelism. For instance, multicore CPUs provide task-level parallelism, while GPUs provide data-level parallelism. Depending on the application area, the preferred type of parallelism might change; hence, GPUs are not a good fit for all problems. However, medical imaging applications are very suitable for implementation on a GPU architecture, because these applications intrinsically have data-level parallelism with high compute requirements, and GPUs provide the best cost-per-performance parallel architecture for implementing such algorithms. In addition, most medical imaging applications (e.g. semi-automatic segmentation) require, or benefit from, visual interaction, which GPUs naturally provide. Hence, the use of the GPU in non-graphics-related, highly parallel applications, such as medical imaging, has become much easier than before. Since it is cumbersome to use graphics APIs for non-graphics tasks such as medical applications, NVIDIA introduced CUDA to perform data-parallel computations on the GPU without the need to map them to the graphics API, hiding the graphics-centric nature of previous environments and GPU-specific details, and allowing the programmer to think in terms of memory and math operations as in CPU programs, instead of the primitives, fragments, and textures that are specific to graphics programs. CUDA is available for the NVIDIA GeForce 8400 (G80) series and beyond. The GeForce 8400 GS model has 256 MB of DDR2 video memory with a 64-bit bus width; by comparison, the memory bandwidth of the GeForce 8800 GTX is 80+ GB/s. To get the best performance from the G80 architecture, we have to keep the 128 processors occupied and hide memory latency. In order to achieve this goal, CUDA runs hundreds or thousands of fine, lightweight threads in parallel. In CUDA, programs are expressed as kernels. Kernels have a Single Program Multiple Data (SPMD) programming model, which is essentially a Single Instruction Multiple Data (SIMD) programming model that allows limited divergence in execution. A part of the application that is executed many times, but independently on different elements of a dataset, can be isolated into a kernel that is executed on the GPU in the form of many different threads. Kernels run on a grid, which is an array of blocks, and each block is an array of threads. Blocks are mapped to multiprocessors within the G80 architecture, and each thread is mapped to a single processor. Threads within a block can share memory on a multiprocessor, but two threads from two different blocks cannot cooperate. The GPU hardware performs switching of threads on multiprocessors to keep processors busy and hide memory latency. Thus, thousands of threads can be in flight at the same time, and CUDA kernels are executed on all elements of the dataset in parallel. We would like to mention that in our implementation, increasing the dataset size does not affect the shared memory usage. This is because, to deal with larger datasets, we only increase the number of blocks and keep the shared memory allocations in a thread, as well as the number of threads in a block, the same.

C. CPU Implementation Environment

The CPU version of basis pursuit is implemented in Matlab with integration of Jacket v1.3. The computer used for the CPU implementation is an Intel® Core™2 Duo T8100 @ 2.1 GHz with 3.0 GB of main memory, running Windows XP (SP2). The CPU implementation was executed on this machine in both single-threading and multi-threading modes; OpenMP is used to implement the multi-threading.
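To make the grid/block/thread indexing described in Section III-B concrete, here is a toy, purely sequential Python emulation (not real CUDA, and the function names are our own illustration). Note how a larger dataset only increases the number of blocks, while threads per block stay fixed, matching the shared-memory observation above:

```python
import math

def launch(kernel, grid_dim, block_dim, *args):
    """Emulate CUDA's grid-of-blocks launch: every (block, thread)
    pair runs the same kernel (SPMD) on its own data element."""
    for block in range(grid_dim):
        for thread in range(block_dim):
            kernel(block, thread, block_dim, *args)

def scale_kernel(block, thread, block_dim, data, out, factor):
    i = block * block_dim + thread      # global element index
    if i < len(data):                   # guard the partial tail block
        out[i] = data[i] * factor

data = list(range(10))
out = [0.0] * len(data)
THREADS_PER_BLOCK = 4                               # fixed per launch
blocks = math.ceil(len(data) / THREADS_PER_BLOCK)   # grows with data size
launch(scale_kernel, blocks, THREADS_PER_BLOCK, data, out, 2.0)
print(out)
```

On real hardware the two loops run concurrently across multiprocessors; only the index arithmetic is the same.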
D. Test Data of (BP) Compressed Results

Our test data consists of six MRI images, shown in Figure 1, with which we measure the performance of the CPU and GPU implementations with the help of the profiler. Figures 2, 3, and 4 below show the profiler results for the CPU-based and GPU-based implementations. It is clear from the figures that significant speedup is obtained for Basis Pursuit on the GPU in comparison to the CPU implementation.
(a) MRI1 (b) MRI 2 (c) MRI 3
(d) MRI 4 (e) MRI 5 (f) MRI 6
Figure 1: Test Data Images Under Measure the Performance
Figure 2: Profile on CPU
Figure 3: Profile on GPU
IV. EXPERIMENTS AND RESULTS

In this section, we first compare the runtimes of our GPU and CPU implementations for datasets with different sizes and present our speedup. Then, we show visual results by providing slices from one of the datasets. Next, we provide the breakdown of the GPU implementation runtime into the CUDA kernels and present GFLOP (giga floating-point operations = 10^9 FLOP) and GiB (gibibytes = 2^30 bytes) instruction and data throughput information. This is followed by a discussion of the factors that limit our performance. Finally, we compare our implementation with Sharp et al.'s implementation and highlight our improvements. We measured the performance on the CPU and GPU with the help of the profiler; Figures 2 and 3 show the profiler results for the CPU-based and GPU-based implementations. It is clear from the figures that significant speedup is obtained. We have achieved around 8X speedup on the GPU over a single-threaded CPU implementation. We can calculate the speedup using the following formula: speedup = (time taken to compress the image on the CPU) / (time taken to compress the image on the GPU).
Table 1. Performance of GPU implementation with respect to CPU implementation

S. No. | Image | Size      | Time to run program on CPU (sec) | Time to run BP_DECOMPOSE on CPU (sec) | Time to run BP_DECOMPOSE on GPU (sec) | Depth
1      | MRI1  | 256 x 256 | 118.52 | 105.2   | 14.06 | 512
2      | MRI2  | 256 x 256 | 136.47 | 123.49  | 20.31 | 1024
3      | MRI3  | 256 x 256 | 139.56 | 125.159 | 18.19 | 1024
4      | MRI4  | 256 x 256 | 98.87  | 84.25   | 11.90 | 512
5      | MRI5  | 256 x 256 | 120.34 | 106.1   | 16.31 | 512
6      | MRI6  | 256 x 256 | 138.47 | 124.50  | 21.02 | 1024
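Applying the speedup formula (CPU time divided by GPU time) to the BP_DECOMPOSE timings in Table 1 gives per-image speedups between roughly 5.9x and 7.5x, of the same order as the around-8X figure quoted from the whole-program profile:

```python
# BP_DECOMPOSE timings (seconds) taken from Table 1
cpu = [105.2, 123.49, 125.159, 84.25, 106.1, 124.50]
gpu = [14.06, 20.31, 18.19, 11.90, 16.31, 21.02]

# speedup = CPU time / GPU time, per image
speedups = [c / g for c, g in zip(cpu, gpu)]
for name, s in zip(["MRI1", "MRI2", "MRI3", "MRI4", "MRI5", "MRI6"], speedups):
    print(f"{name}: {s:.2f}x")
```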
Figure 4: Comparative Graphical Results on CPU & GPU
V. CONCLUSION

The work presented in this paper addresses one of the major obstacles to the clinical application of compressed sensing in bio-medical imaging, namely its high computational cost, by using the currently available Jacket software and an NVIDIA GeForce 8400M GS GPU with 256 MB of DDR2 video memory and a 64-bit bus width. We reduced the execution time from 50 seconds to 6 seconds for several bio-medical images, obtaining a speedup of about 8x.
REFERENCES

[1] R. Chartrand and W. Yin, "Iteratively reweighted algorithms for compressive sensing," in IEEE ICASSP, 2008.
[2] D. L. Donoho, "Compressed sensing," IEEE Trans. on Information Theory, vol. 52, no. 4, pp. 1289-1306, April 2006.
[3] Compressive Sensing Resources, http://www.dsp.ece.rice.edu/cs
[4] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Inf. Theory, vol. 52, 2006.
[5] M. Lustig, J. M. Santos, J.-H. Lee, D. L. Donoho, and J. M. Pauly, "Application of compressed sensing for rapid MR imaging," in SPARS, (Rennes, France), 2005.
[6] D. Takhar, J. N. Laska, M. B. Wakin, M. F. Duarte, D. Baron, S. Sarvotham, K. F. Kelly, and R. G. Baraniuk, "A new compressive imaging camera architecture using optical-domain compression," in Computational Imaging IV - Proceedings of SPIE-IS&T Electronic Imaging, vol. 6065, 2006.
[7] M. F. Duarte, S. Sarvotham, D. Baron, M. B. Wakin, and R. G. Baraniuk, "Distributed compressed sensing of jointly sparse signals," in 39th Asilomar Conference on Signals, Systems and Computers, pp. 1537-1541, 2005.
[8] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20.
[9] E. Candès and T. Tao, "Near optimal signal recovery from random projections: universal encoding strategies," IEEE Trans. Inf. Theory, vol. 52, pp. 5406-5425, 2006.
[10] D. L. Donoho and J. Tanner, "Thresholds for the recovery of sparse solutions via L1 minimization," in 40th Annual Conference on Information Sciences and Systems, pp. 202-206, 2006.