Parallel Implementation of Compressed Sensing Algorithm on CUDA-GPU

Kuldeep Yadav 1, Ankush Mittal 2, M. A. Ansari 3, Avi Srivastava 4
1, 2, 4 Computer Science and Engineering, College of Engineering Roorkee, Roorkee-247667, INDIA
3 Galgotia College of Engineering, Gr. Noida, INDIA
kul82_deep@rediffmail.com 1, dr.ankush.mittal@gmail.com 2, ma.ansari@ieee.org 3, aviarodonix12@yahoo.com 4

Abstract - In the field of biomedical imaging, compressed sensing (CS) plays an important role: compressed sensing, a joint compression and sensing process, is an emerging field of activity in which a signal is sampled and simultaneously compressed at a greatly reduced rate. Compressed sensing is a new paradigm for signal, image and function acquisition. In this paper we work with the Basis Pursuit algorithm for compressed sensing. We measured the running time of this algorithm on a CPU, an Intel® Core™2 Duo T8100 @ 2.1 GHz with 3.0 GB of main memory, running Windows XP. The next step was to port this code to the GPU, an NVIDIA GeForce 8400M GS with 256 MB of DDR2 video memory and a 64-bit bus, using NVIDIA graphics driver release 197.15. Both the CPU and GPU versions of the algorithm were implemented in Matlab R2008b. The CPU version runs in plain Matlab, while the GPU version is implemented with the help of the intermediate software Jacket v1.3. To use Jacket, we made some changes in our source code so that the CPU and GPU work together, reducing the overall computation time of the algorithm. Graphics Processing Units (GPUs) are emerging as powerful parallel systems at a cost of only a few thousand rupees. We obtained a speedup of around 8X for most of the biomedical images; six of them are included in this paper and can be analyzed via the profiler.

Keywords — Compressive sensing, Basis Pursuit algorithms, Jacket v1.3, GPU, medical image processing, high performance computing.

                     I. INTRODUCTION

Recent papers [1, 2, 3, 4] have introduced the concept known as compressed sensing (among other related terms). The basic principle is that sparse or compressible signals can be reconstructed from a surprisingly small number of linear measurements, provided that the measurements satisfy an incoherence property. Such measurements can then be regarded as a compression of the original signal, which can be recovered if it is sufficiently compressible. A few of the many potential applications are medical image reconstruction [5], image acquisition [6], and sensor networks [7]. The first algorithm presented in this context is known as basis pursuit [8]. Let Φ be an M × N measurement matrix, and Φx = b the vector of M measurements of an N-dimensional signal x. The reconstructed signal u∗ is the minimizer of the L1 norm subject to the data:

    min ||u||₁  subject to  Φu = b.

A remarkable result of Candes and Tao [9] is that if, for example, the rows of Φ are randomly chosen, Gaussian distributed vectors, there is a constant C such that if the support of x has size K and M ≥ CK log(N/K), then the solution will be exactly u∗ = x with overwhelming probability. The required C depends on the desired probability of success, which in any case tends to one as N → ∞. Donoho and Tanner [10] have computed sharp reconstruction thresholds for Gaussian measurements, so that for any choice of sparsity K and signal size N, the required number of measurements M to recover x can be determined precisely. In this study, we implemented the Basis Pursuit algorithm on NVIDIA's GeForce 8400 GPU with the Compute Unified Device Architecture (CUDA) programming environment. We hope that other GPGPU researchers in the field will make the same choice, so as to standardize performance comparisons.
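As a concrete illustration of the reconstruction above, basis pursuit can be posed as a linear program and solved with off-the-shelf tools. The following Python/NumPy/SciPy sketch is our own illustration, not the paper's Matlab/Jacket code; the sizes N = 32, M = 12 and the 1-sparse test signal are arbitrary choices:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Phi, b):
    """Solve min ||u||_1 subject to Phi @ u = b by the standard LP
    reformulation: write u = u_pos - u_neg with u_pos, u_neg >= 0 and
    minimize sum(u_pos) + sum(u_neg)."""
    M, N = Phi.shape
    c = np.ones(2 * N)                    # objective coefficients
    A_eq = np.hstack([Phi, -Phi])         # Phi @ (u_pos - u_neg) = b
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None))
    return res.x[:N] - res.x[N:]

# Toy recovery experiment in the spirit of the Candes-Tao result:
# a K-sparse signal recovered from M random Gaussian measurements.
rng = np.random.default_rng(0)
N, M = 32, 12
x = np.zeros(N)
x[5] = 3.0                                # K = 1 nonzero entry
Phi = rng.standard_normal((M, N))
b = Phi @ x
u = basis_pursuit(Phi, b)
print(np.round(u[5], 4))
```

With K = 1 and M = 12 Gaussian measurements, well inside the M ≥ CK log(N/K) regime, the L1 minimizer coincides with x with overwhelming probability.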




                                                                  112                               http://sites.google.com/site/ijcsis/
                                                                                                    ISSN 1947-5500
                     II. BACKGROUND

In this section we discuss Jacket v1.3, which enables graphics processors to be used for general-purpose computing. Over the past few years, specialized coprocessors, from floating point hardware to field programmable gate arrays, have enjoyed a widening performance gap with traditional x86-based processors. Of these, graphics processing units (GPUs) have advanced at an astonishing rate, currently capable of delivering over 1 TFLOPS of single precision performance and over 300 GFLOPS of double precision while executing up to 240 simultaneous threads in one low-cost package. As such, GPUs have gained significant popularity as powerful tools for high performance computing (HPC), achieving 20-100 times the speed of their x86 counterparts in applications such as physics simulation, computer vision, options pricing, sorting, and search. As in previous research, compressed sensing studies based on GPUs provide fast implementations; however, only a small number of these GPU-based studies concentrate on compressed sensing. Since the GPU we have chosen (NVIDIA 8400M GS) is the most basic model, is highly portable, and is easily available in present-day laptops and desktops, our implementation can be used directly. However, synchronizing host and device with a suitable parallel implementation, which we have done here, is the most challenging part. Basis Pursuit cannot be parallelized straightforwardly because of its data distribution step, so our solution provides a balanced, data-distributed parallelization framework for Basis Pursuit on CUDA without compromising numerical precision. We have broken the process into blocks of threads and managed those threads on the GPU, a specialized thread-managing hardware, with the help of the environment and set of libraries provided by CUDA.

A. Jacket Overview

Jacket connects Matlab to the GPU. Matlab is a technical computing language that integrates computation, visualization and programming in an easy-to-use environment, and it has found wide popularity in both industry and academia. It is used across the breadth of technical computing applications, including mathematical computation, algorithm development, data analysis, data visualization and application development. With the GPU as a backend computation engine, Jacket brings together the best of three important computational worlds: computational speed, visualization, and the user friendliness of M programming. Jacket enables developers to write and run code on the GPU in the native M language used in Matlab. Jacket accomplishes this by automatically wrapping the M language into a GPU-compatible form. By simply casting input data to Jacket's GPU data structure, Matlab functions are transformed into GPU functions. Jacket also preserves the interpretive nature of the M language by providing real-time, transparent access to the GPU compiler.

B. Integration with Matlab

Once Jacket is installed, it is transparently integrated with Matlab's user interface, and the user can start working interactively through the Matlab desktop and command window, as well as write M-functions using the Matlab editor and debugger. All Jacket data is visible in the Matlab workspace, along with any other Matlab matrices.

C. GPU Data Types

Jacket provides GPU counterparts to Matlab's CPU data types, such as real and complex double, single, uint32, int32, logical, etc. Any variable residing in host (CPU) memory can be cast to Jacket's GPU data types. Jacket's memory management system allocates and manages memory for these variables on the GPU automatically, behind the scenes. Any functions called on GPU data will execute on the GPU automatically without any extra



programming.

Jacket provides the largest available set of GPU functions, ranging from functions like sum, sine, cosine, and complex arithmetic to more sophisticated functions like matrix inverse, singular value decomposition, Bessel functions, and Fast Fourier Transforms. The supported set of functions continues to grow with every release of Jacket (see the Function Reference Guide). The Jacket runtime provides automated memory management, on-the-fly compilation, and execution optimizations for Jacket-enabled code. Jacket's Graphics Toolbox enables merging GPU visualization with computation: a simple graphics command can be added at the end of a simulation loop to visualize data as it is being computed, while maintaining performance. The Developer SDK makes integration of custom CUDA code into Jacket's runtime easy; with a few simple SDK functions, custom CUDA code can benefit from the optimized Jacket platform. When Jacket applications have completed the development, test, and optimization stages and are ready for deployment, the Jacket Matlab Compiler allows users to generate license-free executables for distribution to larger user bases (see the SDK and JMC Wiki pages). Interactive help for any Jacket function is available using Jacket's ghelp function.

                   III. IMPLEMENTATION

In this section we first introduce the general scheme of the Basis Pursuit algorithm. Then we introduce our GPU implementation environment, first discussing why GPUs are a good fit for medical imaging applications and then presenting NVIDIA's CUDA platform and the GeForce 8400M architecture. Next, we describe the CPU implementation environment, followed by a description of the test data used in the experiments. Finally, we list the CUDA kernels used in our GPU implementation.

A. Basis Pursuit Compression Algorithm

This algorithm takes an image as input and reads it as a matrix. It then decomposes the image into blocks, and each block into columns. Each column is processed and compressed; each compressed column is reassembled into a compressed block, and each compressed block is then reconstructed and sampled to produce the final compressed image. The algorithm uses six functions: bp_basis, bp_decompose, bp_construct, bp_block_decompose, bp_block_construct, and imagproc. These functions are collectively used to decompose the image into blocks, compress the image, and reconstruct it by sampling it again. The general scheme of the algorithm is shown below.

/* Pseudo code of the Basis Pursuit algorithm */
1.  Load a real-time RGB image.
2.  Store it as a double precision matrix.
3.  repeat
        for each row of blocks
4.          Using CUDA threads, decompose column-wise by the basis pursuit algorithm.
            Find the decomposed matrix.
        end for
5.  until all three colour blocks are decomposed.
6.  repeat
        for each row of blocks
7.          In parallel, using CUDA threads, reconstruct column-wise by the basis pursuit algorithm.
            Consider the complex matrix as well.
            Find the reconstructed matrix.
        end for
8.  until all three colour blocks are reconstructed.
9.  Combine all three colour blocks into one matrix.
10. Convert the double precision compressed image data to unsigned integer values.
11. Scale the image data.
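The block-and-column pipeline in the pseudo code can be illustrated with a toy NumPy version. This is not the paper's bp_* Matlab/Jacket code: the FFT-and-threshold step below is a hedged stand-in for the per-column basis pursuit compression, and the block size and coefficient count are arbitrary choices.

```python
import numpy as np

def compress_channel(channel, block=8, keep=4):
    """Toy stand-in for the pipeline above: split one colour channel
    into block x block tiles, process each tile column by column (the
    part mapped to CUDA threads in the paper), and keep only the
    'keep' largest-magnitude transform coefficients per column."""
    H, W = channel.shape
    out = np.zeros((H, W))
    for i in range(0, H, block):
        for j in range(0, W, block):
            tile = channel[i:i + block, j:j + block].astype(float)
            for c in range(tile.shape[1]):
                coef = np.fft.fft(tile[:, c])            # transform one column
                small = np.argsort(np.abs(coef))[:-keep]
                coef[small] = 0                          # drop small coefficients
                out[i:i + block, j + c] = np.fft.ifft(coef).real
    return out

channel = np.arange(256, dtype=float).reshape(16, 16)
rec = compress_channel(channel)
print(rec.shape)
```

An RGB image would run this once per colour channel and then recombine and rescale the channels, as in steps 9-11 of the pseudo code.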




B. GPU Implementation Environment

We have implemented a GPU version of Basis Pursuit (BP) with NVIDIA's GPU programming environment, CUDA v0.9. The era of single-threaded processor performance increases has come to an end; programs will only increase in performance if they exploit parallelism. However, there are different kinds of parallelism. For instance, multicore CPUs provide task-level parallelism, while GPUs provide data-level parallelism. Depending on the application area, the preferred type of parallelism may change; hence, GPUs are not a good fit for all problems. Medical imaging applications, however, are very well suited to GPU architectures, because these applications intrinsically have data-level parallelism with high compute requirements, and GPUs provide the best cost-per-performance parallel architecture for implementing such algorithms. In addition, most medical imaging applications (e.g. semi-automatic segmentation) require, or benefit from, visual interaction, which GPUs naturally provide. Since it is cumbersome to use graphics APIs for non-graphics tasks such as medical applications, NVIDIA introduced CUDA to perform data-parallel computations on the GPU without mapping them to a graphics API, hiding GPU-specific details and allowing the programmer to think in terms of memory and math operations as in CPU programs, instead of the primitives, fragments, and textures that are specific to graphics programs. This has made the use of the GPU in non-graphics, highly parallel applications, such as medical imaging, much easier than before. CUDA is available for the NVIDIA GeForce 8400 (G80) series and beyond. The GeForce 8400 GS model has 256 MB of DDR2 video memory with a 64-bit bus; the memory bandwidth of the GeForce 8400 GTX is 80+ GB/s. To get the best performance from the G80 architecture, we have to keep the 128 processors occupied and hide memory latency. To achieve this goal, CUDA runs hundreds or thousands of fine, lightweight threads in parallel.

In CUDA, programs are expressed as kernels. Kernels follow a Single Program Multiple Data (SPMD) programming model, which is essentially a Single Instruction Multiple Data (SIMD) model that allows limited divergence in execution. A part of the application that is executed many times, but independently on different elements of a dataset, can be isolated into a kernel that is executed on the GPU in the form of many different threads. Kernels run on a grid, which is an array of blocks; each block is an array of threads. Blocks are mapped to multiprocessors within the G80 architecture, and each thread is mapped to a single processor. Threads within a block can share memory on a multiprocessor, but two threads from two different blocks cannot cooperate. The GPU hardware switches threads on multiprocessors to keep processors busy and hide memory latency. Thus, thousands of threads can be in flight at the same time, and CUDA kernels are executed on all elements of the dataset in parallel. We note that in our implementation, increasing the dataset size does not affect shared memory usage: to deal with larger datasets, we only increase the number of blocks, keeping the shared memory allocations in a thread, as well as the number of threads in a block, the same.

C. CPU Implementation Environment

The CPU version of Basis Pursuit is implemented in Matlab with Jacket v1.3 integration. The computer used for the CPU implementation is an Intel® Core™2 Duo T8100 @ 2.1 GHz with 3.0 GB of main memory, running Windows XP (SP2). The CPU implementation was executed on this machine in both single-threading and multi-threading modes; OpenMP is used to implement the multi-threading part.

D. Test Data of (BP) Compressed Results



Our test data consists of six MRI images, shown in Figure 1. We measure the performance of the CPU and GPU implementations on these images with the help of the profiler; Figures 2, 3 and 4 below show the profiler results for the CPU and GPU based implementations. It is clear from the figures that a significant speedup is obtained for Basis Pursuit on the GPU in comparison to the CPU implementation.




                 (a) MRI1                (b) MRI2                (c) MRI3

                 (d) MRI4                (e) MRI5                (f) MRI6

                 Figure 1: Test Data Images Used to Measure the Performance




                                                    Figure 2: Profile on CPU




                                                    Figure 3: Profile on GPU




                 IV. EXPERIMENT AND RESULTS

In this section, we first compare the runtimes of our GPU and CPU implementations for datasets of different sizes and present our speedup. Then, we show visual results by providing slices from one of the datasets. Next, we provide the breakdown of the GPU implementation runtime over the CUDA kernels and present GFLOP (giga floating point operations = 10^9 FLOP) and GiB (gibibytes = 2^30 bytes) instruction and data throughput figures. This is followed by a discussion of the factors that limit our performance. Finally, we compare our implementation with Sharp et al.'s implementation and highlight our improvements. We measure the performance on the images on CPU and GPU with the help of the profiler; Figures 2 and 3 show the profiler results for the CPU and GPU based implementations. It is clear from the figures that a significant speedup is obtained: we achieved around 8X speedup on the GPU over a single-threaded CPU implementation. The speedup is calculated with the following formula: speedup = time taken to compress the image on the CPU / time taken to compress the image on the GPU.




                          Table 1. Performance of GPU implementation with respect to CPU implementation

S. No.  Image   Size        Total program     BP_DECOMPOSE    BP_DECOMPOSE    Depth
                            time on CPU (s)   on CPU (s)      on GPU (s)
1.      MRI1    256 x 256   118.52            105.2           14.06           512
2.      MRI2    256 x 256   136.47            123.49          20.31           1024
3.      MRI3    256 x 256   139.56            125.159         18.19           1024
4.      MRI4    256 x 256   98.87             84.25           11.90           512
5.      MRI5    256 x 256   120.34            106.1           16.31           512
6.      MRI6    256 x 256   138.47            124.50          21.02           1024
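As a quick sanity check on Table 1 (using the numbers exactly as printed), the per-image speedup of the BP_DECOMPOSE stage follows directly from the formula given above, speedup = CPU time / GPU time:

```python
# (cpu_seconds, gpu_seconds) for BP_DECOMPOSE, rows MRI1..MRI6 of Table 1.
times = [(105.2, 14.06), (123.49, 20.31), (125.159, 18.19),
         (84.25, 11.90), (106.1, 16.31), (124.50, 21.02)]
speedups = [cpu / gpu for cpu, gpu in times]
print([round(s, 1) for s in speedups])
```

The per-stage speedups range from roughly 5.9X to 7.5X, in line with the "around 8X" overall speedup reported in the text.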




                                    Figure 4: Comparative Graphical Results on CPU & GPU



                 V. CONCLUSION

The work presented in this paper addresses one of the major problems in the clinical application of compressed sensing to biomedical imaging, namely its high computational cost, using the currently available Jacket software and an NVIDIA GeForce 8400M GS GPU with 256 MB of DDR2 video memory and a 64-bit bus. We reduced the execution time from 50 seconds to 6 seconds for several biomedical images, obtaining an overall system speedup of around 8X.



                               REFERENCES

[1]  R. Chartrand and W. Yin, "Iteratively reweighted algorithms for compressive
     sensing," in IEEE ICASSP, 2008.
[2]  D. L. Donoho, "Compressed sensing," IEEE Trans. on Information Theory,
     vol. 52, no. 4, pp. 1289–1306, April 2006.
[3]  http://www.dsp.ece.rice.edu/cs
[4]  E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles:
     Exact signal reconstruction from highly incomplete frequency
     information," IEEE Trans. Inf. Theory, vol. 52, 2006.
[5]  M. Lustig, J. M. Santos, J.-H. Lee, D. L. Donoho, and J. M. Pauly,
     "Application of compressed sensing for rapid MR imaging," in SPARS,
     (Rennes, France), 2005.
[6]  D. Takhar, J. N. Laska, M. B. Wakin, M. F. Duarte, D. Baron, S.
     Sarvotham, K. F. Kelly, and R. G. Baraniuk, "A new compressive
     imaging camera architecture using optical-domain compression," in
     Computational Imaging IV - Proceedings of SPIE-IS and T Electronic
     Imaging, vol. 6065, 2006.
[7]  M. F. Duarte, S. Sarvotham, D. Baron, M. B. Wakin, and R. G.
     Baraniuk, "Distributed compressed sensing of jointly sparse signals,"
     in 39th Asilomar Conference on Signals, Systems and Computers,
     pp. 1537–1541, 2005.
[8]  S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition
     by basis pursuit," SIAM J. Sci. Comput., vol. 20, pp. 33–61, 1998.
[9]  E. Candès and T. Tao, "Near optimal signal recovery from random
     projections: universal encoding strategies," IEEE Trans. Inf. Theory,
     vol. 52, pp. 5406–5425, 2006.
[10] D. L. Donoho and J. Tanner, "Thresholds for the recovery of sparse
     solutions via L1 minimization," in 40th Annual Conference on
     Information Sciences and Systems, pp. 202–206, 2006.



