Parallel Implementation of Compressed Sensing Algorithm on CUDA-GPU

Kuldeep Yadav(1), Ankush Mittal(2), M. A. Ansari(3), Avi Srivastava(4)
(1, 2, 4) Computer Science and Engineering, Galgotia College of Engineering, Gr. Noida, INDIA
(3) College of Engineering Roorkee, Roorkee-247667, INDIA
Kul82_deep@rediffmail.com, dr.ankush.mittal@gmail.com, ma.ansari@ieee.org, aviarodonix12@yahoo.com

Abstract - In the field of biomedical imaging, compressed sensing (CS) plays an important role: compressed sensing, a joint compression and sensing process, is an emerging field of activity in which the signal is sampled and simultaneously compressed at a greatly reduced rate. Compressed sensing is a new paradigm for signal, image and function acquisition. In this paper we work with the Basis Pursuit algorithm for compressed sensing. We measured the running time of this algorithm on a CPU (Intel Core 2 Duo T8100 @ 2.1 GHz with 3.0 GB of main memory, running Windows XP). The next step was to port this code to the GPU, i.e. to run the program on an NVIDIA GeForce 8400M GS with 256 MB of DDR2 video memory and a 64-bit bus, using the 197.15 series NVIDIA graphics driver. Both the CPU and GPU versions of the algorithm were implemented in Matlab R2008b. The CPU version was analyzed in plain Matlab, while the GPU version was implemented with the help of the intermediate software Jacket v1.3. To use Jacket, we had to make some changes in our source code so that the CPU and GPU work together, reducing the overall computation time. Graphics Processing Units (GPUs) are emerging as powerful parallel systems at a cheap cost of a few thousand rupees. We obtained a speedup of around 8X for most of the biomedical images; six of them are included in this paper and can be analyzed via the profiler.

Keywords - Compressive sensing, Basis Pursuit algorithm, Jacket v1.3, GPU, medical image processing, high performance computing.

I. INTRODUCTION

Recent papers [1, 2, 3, 4] have introduced the concept known as compressed sensing (among other related terms). The basic principle is that sparse or compressible signals can be reconstructed from a surprisingly small number of linear measurements, provided that the measurements satisfy an incoherence property. Such measurements can then be regarded as a compression of the original signal, which can be recovered if it is sufficiently compressible. A few of the many potential applications are medical image reconstruction [5], image acquisition [6], and sensor networks [7].

The first algorithm presented in this context is known as basis pursuit [8]. Let Φ be an M × N measurement matrix, and Φx = b the vector of M measurements of an N-dimensional signal x. The reconstructed signal u* is the minimizer of the L1 norm subject to the data: min ||u||_1 subject to Φu = b. A remarkable result of Candes and Tao [9] is that if, for example, the rows of Φ are randomly chosen Gaussian distributed vectors, then there is a constant C such that if the support of x has size K and M ≥ CK log(N/K), the solution will be exactly u* = x with overwhelming probability. The required C depends on the desired probability of success, which in any case tends to one as N → ∞. Donoho and Tanner [10] have computed sharp reconstruction thresholds for Gaussian measurements, so that for any choice of sparsity K and signal size N, the required number of measurements M to recover x can be determined precisely.

In this study, we implemented the Basis Pursuit algorithm on NVIDIA's GeForce 8400 GPU with the Compute Unified Device Architecture (CUDA) programming environment. We hope that other GPGPU researchers in the field will make the same choice, so as to standardize performance comparisons.
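The paper's implementation is in Matlab with Jacket; as an illustrative stand-in, the L1 problem min ||u||_1 subject to Φu = b can be recast as a linear program by splitting u = p - q with p, q ≥ 0 and minimizing sum(p + q). A minimal sketch using SciPy (all names and parameter choices here are ours, not the paper's):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Phi, b):
    """Solve min ||u||_1 s.t. Phi @ u = b via the LP split u = p - q, p, q >= 0."""
    M, N = Phi.shape
    c = np.ones(2 * N)                 # objective: sum(p) + sum(q) = ||u||_1
    A_eq = np.hstack([Phi, -Phi])      # equality constraint: Phi @ (p - q) = b
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
    p, q = res.x[:N], res.x[N:]
    return p - q

# Demo: recover a K-sparse signal from M Gaussian measurements (M >= C*K*log(N/K)).
rng = np.random.default_rng(0)
N, M, K = 64, 40, 4
x = np.zeros(N)
x[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
u = basis_pursuit(Phi, Phi @ x)
print(np.max(np.abs(u - x)))  # near zero: exact recovery up to solver tolerance
```

With these dimensions the Candes-Tao condition is comfortably satisfied, so the LP recovers the sparse signal exactly (up to solver tolerance) with overwhelming probability.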
112 http://sites.google.com/site/ijcsis/ ISSN 1947-5500

II. BACKGROUND

A. Jacket Overview

In this section we discuss Jacket v1.3, which turns graphics processors into engines for general purpose computing. Over the past few years, specialized coprocessors, from floating point hardware to field programmable gate arrays, have enjoyed a widening performance gap with traditional x86-based processors. Of these, graphics processing units (GPUs) have advanced at an astonishing rate, currently capable of delivering over 1 TFLOPS of single precision performance and over 300 GFLOPS of double precision while executing up to 240 simultaneous threads in one low-cost package. As such, GPUs have gained significant popularity as powerful tools for high performance computing (HPC), achieving 20-100 times the speed of their x86 counterparts in applications such as physics simulation, computer vision, options pricing, sorting, and search. As previous research shows, compressed sensing studies based on GPUs provide fast implementations; however, only a small number of these GPU-based studies concentrate on compressed sensing. Since the GPU we use (NVIDIA 8400M GS) is the most basic model, highly portable and easily available in present-day laptops and desktops, our implementation can be applied directly. However, synchronizing host and device with a suitable parallel implementation, which we carried out, is the most challenging part. The Basis Pursuit algorithm cannot be parallelized straightforwardly because of its distribution step, so our solution provides a balanced, data-distributed parallelization framework for Basis Pursuit on CUDA without compromising numerical precision. We break the process into blocks of threads and manage those threads on the GPU's thread-management hardware with the help of the environment and set of libraries provided by CUDA.

Jacket connects Matlab to the GPU. Matlab is a technical computing language that integrates computation, visualization and programming in an easy-to-use environment, and it has found wide popularity in both industry and academia. It is used across the breadth of technical computing applications, including mathematical computation, algorithm development, data analysis, data visualization and application development. With the GPU as a backend computation engine, Jacket brings together the best of three important computational worlds: computational speed, visualization, and the user friendliness of M programming. Jacket enables developers to write and run code on the GPU in the native M language used in Matlab. Jacket accomplishes this by automatically wrapping the M language into a GPU-compatible form: by simply casting input data to Jacket's GPU data structure, Matlab functions are transformed into GPU functions. Jacket also preserves the interpretive nature of the M language by providing real-time, transparent access to the GPU compiler.

B. Integration with Matlab

Once Jacket is installed, it is transparently integrated with Matlab's user interface, and the user can start working interactively through the Matlab desktop and command window as well as write M-functions using the Matlab editor and debugger. All Jacket data is visible in the Matlab workspace, along with any other Matlab matrices.

C. GPU Data Types

Jacket provides GPU counterparts to MATLAB's CPU data types, such as real and complex double, single, uint32, int32, logical, etc. Any variable residing in host (CPU) memory can be cast to Jacket's GPU data types. Jacket's memory management system allocates and manages memory for these variables on the GPU automatically, behind the scenes.
Any functions called on GPU data will execute on the GPU automatically, without any extra programming.

D. GPU Functions

Jacket provides the largest available set of GPU functions, ranging from functions like sum, sine, cosine, and complex arithmetic to more sophisticated operations like matrix inverse, singular value decomposition, Bessel functions, and Fast Fourier Transforms. The supported set of functions continues to grow with every release of Jacket (see the Function Reference Guide). The Jacket runtime provides automated memory management, on-the-fly compilation, and execution optimizations for Jacket-enabled code. Jacket's Graphics Toolbox enables a merger of GPU visualization with computation: with Jacket, a simple graphics command can be added at the end of a simulation loop to visualize data as it is being computed while maintaining performance. The Developer SDK makes integration of custom CUDA code into Jacket's runtime very easy; with a few simple SDK functions, custom CUDA code can benefit from the optimized Jacket platform. When Jacket applications have completed the development, test, and optimization stages and are ready for deployment, the Jacket MATLAB Compiler allows users to generate license-free executables for distribution to larger user bases (see the SDK and JMC Wiki pages). Interactive help for any Jacket function is available using Jacket's ghelp function.

III. IMPLEMENTATION

In this section we first introduce the general scheme of the Basis Pursuit algorithm. Then we introduce our GPU implementation environment, first discussing why GPUs are a good fit for medical imaging applications and then presenting NVIDIA's CUDA platform and the GeForce 8400M architecture. Next, we describe the CPU implementation environment. This is followed by a description of the test data used in the experiments. Finally, we list the CUDA kernels used in our GPU implementation.

A. Basis Pursuit Compression Algorithm

This algorithm takes an image as input and reads it as a matrix. It decomposes the image into blocks, and the blocks into columns. Each column is then processed and compressed; each compressed column is reconstructed into a compressed block, and each compressed block is then reconstructed and sampled to obtain the final compressed image. The algorithm uses six functions: bp_basis, bp_decompose, bp_construct, bp_block_decompose, bp_block_construct, and imagproc. These functions are collectively used to decompose the image into blocks, compress the image, and reconstruct it by sampling it again. The general scheme of the algorithm is shown below.

/* Pseudo code of the Basis Pursuit algorithm */
1.  Load a real-time RGB image.
2.  Assign it to a double precision matrix.
3.  repeat
      for each row of the block
4.      Using CUDA threads, decompose columnwise by the basis pursuit algorithm.
        Find the decomposed matrix.
      end for
5.  until all three colour blocks are decomposed.
6.  repeat
      for each row of the block
7.      In parallel, using CUDA threads, reconstruct columnwise by the basis pursuit algorithm.
        Consider the complex matrix as well.
        Find the reconstructed matrix.
      end for
8.  until all three colour blocks are reconstructed.
9.  Combine all three colour blocks into one matrix.
10. Convert the double precision compressed image data to unsigned integer values.
11. Scale the image data.

B. GPU Implementation Environment

We implemented a GPU version of Basis Pursuit (BP) with NVIDIA's GPU programming environment, CUDA v0.9. The era of single-threaded processor performance increases has come to an end.
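The block pipeline of Section III-A can be sketched as a NumPy stand-in (illustrative only, not the paper's Matlab/Jacket code; the per-column basis pursuit solve is replaced by a simple placeholder threshold, and the block size is our choice, so the sketch stays self-contained and runnable):

```python
import numpy as np

BLOCK = 8  # block side length (our illustrative choice; the paper does not state it)

def process_column(col):
    """Placeholder for the per-column basis pursuit compress/reconstruct step."""
    out = col.copy()
    out[np.abs(out) < 1e-3] = 0.0  # stand-in: zero out tiny coefficients
    return out

def compress_channel(ch):
    """Steps 3-8: decompose a channel into blocks, process each block columnwise."""
    h, w = ch.shape
    out = np.empty_like(ch)
    for i in range(0, h, BLOCK):
        for j in range(0, w, BLOCK):
            block = ch[i:i + BLOCK, j:j + BLOCK]
            # each block is handled column by column (steps 4 and 7)
            out[i:i + BLOCK, j:j + BLOCK] = np.column_stack(
                [process_column(block[:, k]) for k in range(block.shape[1])]
            )
    return out

def basis_pursuit_pipeline(rgb_uint8):
    img = rgb_uint8.astype(np.float64)                             # steps 1-2
    channels = [compress_channel(img[:, :, c]) for c in range(3)]  # steps 3-8
    combined = np.stack(channels, axis=2)                          # step 9
    combined = np.clip(combined, 0, 255)                           # step 11: scale/clamp
    return combined.astype(np.uint8)                               # step 10

demo = (np.arange(16 * 16 * 3) % 256).reshape(16, 16, 3).astype(np.uint8)
result = basis_pursuit_pipeline(demo)
print(result.shape, result.dtype)  # (16, 16, 3) uint8
```

In the paper's GPU version, the inner per-column work is what gets mapped onto CUDA threads, one column per thread.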
Programs will only increase in performance if they exploit parallelism. However, there are many different kinds of parallelism: multicore CPUs provide task-level parallelism, while GPUs provide data-level parallelism. Depending on the application area, the preferred type of parallelism changes, so GPUs are not a good fit for all problems. Medical imaging applications, however, are very suitable for implementation on GPU architecture, because they intrinsically have data-level parallelism with high compute requirements, and GPUs provide the best cost-per-performance parallel architecture for implementing such algorithms. In addition, most medical imaging applications (e.g. semi-automatic segmentation) require, or benefit from, visual interaction, which GPUs naturally provide. Hence, using the GPU for non-graphics, highly parallel applications such as medical imaging has become much easier than before.

Before CUDA, general purpose computations had to be mapped to the graphics API, which is cumbersome for non-graphics tasks such as medical applications. NVIDIA therefore introduced CUDA to perform data-parallel computations on the GPU without mapping them onto the graphics-centric nature of previous environments, hiding GPU-specific details and allowing the programmer to think in terms of memory and math operations as in CPU programs, instead of the primitives, fragments, and textures that are specific to graphics programs. CUDA is available for the NVIDIA GeForce 8400 (G80) series and beyond. The GeForce 8400 GS model has 256 MB of DDR2 video memory with a 64-bit bus; the memory bandwidth of the GeForce 8400 GTX is 80+ GB/s. To get the best performance from the G80 architecture, we have to keep 128 processors occupied and hide memory latency. In order to achieve this goal, CUDA runs hundreds or thousands of fine, lightweight threads in parallel.

In CUDA, programs are expressed as kernels. Kernels follow a Single Program Multiple Data (SPMD) programming model, essentially a Single Instruction Multiple Data (SIMD) model that allows limited divergence in execution. A part of the application that is executed many times, but independently on different elements of a dataset, can be isolated into a kernel that is executed on the GPU as many different threads. Kernels run on a grid, which is an array of blocks; each block is an array of threads. Blocks are mapped to multiprocessors of the G80 architecture, and each thread is mapped to a single processor. Threads within a block can share memory on a multiprocessor, but two threads from two different blocks cannot cooperate. The GPU hardware switches threads on multiprocessors to keep processors busy and hide memory latency; thus, thousands of threads can be in flight at the same time, and CUDA kernels are executed on all elements of the dataset in parallel. We note that in our implementation, increasing the dataset size does not affect shared memory usage: to deal with larger datasets, we only increase the number of blocks and keep the shared memory allocation per thread, as well as the number of threads per block, the same.

C. CPU Implementation Environment

The CPU version of basis pursuit is implemented in Matlab with integration of Jacket v1.3. The computer used for the CPU implementation has an Intel Core 2 Duo T8100 @ 2.1 GHz with 3.0 GB of main memory, running Windows XP (SP2). The CPU implementation was executed on this machine in both single-threading and multi-threading mode; OpenMP is used to implement the multi-threading part.

D. Test Data of (BP) Compressed Results

Our test data consists of six MRI images, used to measure performance on the CPU and GPU with the help of the profiler, as shown in Figures 2 and 3 below, which show the profiler results for the CPU-based and GPU-based implementations. It is clear from the figures that significant speedup is obtained for Basis Pursuit on the GPU in comparison to the CPU implementation.
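The scaling rule described above (grow the grid, not the block) can be illustrated with a small Python stand-in for the launch-configuration arithmetic (illustrative only; the constants `THREADS_PER_BLOCK` and `SHARED_MEM_PER_THREAD` are our assumptions, not values from the paper):

```python
import math

THREADS_PER_BLOCK = 256       # fixed per-block thread count (assumed)
SHARED_MEM_PER_THREAD = 16    # bytes of shared memory per thread (assumed)

def launch_config(n_elements):
    """Return (blocks, threads_per_block, shared_mem_per_block) for a dataset.

    Mirrors the paper's scaling rule: a bigger dataset only increases the
    number of blocks; threads per block and shared memory per block stay fixed.
    """
    blocks = math.ceil(n_elements / THREADS_PER_BLOCK)
    shared_per_block = THREADS_PER_BLOCK * SHARED_MEM_PER_THREAD
    return blocks, THREADS_PER_BLOCK, shared_per_block

small = launch_config(256 * 256)   # one 256x256 image plane
large = launch_config(512 * 512)   # a 4x larger plane
print(small, large)                # only the block count differs
```

Because only the block count changes, per-block shared memory never grows with the dataset, which is exactly why larger images fit the same kernel.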
(a) MRI1 (b) MRI2 (c) MRI3 (d) MRI4 (e) MRI5 (f) MRI6
Figure 1: Test data images used to measure performance

Figure 2: Profile on CPU
Figure 3: Profile on GPU

5. EXPERIMENT AND RESULTS

In this section, we first compare the runtimes of our GPU and CPU implementations for datasets of different sizes and present our speedup. Then we show visual results by providing slices from one of the datasets. Next, we provide the breakdown of the GPU implementation runtime into CUDA kernels and present GFLOP (giga floating point operations = 10^9 FLOP) and GiB (gibibytes = 2^30 bytes) instruction and data throughput information. This is followed by a discussion of the factors that limit our performance. Finally, we compare our implementation with Sharp et al.'s implementation and highlight our improvements. We measure the performance of the images on the CPU and GPU with the help of the profiler; Figures 2 and 3 show the profiler results for the CPU-based and GPU-based implementations. It is clear from the figures that significant speedup is obtained: we achieve around 8X speedup over a single-threaded CPU implementation. The speedup is calculated using the following formula:

Speedup = (time taken to compress the image on CPU) / (time taken to compress the image on GPU)

Table 1. Performance of the GPU implementation with respect to the CPU implementation

S. No. | Image | Size      | Time to run program on CPU (sec) | Time to run BP_DECOMPOSE on CPU (sec) | Time to run BP_DECOMPOSE on GPU (sec) | Depth
1      | MRI1  | 256 x 256 | 118.52 | 105.2   | 14.06 | 512
2      | MRI2  | 256 x 256 | 136.47 | 123.49  | 20.31 | 1024
3      | MRI3  | 256 x 256 | 139.56 | 125.159 | 18.19 | 1024
4      | MRI4  | 256 x 256 | 98.87  | 84.25   | 11.90 | 512
5      | MRI5  | 256 x 256 | 120.34 | 106.1   | 16.31 | 512
6      | MRI6  | 256 x 256 | 138.47 | 124.50  | 21.02 | 1024

Figure 4: Comparative graphical results on CPU and GPU
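Applying the speedup formula to the BP_DECOMPOSE timings in Table 1 gives per-image speedups for the decompose stage alone (a quick check using the table's numbers):

```python
# CPU and GPU timings (seconds) for BP_DECOMPOSE, taken from Table 1 (MRI1..MRI6).
cpu = [105.2, 123.49, 125.159, 84.25, 106.1, 124.50]
gpu = [14.06, 20.31, 18.19, 11.90, 16.31, 21.02]

# Speedup = CPU time / GPU time, per image.
speedups = [c / g for c, g in zip(cpu, gpu)]
print([round(s, 2) for s in speedups])  # -> [7.48, 6.08, 6.88, 7.08, 6.51, 5.92]
```

The decompose stage alone runs roughly 5.9x-7.5x faster on the GPU; the overall "around 8X" figure is consistent with comparing the total CPU program time against the GPU stage (e.g. 118.52 / 14.06 ≈ 8.4 for MRI1).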
6. CONCLUSION

The work presented in this paper addresses one of the major problems of clinical application of compressed sensing in bio-medical imaging, namely its high computational cost, by using the currently available Jacket software and an NVIDIA GeForce 8400M GS GPU with 256 MB of DDR2 video memory and a 64-bit bus. We reduced the execution time from 50 seconds to 6 seconds for several bio-medical images, obtaining a system speedup of about 8x.

REFERENCES

[1] R. Chartrand and W. Yin, "Iteratively reweighted algorithms for compressive sensing," in IEEE ICASSP, 2008.
[2] D. L. Donoho, "Compressed sensing," IEEE Trans. on Information Theory, vol. 52, no. 4, pp. 1289-1306, April 2006.
[3] http://www.dsp.ece.rice.edu/cs
[4] E. J. Candes, J. Romberg, and T. Tao, "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Inf. Theory, vol. 52, 2006.
[5] M. Lustig, J. M. Santos, J.-H. Lee, D. L. Donoho, and J. M. Pauly, "Application of compressed sensing for rapid MR imaging," in SPARS, Rennes, France, 2005.
[6] D. Takhar, J. N. Laska, M. B. Wakin, M. F. Duarte, D. Baron, S. Sarvotham, K. F. Kelly, and R. G. Baraniuk, "A new compressive imaging camera architecture using optical-domain compression," in Computational Imaging IV - Proceedings of SPIE-IS&T Electronic Imaging, vol. 6065, 2006.
[7] M. F. Duarte, S. Sarvotham, D. Baron, M. B. Wakin, and R. G. Baraniuk, "Distributed compressed sensing of jointly sparse signals," in 39th Asilomar Conference on Signals, Systems and Computers, pp. 1537-1541, 2005.
[8] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20, pp. 33-61, 1998.
[9] E. Candes and T. Tao, "Near optimal signal recovery from random projections: universal encoding strategies," IEEE Trans. Inf. Theory, vol. 52, pp. 5406-5425, 2006.
[10] D. L. Donoho and J. Tanner, "Thresholds for the recovery of sparse solutions via L1 minimization," in 40th Annual Conference on Information Sciences and Systems, pp. 202-206, 2006.
