Rapid Fluoro-CT Reconstruction for a Mobile C-Arm on Graphics Hardware
Xinwei Xue, School of Computing, University of Utah
Arvi Cheryauka, David Tubbs, GE-OEC Healthcare Technologies, Salt Lake City, UT 84116
Introduction
Mobile CT imaging in interventional and minimally invasive surgery requires high-performance computing solutions that meet operating-room demands, healthcare business requirements, and the constraints of a mobile C-arm system. Fast processing and display of the reconstructed image at any time before, during, or after the procedure is critical. To achieve higher spatial, contrast, or temporal resolution, the dimensions of the input and output datasets used by analytical reconstruction methods are increasing [1]. On the other hand, the highly parallel nature of the Radon transform and of CT algorithms allows embedded computing solutions built on a parallel processing architecture to realize a significant gain in computational performance at comparable hardware and program coding/testing expense [2]. In this paper, we present preliminary results on hardware acceleration of parallel-beam 2D and 3D reconstruction on commodity graphics processing units (GPUs).
GPU/CPU Architecture
[Figure: graphics pipeline diagram. Labels: Application; API (OpenGL or Direct3D); API Commands; CPU-GPU Boundary (AGP/PCIe)]
Results
The GPUs achieve the same accuracy as the CPU while running an order of magnitude faster. For 16-bit input, the interpolation can be done in hardware.
[Figure: GPU pipeline diagram labels: GPU Commands; GPU Front End; Vertex Index Stream; Primitive Assembly; Assembled Primitives; Rasterization and Interpolation; Transformed Vertices; Transformed Fragments; Raster Operations; Pixel Updates; Frame Buffer]
[Figure panel labels: CPU: 32-bit; GPU: 32-bit float]
Figure 2: 2D Phantom, Recon. Image by CPU, GPU
Figure 4: GPU 3D Parallel Beam Reconstruction
Since the GPU is specialized graphics hardware, the data must be loaded into GPU texture memory, and the computations are driven by OpenGL commands. The input data is represented in texture (stream) format, the computation is done in vertex and fragment shaders, and the output is written to the framebuffer or to an off-screen buffer.
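The texture-in, shader, framebuffer-out model described above can be mimicked on the CPU. The following is a minimal sketch, not GPU code: the names run_fragment_pass and scale_and_add are hypothetical, textures are plain nested lists, and the "fragment program" is an ordinary function evaluated once per output pixel.

```python
def run_fragment_pass(width, height, fragment_program, textures):
    """Evaluate fragment_program once per output pixel; return the 'framebuffer'."""
    return [[fragment_program(x, y, textures) for x in range(width)]
            for y in range(height)]


def scale_and_add(x, y, textures):
    """Example fragment program: combine two input textures at the same pixel."""
    a, b = textures
    return 2.0 * a[y][x] + b[y][x]


# Two 2x2 input "textures" and one pass that writes a 2x2 output buffer.
tex_a = [[1.0, 2.0], [3.0, 4.0]]
tex_b = [[10.0, 20.0], [30.0, 40.0]]
framebuffer = run_fragment_pass(2, 2, scale_and_add, (tex_a, tex_b))
```

Each output pixel depends only on read-only inputs, which is what lets the fragment processors evaluate many pixels in parallel.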
[Figure: GPU pipeline diagram labels: Pretransformed Vertices; Programmable Vertex Processor; Programmable Fragment Processor; Pretransformed Fragments]
Table 1-3: 2D Recon Speedup

Table 1 (ATI X700 Pro)
             Matlab   C      GPU (32-bit in)   GPU (16-bit in)
Time (ms)    1450     507    45.3              36
Speedup      N/A      1      11                14

Table 2 (NV 6800 Ultra)
             Matlab   C      GPU (32-bit in)   GPU (16-bit in)
Time (ms)    1450     507    36                21
Speedup      N/A      1      14                24

Table 3 (NV 7800 GTX)
             Matlab   C      GPU (32-bit in)   GPU (16-bit in)
Time (ms)    1450     507    23.8              11.5
Speedup      N/A      1      21                44

Figure 3: Profile comparison (row 128) for the 2D images in Figure 2. Left: CPU. Right: GPU. Blue: phantom. Red: reconstructed image.

Table 4-5: 3D Recon Speedup

Table 4 (NV 6800 Ultra)
             Matlab   C      GPU (32-bit in)   GPU (16-bit in)
Time (s)     84       N/A    2.8               1.8
Speedup      1        N/A    30                46.7

Table 5 (NV 7800 GTX)
             Matlab   C      GPU (32-bit in)   GPU (16-bit in)
Time (s)     84       N/A    1.84              1.18
Speedup      1        N/A    46.7              71
CT Reconstruction: A Simple 2D Example
Filtered backprojection (FBP) is the most popular reconstruction technique. It consists of two steps:
1. Filtered Projection
GPU Implementation
For low precision (8-bit), we can use a texture stretching technique:
1. Stretch the 1D projection to 2D
2. Rotate by the projection angle
3. Accumulate
Xu and Mueller [5] use a mixed-precision (8-bit and 16-bit) approach.
For high precision (32-bit float), we use a pbuffer, the Render-To-Texture (RTT) technique, and a fragment shader to compute the backprojection loop. Since a pbuffer cannot be read from and written to in the same render pass, a ping-pong pbuffer technique is used to reduce context-switch overhead.
LOOP
1. Bind Source Buffer as Lookup Texture
2. Set Destination Buffer as Render Target
3. Render Scene (lookup and accumulate)
4. Release Texture
5. Switch Source/Destination
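The loop above can be illustrated with a CPU-side sketch. This is not the Cg/OpenGL implementation: the names render_pass and backproject_pingpong are hypothetical, two nested lists stand in for the pbuffers, and the nearest-neighbour lookup with clamping at the detector edges is an assumption made to keep the example short.

```python
import math


def render_pass(source, dest, projection, theta):
    """One pass: read `source` (never written here), add one view, write `dest`."""
    size = len(source)
    c = (size - 1) / 2.0                  # image centre in pixel units
    half = (len(projection) - 1) / 2.0    # detector centre in bin units
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    for row in range(size):
        for col in range(size):
            t = (col - c) * cos_t + (row - c) * sin_t + half
            i = min(max(math.floor(t + 0.5), 0), len(projection) - 1)
            dest[row][col] = source[row][col] + projection[i]


def backproject_pingpong(projections, thetas, size):
    """Accumulate all views by alternating two buffers between read and write roles."""
    buffers = [[[0.0] * size for _ in range(size)] for _ in range(2)]
    src, dst = 0, 1
    for projection, theta in zip(projections, thetas):
        render_pass(buffers[src], buffers[dst], projection, theta)  # steps 1-4
        src, dst = dst, src                                         # step 5
    return buffers[src]  # after the last swap, src holds the newest result
```

Because every pass rewrites all pixels of the destination as "source plus one view", the two buffers never need clearing between passes, and no buffer is read and written in the same pass.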
Conclusions and Future Work
We have shown that the GPU can achieve the same accuracy as the CPU while being an order of magnitude or more faster for 2D and 3D parallel-beam geometry. The GPU acceleration can be readily extended to cone-beam geometry by adding the depth weighting of the Feldkamp algorithm (FDK), which can be precomputed and applied as a texture lookup. With 3D backprojection computed in seconds, iterative reconstruction, which achieves better accuracy, can be carried out in a reasonable amount of time, making it clinically practical and useful. Metal artifact reduction (MAR) will benefit directly from fast iterative reconstruction. These extensions are our future work. With GPU acceleration, many complex, time-consuming 2D or 3D image processing techniques may become practical in medical imaging.
$$Q_\theta(n\tau) = \tau \sum_{k=0}^{N-1} h(n\tau - k\tau)\, P_\theta(k\tau), \quad n = 0, 1, 2, \ldots, N-1$$
2. Backprojection
$$f(x, y) = \frac{\pi}{N} \sum_{i=1}^{N} Q_{\theta_i}(x\cos\theta_i + y\sin\theta_i)$$
The backprojection loop is the most computationally intensive part, so it is our main target for acceleration.
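The two FBP steps can be written as a small CPU reference. This is a minimal sketch under stated assumptions: the filter h is taken to be the band-limited ramp (Ram-Lak) kernel, which the text does not specify; the pixel spacing equals the detector spacing tau; lookups use linear interpolation; and all function names are hypothetical. In the backprojection sum, N is the number of views.

```python
import math


def ramp_kernel(n, tau):
    """Sample h(n*tau) of the band-limited ramp filter (assumed choice of h)."""
    if n == 0:
        return 1.0 / (4.0 * tau * tau)
    if n % 2 == 0:
        return 0.0
    return -1.0 / (math.pi * math.pi * n * n * tau * tau)


def filter_projection(p, tau):
    """Step 1: Q(n*tau) = tau * sum_k h(n*tau - k*tau) * P(k*tau)."""
    n_det = len(p)
    return [tau * sum(ramp_kernel(n - k, tau) * p[k] for k in range(n_det))
            for n in range(n_det)]


def sample(q, t, tau):
    """Linearly interpolate q at detector coordinate t (detector centred at 0)."""
    pos = t / tau + (len(q) - 1) / 2.0
    i = math.floor(pos)
    if i < 0 or i + 1 >= len(q):
        return 0.0  # outside the detector: contribute nothing
    w = pos - i
    return (1.0 - w) * q[i] + w * q[i + 1]


def backproject(filtered, thetas, size, tau):
    """Step 2: f(x, y) = (pi / N) * sum_i Q_i(x cos(theta_i) + y sin(theta_i))."""
    c = (size - 1) / 2.0
    image = [[0.0] * size for _ in range(size)]
    for q, theta in zip(filtered, thetas):
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for row in range(size):
            y = (row - c) * tau
            for col in range(size):
                x = (col - c) * tau
                image[row][col] += sample(q, x * cos_t + y * sin_t, tau)
    scale = math.pi / len(thetas)
    return [[value * scale for value in line] for line in image]
```

The triple loop in backproject touches every pixel once per view, which is why this step dominates the run time and maps naturally onto one fragment-program invocation per pixel.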
Why Use GPUs
The GPU is powerful. A modern GPU can compute with up to 32-bit floating-point precision and has memory bandwidth of up to 38.4 GB/second. It has up to 24 pixel pipelines, which makes it essentially a SIMD (Single Instruction Multiple Data) parallel processor.
The GPU is flexible and programmable. Its programmable vertex and fragment processors make general-purpose scientific computing (GPGPU [3][4]) possible, beyond graphics applications.
The GPU has a rapid development cycle. GPU performance is doubling every 6 months.
The GPU is inexpensive. Driven by the multi-billion-dollar video game industry, the latest card (Nvidia GeForce 7800 GTX) costs less than $500.
The GPU is scalable. Nvidia SLI and ATI CrossFire technologies make multi-GPU solutions possible.
Experiment Settings
The PC we use has a 3.4 GHz Pentium 4 processor, 2 GB of memory, and a PCI Express bus. We tested three PCI Express graphics cards:
ATI X700 Pro (8 pixel pipelines, 24-bit float precision)
Nvidia 6800 Ultra (16 pixel pipelines, 32-bit float precision)
Nvidia 7800 GTX (24 pixel pipelines, 32-bit float precision)
Among the shading languages (Cg, Sh, Brook, GLSL), we chose Cg for fragment programs. OpenGL 1.5 and OpenGL 2.0 are supported on the above GPUs, respectively. We use synthetic Shepp-Logan phantoms as input. For simplicity, we use parallel-beam geometry. The system has 165 views within 180 degrees. We reconstruct a 256x256 image from 1D projections of length 512, and a 256x256x256 volume from 2D projections of size 512x512.
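As a rough sense of scale, assume a pixel-driven backprojection that performs one interpolated lookup per output element per view (an assumption about the loop structure, not a measured quantity). With 165 views, the lookup counts L for the 2D and 3D cases would then be:

$$L_{2D} = 256^2 \times 165 = 10{,}813{,}440, \qquad L_{3D} = 256^3 \times 165 = 2{,}768{,}240{,}640$$

Under that assumption the 3D problem is 256 times larger than the 2D one.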
References
[1] Clackdoyle, R., 2005, Fully 3D reconstruction theory in perspective, Fully 3D Recon. in Rad. Nucl. Med., Salt Lake City, 2005, 64-69.
[2] Owens, J., Luebke, D., Govindaraju, N., Harris, M., Kruger, J., Lefohn, A., Purcell, T., 2005, A Survey of General-Purpose Computation on Graphics Hardware, in Eurographics 2005, State-of-the-Art Reports, August 2005, 31-51.
[3] http://www.gpgpu.org
[4] Brian Cabral, Nancy Cam, and Jim Foran, "Accelerated volume rendering and tomographic reconstruction using texture mapping hardware," Symp. on Volume Visualization, pp. 91-98, 1994.
[5] Fang Xu and Klaus Mueller, "Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware," to appear in IEEE Transactions on Nuclear Science, 2005.