      Improvement of CT Slice Image Reconstruction Speed
                  Using SIMD Technology


       Xingxing Wu            Yi Zhang

      Instructor: Prof. Yu Hen Hu


  Department of Electrical & Computer Engineering
         University of Wisconsin, Madison
                    Motivation
   CT slice image reconstruction is a key step that affects both
    reconstructed image quality and scanning speed
   CT slice image reconstruction is very time-consuming
   Traditional methods for speedup:
         Specially designed hardware
         Parallel algorithms running on supercomputers
   A new method to explore: SIMD implementation
                Parallel-Beam FBP Image
                Reconstruction Algorithm
   The algorithm consists of three parts:
            data rebinning
            data filtering
            back-projection
      Parallel-Beam FBP Image
      Reconstruction Algorithm
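   Projection:
      \[ P_\theta(t) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,\delta(x\cos\theta + y\sin\theta - t)\,dx\,dy \]

   Data Rebinning:
      \[ R_\beta(\gamma) = P_{\beta+\gamma}(D\sin\gamma) \]

   Data Filtering:
      \[ Q_\theta(t) = \int_{-\infty}^{\infty} \left[ j2\pi w\,S_\theta(w) \right] \left[ -\frac{j}{2\pi}\,\mathrm{sgn}(w) \right] e^{j2\pi wt}\,dw \]
      \[ Q_\theta(t) = F^{-1}\left\{ j2\pi w\,S_\theta(w) \right\} * F^{-1}\left\{ -\frac{j}{2\pi}\,\mathrm{sgn}(w) \right\} \]

   Data Backprojection:
      \[ f(x, y) = \int_{0}^{\pi} \left[ \int_{-\infty}^{\infty} S_\theta(w)\,|w|\,e^{j2\pi wt}\,dw \right] d\theta \]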
CT Slice Image Reconstruction
   Is Very Time Consuming




  A whole-head spiral scan generates several GB of projection data
        Function Profiling

   [Bar chart: execution time in seconds (axis 0 to 90) for Data Rebinning,
    Data Filtering, and Data Backprojection]
Can the FBP Algorithm Benefit from SIMD?

    The Algorithm has the following features:
          Small, highly repetitive loops that operate on
           sequential arrays of integers and floating-point values
          Frequent multiplies and accumulates
          Computation-intensive algorithms
          Inherently parallel operations
          Wide dynamic range, hence floating-point based
          Regular memory access patterns
          Data-independent control flow
  Analysis of Data Dynamic
Range and Quantization Errors
   Wide Dynamic Range




   Relative Error Metric
      \[ RE = \frac{\sum_{i=1}^{N} \left[ (x_i - \bar{x}) - (y_i^{DFP} - \bar{y}^{DFP}) \right]^2}{\sum_{i=1}^{N} \left( y_i^{DFP} - \bar{y}^{DFP} \right)^2} \]

   32-Bit Single-Precision Floating Point and
    SSE2
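
The relative error metric on this slide can be evaluated with a few lines of C. The sketch below is illustrative only: it assumes that x holds the 32-bit single-precision results, that "DFP" refers to a double-precision floating-point reference (the slides do not spell this out), and that the barred terms are means. The function name relative_error is hypothetical.

```c
#include <stddef.h>

/* Relative error between single-precision results x[] and a
 * double-precision reference y[] (assumed reading of "DFP"):
 *   RE = sum(((x_i - mean_x) - (y_i - mean_y))^2) / sum((y_i - mean_y)^2)
 * Returns 0.0 when the reference has no variation (zero denominator). */
double relative_error(const float *x, const double *y, size_t n)
{
    double mean_x = 0.0, mean_y = 0.0;
    for (size_t i = 0; i < n; ++i) {
        mean_x += x[i];
        mean_y += y[i];
    }
    if (n > 0) {
        mean_x /= (double)n;
        mean_y /= (double)n;
    }

    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double dx = (double)x[i] - mean_x;   /* centered test value      */
        double dy = y[i] - mean_y;           /* centered reference value */
        num += (dx - dy) * (dx - dy);
        den += dy * dy;
    }
    return den > 0.0 ? num / den : 0.0;
}
```

Accumulating in double keeps the metric itself from adding rounding error to the quantity it is meant to measure.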
Updated Algorithm to Fit SIMD

   Update the algorithm to eliminate some
    conditional branches

   Reduce the on-the-fly calculations which are
    not suitable for the SIMD implementation
      Parallel Implementation of
        Data Filtering In SIMD


A0      A1    A2     A3      A4    A5     A6    A7     Rebinned Data

*        *     *      *       *     *      *     *

B0      B1    B2     B3      B4    B5     B6    B7       Weight




     A0*B0+A4*B4 A1*B1+A5*B5 A2*B2+A6*B6 A3*B3+A7*B7   Filtered Data
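
The multiply-and-accumulate pattern in this diagram can be sketched with SSE intrinsics. This is a minimal illustration, not the authors' code: it assumes each filtered sample is a dot product of rebinned data with filter weights, keeps four partial sums in one register as in the diagram, and reduces them at the end. The names filter_dot_scalar, filter_dot_simd and HAVE_SSE2 are hypothetical, and a scalar path is used when SSE2 is not available.

```c
#include <stddef.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1   /* hypothetical feature macro for this sketch */
#endif

/* Scalar reference: dot product of rebinned data a[] and weights b[]. */
float filter_dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* SIMD sketch: four products per step, accumulated lane by lane
 * (lane 0 holds A0*B0 + A4*B4 + ..., lane 1 holds A1*B1 + A5*B5 + ...). */
float filter_dot_simd(const float *a, const float *b, size_t n)
{
    size_t i = 0;
    float sum = 0.0f;
#ifdef HAVE_SSE2
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned loads keep the */
        __m128 vb = _mm_loadu_ps(b + i);   /* sketch free of alignment */
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif
    for (; i < n; ++i)                     /* leftover elements */
        sum += a[i] * b[i];
    return sum;
}
```

Because the SIMD version adds the products in a different order than the scalar loop, the two results can differ in the last bits for general floating-point data.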
         Parallel Implementation of
          Backprojection in SIMD
  Index Calculation                  A0 A1 A2 A3                    Index

                      -0.5 -0.5 -0.5 -0.5       +0.5 +0.5 +0.5 +0.5

  Floor (index)       B0 B1 B2 B3               C0 C1 C2 C3         Ceil (index)

                                     (fetch data)
  Filtered Data       D0 D1 D2 D3               E0 E1 E2 E3

  Weight              F0 F1 F2 F3               G0 G1 G2 G3

  Reconstructed Image                H0 H1 H2 H3
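
The data flow in this diagram (compute four indices, fetch the two neighbouring filtered samples, weight them, accumulate into the image) can be sketched as follows. The sketch rests on several assumptions that are not in the slides: indices are non-negative and at least one sample below the end of the filtered array, the floor is taken by truncation and the upper neighbour is floor + 1 (the diagram instead derives floor and ceil from index - 0.5 and index + 0.5), and the neighbours are fetched with scalar loads because SSE2 has no gather instruction. All function names and the HAVE_SSE2 macro are hypothetical.

```c
#include <stddef.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1   /* hypothetical feature macro for this sketch */
#endif

/* Scalar reference: add one view's linearly interpolated contribution.
 * Requires 0 <= index[i] and (int)index[i] + 1 to be a valid offset in q. */
void backproject_scalar(float *image, const float *index,
                        const float *q, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        int lo = (int)index[i];               /* floor for index >= 0 */
        float w_hi = index[i] - (float)lo;    /* weight of upper sample */
        image[i] += (1.0f - w_hi) * q[lo] + w_hi * q[lo + 1];
    }
}

/* SIMD sketch: four pixels per iteration. */
void backproject_simd(float *image, const float *index,
                      const float *q, size_t n)
{
    size_t i = 0;
#ifdef HAVE_SSE2
    const __m128 one = _mm_set1_ps(1.0f);
    for (; i + 4 <= n; i += 4) {
        __m128 idx = _mm_loadu_ps(index + i);
        __m128i lo = _mm_cvttps_epi32(idx);   /* truncate toward zero */
        __m128 w_hi = _mm_sub_ps(idx, _mm_cvtepi32_ps(lo));
        __m128 w_lo = _mm_sub_ps(one, w_hi);

        int k[4];                             /* lane indices for fetching */
        _mm_storeu_si128((__m128i *)k, lo);
        __m128 d = _mm_set_ps(q[k[3]], q[k[2]], q[k[1]], q[k[0]]);
        __m128 e = _mm_set_ps(q[k[3] + 1], q[k[2] + 1],
                              q[k[1] + 1], q[k[0] + 1]);

        __m128 h = _mm_add_ps(_mm_mul_ps(w_lo, d), _mm_mul_ps(w_hi, e));
        _mm_storeu_ps(image + i, _mm_add_ps(_mm_loadu_ps(image + i), h));
    }
#endif
    backproject_scalar(image + i, index + i, q, n - i);   /* leftovers */
}
```

The index arithmetic and weighting run four lanes wide; only the data fetch stays scalar, which is one reason regular memory access patterns matter for this stage.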
           Optimization of The
            Implementation
   Optimize Memory Access
        Ensure proper alignment to prevent data from being split across
         cache-line boundaries: data alignment, stack alignment, code alignment
        Observe store-forwarding constraints
        Optimize data structure layout and data locality to ensure efficient
         use of the 64-byte cache line and to reduce the frequency of
         memory loads and stores
        Use prefetch and cacheability control instructions appropriately
        Minimize bus latency by segmenting the reads and writes into
         phases
   Replace Branches with Logic Operations
   Optimize Instruction Scheduling
   Optimize the Parallelism
        Loop Unrolling
        Break dependence chains
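
As an illustration of replacing branches with logic operations, the sketch below zeroes every value whose coordinate falls outside a valid range, first with an if/else and then with an SSE comparison mask. The range test is a hypothetical example chosen for this sketch, not a step taken from the slides, and the function names and the HAVE_SSE2 macro are likewise assumptions.

```c
#include <stddef.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1   /* hypothetical feature macro for this sketch */
#endif

/* Branching reference: keep value[i] only when lo <= t[i] < hi. */
void keep_in_range_branch(float *out, const float *value, const float *t,
                          float lo, float hi, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (t[i] >= lo && t[i] < hi)
            out[i] = value[i];
        else
            out[i] = 0.0f;
    }
}

/* Branch-free sketch: each comparison yields all-ones or all-zeros per
 * lane, and a bitwise AND with that mask keeps or clears the value. */
void keep_in_range_mask(float *out, const float *value, const float *t,
                        float lo, float hi, size_t n)
{
    size_t i = 0;
#ifdef HAVE_SSE2
    const __m128 vlo = _mm_set1_ps(lo);
    const __m128 vhi = _mm_set1_ps(hi);
    for (; i + 4 <= n; i += 4) {
        __m128 vt = _mm_loadu_ps(t + i);
        __m128 mask = _mm_and_ps(_mm_cmpge_ps(vt, vlo),
                                 _mm_cmplt_ps(vt, vhi));
        _mm_storeu_ps(out + i, _mm_and_ps(mask, _mm_loadu_ps(value + i)));
    }
#endif
    keep_in_range_branch(out + i, value + i, t + i, lo, hi, n - i);
}
```

The masked version does the same work for every lane regardless of the data, so its control flow no longer depends on the input values.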
            Optimization of The
             Implementation
   Optimize Instruction Selection
         Avoid long-latency instructions
         Avoid instructions that unnecessarily introduce
          dependence-related stalls

   Optimize the Floating-point Performance
         Avoid exceeding the representable range
         Avoid changing the floating-point control/status register
         Enable flush-to-zero (FTZ) and denormals-are-zero (DAZ) modes
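
On x86 compilers that ship the standard intrinsic headers, flush-to-zero and DAZ can be switched on through the MXCSR helper macros, as in the sketch below. It assumes the modes are set once at start-up for the thread that does the reconstruction, which also fits the advice not to change the control/status register repeatedly. The function name enable_ftz_daz and the HAVE_SSE2 macro are hypothetical; on targets without SSE the function does nothing and returns 0.

```c
#if defined(__SSE2__) || defined(_M_X64)
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */
#define HAVE_SSE2 1      /* hypothetical feature macro for this sketch */
#endif

/* Enable FTZ (denormal results become zero) and DAZ (denormal inputs are
 * read as zero) for the calling thread. Returns 1 on success, 0 when the
 * target has no SSE control register to configure. */
int enable_ftz_daz(void)
{
#ifdef HAVE_SSE2
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    return 1;
#else
    return 0;
#endif
}
```

Both modes trade strict IEEE 754 handling of very small values for speed, so the accuracy check against the reference implementation should be repeated with them enabled.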
  Improvement of Performance
   [Bar chart: execution time in seconds (axis 0 to 90) of the C Implementation
    versus the SIMD Implementation for Data Rebinning, Data Filtering, and
    Data Backprojection]



        The reconstructed image pixel values of the C implementation
  and the SIMD implementation differ by less than 0.01

								