Block Parallel in OpenSURF


          Max Lv
         2010/1/15
                  Motivation
• Try to accelerate SURF on a single image
  – Cut the image into blocks
  – And process the blocks in parallel

• Simulate FPGA-style parallelism
  – Measure the overhead of statically partitioning the image into blocks
  – Show that block-level parallelism scales
  – Evaluate the performance impact of the additional
    calculation that blocking requires
                 Implementation

     Image Block 1    Image Block 2    ......    Image Block M

                Statically allocate blocks to each thread

     SURF Thread 1    SURF Thread 2    ......    SURF Thread N

                      Bind each thread to a core

     Core 1           Core 2           ......    Core N

                 •    Block image (16×16)
                 •    SURF parallelized at block level
                 •    Pthread implementation
                 •    Thread affinity set for each core
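
The scheme above (static allocation of blocks to threads, one thread bound to each core) can be sketched roughly as follows. This is only an illustration, not the code behind these slides: the names static_range, run_blocks, mark_block and MAX_THREADS are made up here, the contiguous-range split is just one possible static schedule, and pthread_setaffinity_np is a glibc/Linux extension. A real worker would run SURF on its block instead of marking it.

```c
#define _GNU_SOURCE              /* needed for pthread_setaffinity_np (glibc) */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical per-block worker signature; a real one would run SURF. */
typedef void (*block_fn)(int block_index, void *user);

typedef struct {
    int first, last;             /* half-open range [first, last) of blocks */
    int core;                    /* core this thread tries to bind to */
    block_fn fn;
    void *user;
} worker_arg;

enum { MAX_THREADS = 64 };       /* illustrative upper bound */

/* Static schedule: thread t of n gets a contiguous share of m blocks. */
void static_range(int m, int n, int t, int *first, int *last)
{
    int base = m / n, extra = m % n;
    *first = t * base + (t < extra ? t : extra);
    *last  = *first + base + (t < extra ? 1 : 0);
}

static void *worker_main(void *p)
{
    worker_arg *a = p;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(a->core, &set);
    /* Best effort: if the core does not exist, keep the default affinity. */
    (void)pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    for (int b = a->first; b < a->last; ++b)
        a->fn(b, a->user);
    return NULL;
}

/* Runs fn once per block 0..m-1 on n threads; returns 0 on success. */
int run_blocks(int m, int n, block_fn fn, void *user)
{
    pthread_t tid[MAX_THREADS];
    worker_arg arg[MAX_THREADS];
    int started = 0, rc = 0;
    assert(m >= 0 && n > 0 && n <= MAX_THREADS);
    for (int t = 0; t < n; ++t) {
        static_range(m, n, t, &arg[t].first, &arg[t].last);
        arg[t].core = t;
        arg[t].fn = fn;
        arg[t].user = user;
        if (pthread_create(&tid[t], NULL, worker_main, &arg[t]) != 0) {
            rc = -1;
            break;
        }
        ++started;
    }
    for (int t = 0; t < started; ++t)
        pthread_join(tid[t], NULL);
    return rc;
}

/* Stand-in worker: counts how often each block was visited. */
void mark_block(int block_index, void *user)
{
    ((int *)user)[block_index] += 1;
}
```

For example, run_blocks(256, 8, mark_block, hits) with an int hits[256] array visits every block exactly once across 8 threads. The count 256 is only illustrative; the slides do not say how many blocks M an image yields. Because each block belongs to exactly one thread, the workers need no locking.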
               Overall Speedup – 76i7

    Group                  Variant        Speedup
    Inline                 Base           1
    Inline                 SIMD           1.57
    ImgPar                 ImgPar         3.89
    ImgPar                 ImgPar-SIMD    6.52
    BlockPar (8 threads)   Base           4.99
    BlockPar (8 threads)   SIMD           7.01

           BlockPar achieves a very good speedup, even with a static schedule
    Thread Num Comparison – 76i7

    Configuration                  Speedup
    SIMD BlockPar (2 threads)      3.28
    SIMD BlockPar (4 threads)      6.42
    SIMD BlockPar (8 threads)      7.01
    SIMD BlockPar (16 threads)     4.5



       All results are from builds compiled with icc.
       With Hyper-Threading, 8 threads give the best performance.
       In theory, performance should improve further as the number of CPU cores grows.
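
To read the thread-count figures against the SIMD-only result, the arithmetic below divides each BlockPar speedup by the 1.57 reported for SIMD alone. It assumes that both charts share the same baseline, which the slides suggest (7.01 appears in both) but do not state. The names sample, blockpar_simd, simd_only and relative_speedup are illustrative.

```c
#include <assert.h>
#include <stdio.h>

typedef struct { int threads; double speedup; } sample;

/* Speedups of SIMD BlockPar by thread count, as reported on the slide. */
static const sample blockpar_simd[] = {
    {2, 3.28}, {4, 6.42}, {8, 7.01}, {16, 4.5}
};

/* Speedup of SIMD without BlockPar, from the overall speedup slide. */
static const double simd_only = 1.57;

/* Ratio of two speedups measured against the same (assumed) baseline. */
double relative_speedup(double speedup, double reference)
{
    assert(reference > 0.0);
    return speedup / reference;
}

/* Prints how much each thread count gains over SIMD alone. */
void print_scaling(void)
{
    size_t n = sizeof blockpar_simd / sizeof blockpar_simd[0];
    for (size_t i = 0; i < n; ++i)
        printf("%2d threads: %.2fx over SIMD alone\n",
               blockpar_simd[i].threads,
               relative_speedup(blockpar_simd[i].speedup, simd_only));
}
```

Independently of the baseline assumption, going from 4 to 8 threads raises the reported speedup from 6.42 to 7.01, a gain of about 9%.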
                    Discussion
• Image Blocking
  – Benefits both FPGA and GPU implementations
• Schedule Policy
  – Static scheduling causes no obvious overhead
• Hyper-Threading does not contribute a significant
  speedup
• Additional Calculation
  – BlockPar needs additional calculation (4X in the detection
    stage) so that threads can run independently
  – Not a big problem on GPU or FPGA, or even on
    CPU
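
One common way to let block threads run independently is to give each block a border of surrounding pixels (a halo), so that filters near the block edge need no data from a neighbouring thread. The slides do not say how BlockPar does this or how the 4X figure arises, so the sketch below is only an assumption-laden illustration of where redundant work can come from. The names rect, padded_block and overhead_ratio, and the halo width, are hypothetical.

```c
#include <assert.h>

/* Half-open pixel rectangle: [x0, x1) x [y0, y1). */
typedef struct { int x0, y0, x1, y1; } rect;

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Expands a block by `halo` pixels on every side, clipped to the image. */
rect padded_block(rect block, int halo, int width, int height)
{
    rect r;
    assert(halo >= 0);
    r.x0 = clampi(block.x0 - halo, 0, width);
    r.y0 = clampi(block.y0 - halo, 0, height);
    r.x1 = clampi(block.x1 + halo, 0, width);
    r.y1 = clampi(block.y1 + halo, 0, height);
    return r;
}

long rect_area(rect r)
{
    return (long)(r.x1 - r.x0) * (long)(r.y1 - r.y0);
}

/* Pixels processed with the halo, relative to the block's own pixels. */
double overhead_ratio(rect block, int halo, int width, int height)
{
    rect p = padded_block(block, halo, width, height);
    return (double)rect_area(p) / (double)rect_area(block);
}
```

For an interior block of side b and halo h, the ratio is ((b + 2h) / b)^2, so the redundant work grows quickly as blocks shrink relative to the halo. Blocks on the image border are clipped and cost less.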

				