Super-Sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers?
                 Stephen Craven
                Cameron Patterson
                  Peter Athanas

            Configurable Computing Lab
                   Virginia Tech

         Outline
 • Background
         • Large Integer Multiplication
         • GIMPS
 • Algorithm Comparison
         • Floating-point FFT
         • All-integer FFT
         • Fast Galois Transform
 • Accelerator Design
         • System Design
         • Operation
         • Performance
 • Improvements & Future Work



         Large Integer Multiplication
 •   Complexity
         • Grade School: O(N^2)
         • Fourier Transform: ~O(N log N)
 •   Efficient FFT-Based Multiplication
         • Divide integers into sequences of smaller digits.
           867530924601 → 86, 75, 30, 92, 46, 01
         • Convolution of two sequences equivalent to multiplication.
                               z[n] = Σ_{i+j=n} x[i]·y[j]
         • Element-wise multiplication in the frequency domain ↔ time-domain
           convolution.
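
   A minimal Python sketch of the idea, assuming fixed 4-bit digits and numpy's
   complex FFT (an illustrative helper, not the variable-base transform the
   accelerator uses):

          import numpy as np

          def fft_square(x, digit_bits=4, n=64):
              # Split x into n small digits; the upper digits are zero, so the
              # cyclic convolution computed by the FFT equals the linear one.
              digits = [(x >> (digit_bits * j)) & ((1 << digit_bits) - 1) for j in range(n)]
              X = np.fft.fft(digits)
              conv = np.rint(np.fft.ifft(X * X).real).astype(np.int64)
              # Recombine digits; shifting and adding releases the carries.
              return sum(int(c) << (digit_bits * j) for j, c in enumerate(conv))

          x = 867_530_924_601
          assert fft_square(x) == x * x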




         GIMPS
 • Why multiply big numbers?
         • Great Internet Mersenne Prime Search (GIMPS)
              • Primality testing algorithm for Mersenne numbers (2^q – 1) requires
                squaring of multi-million digit numbers.
              • Mersenne primes are the largest known primes – used in cryptography.
      • Large integer convolution
      • Performance comparison of Pentiums and FPGAs in traditional
        floating-point domains.
 •   Lucas-Lehmer Primality Test
          Mq = 2^q – 1; v = 4;
          for i = 1:q-2,
              v = v^2 – 2 (mod Mq);
          if v == 0, Mq is prime
          else, Mq is composite
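
   The same test as a runnable Python sketch (practical only for small q; for
   multi-million digit Mq the v^2 step is the FFT-based squaring described above):

          def lucas_lehmer(q):
              Mq = (1 << q) - 1          # Mersenne number 2^q - 1
              v = 4
              for _ in range(q - 2):
                  v = (v * v - 2) % Mq   # the expensive step: squaring a huge number
              return v == 0              # True -> Mq is prime

          # 2^13 - 1 = 8191 is prime; 2^11 - 1 = 2047 = 23 * 89 is not.
          assert lucas_lehmer(13) and not lucas_lehmer(11)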




         Discrete Weighted Transform
 •   Discrete Weighted Transform (DWT)
         • Variable base – each sequence digit can contain differing numbers of bits.
         • Creates power-of-two sequence needed by FFT.
         • Eliminates need to zero pad to convert cyclic, FFT-based convolution into
           acyclic convolution needed for squaring.
 •   Steps:
         • Number to be multiplied divided into variable-length digits.
         • Sequence multiplied by a weight sequence.
         • FFT performed on new, power-of-two length weighted sequence.
 •   Example for Mq = 2^37 – 1 with FFT length of 4:
         • Bits / digit = { 10, 9, 9, 9 }
         • To square 78,314,567,209 (mod Mq), our sequence would be:
             { 553, 93, 381, 291 }
          • 553 + 93 * 2^10 + 381 * 2^19 + 291 * 2^28 = 78,314,567,209
         • Multiply sequence by weights then FFT.
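
   A small Python sketch of the variable-base digit split used in this example
   (the weight sequence and the FFT itself are omitted; the helper names are
   illustrative):

          import math

          q, N = 37, 4                        # Mq = 2^q - 1, FFT length N
          offsets = [math.ceil(q * j / N) for j in range(N + 1)]   # [0, 10, 19, 28, 37]
          bits = [offsets[j + 1] - offsets[j] for j in range(N)]   # [10, 9, 9, 9]

          def split(x):
              # Digit j holds bits[j] bits, taken at bit offset offsets[j].
              return [(x >> offsets[j]) & ((1 << bits[j]) - 1) for j in range(N)]

          digits = split(78_314_567_209)      # -> [553, 93, 381, 291]
          assert sum(d << offsets[j] for j, d in enumerate(digits)) == 78_314_567_209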

         Objective
 • Compare performance of Pentium processors to
   FPGAs.
         • GIMPS chosen because highly optimized code exists.
         • GIMPS utilizes fast floating-point performance of Pentiums.
 • Xilinx Virtex-II Pro 100 (2VP100) chosen as target
   device.
         • Largest available 2VP device.
          • Contains 444 17x17 unsigned multipliers
          • 888 KB of embedded Block RAM
 • Target 12 million digit numbers.
          • Reward offered for the first prime with more than 10 million digits.



         Floating-point FFT
 • GIMPS implementation uses floating-point – requires
   round-off error checks.
 • Using near double-precision floating-point (51-bit
   mantissa):
         • 49 real multipliers can be placed on 2VP100
         • 12 complex multipliers
 • 12 million digit number -> 2 million point FFT
         • 44 million complex multiplies -> 3.7 million cycles
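
   A quick sanity check of these figures, assuming (N/2)*log2(N) complex
   multiplies per radix-2 transform and one multiply per multiplier per cycle:

          N = 2**21                          # ~2 million-point FFT
          per_fft = (N // 2) * 21            # complex multiplies per transform
          total = 2 * per_fft                # forward + inverse -> ~44 million
          cycles = total / 12                # 12 complex multipliers -> ~3.7M cycles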




         All-integer FFT
 • Perform FFT modulo special prime.
         • Prime must have nice roots of one & two.
         • Reductions modulo prime should be simple.
 •   Primes of the form 2^k – 2^m + 1 meet requirements (reduction sketch below the table).

      Prime               # Multipliers   FFT Length   Iteration Time
      2^47 - 2^24 + 1     49              4M           1.9M cycles
      2^64 - 2^32 + 1     26              2M           1.7M cycles
      2^73 - 2^37 + 1     17              2M           2.6M cycles
      2^113 - 2^57 + 1    9               1M           2.3M cycles
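
   Why these primes reduce cheaply, shown as a hedged Python sketch (a generic
   software fold, not the hardware reduction circuit): since 2^k ≡ 2^m - 1 (mod p),
   high bits fold back into low bits using only shifts and adds.

          k, m = 64, 32
          p = (1 << k) - (1 << m) + 1

          def reduce_mod_p(x):
              while x >> k:
                  hi, lo = x >> k, x & ((1 << k) - 1)
                  x = lo + (hi << m) - hi    # hi*2^k ≡ hi*(2^m - 1) (mod p)
              return x % p                   # small final correction

          import random
          a, b = random.randrange(p), random.randrange(p)
          assert reduce_mod_p(a * b) == (a * b) % p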



         Fast Galois Transform
 • All-integer transform using complex numbers modulo a
   Mersenne Prime: a + b*i (mod Mp)
 • Real input sequence folded into complex input with half the
   length.
 • Modular reduction by a Mersenne prime needs only shifts and additions (sketch below the table).


      Prime        # Multipliers   FFT Length   Iteration Time
      2^61 - 1     6 (complex)     1M           3.5M cycles
      2^89 - 1     3 (complex)     512K         3.3M cycles
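
   A hedged sketch of the cheap Mersenne reduction, with a complex multiply on
   top of it (illustrative Python, not the hardware datapath):

          p = 61
          Mp = (1 << p) - 1

          def mod_mersenne(x):
              while x >> p:
                  x = (x >> p) + (x & Mp)    # 2^p ≡ 1 (mod Mp): fold high bits down
              return 0 if x == Mp else x

          def cmul(u, v):                    # (a + b*i)(c + d*i) mod Mp
              (a, b), (c, d) = u, v
              re = mod_mersenne(a * c + (Mp - mod_mersenne(b * d)))
              im = mod_mersenne(a * d + b * c)
              return (re, im)

          import random
          a, b, c, d = (random.randrange(Mp) for _ in range(4))
          assert cmul((a, b), (c, d)) == ((a * c - b * d) % Mp, (a * d + b * c) % Mp)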




         Algorithm Selection
 • Considered algorithms:
         •   Floating-point FFT             3.7M cycles / iteration
         •   All-integer FFT                1.7M cycles / iteration
         •   Galois Transform               3.3M cycles / iteration
         •   Winograd Transform – no acceptable run lengths
         •   Chinese Remainder Theorem – added complexity




         FFT Design
 • Multipliers and adder generated by CoreGen.
 • 10 cycle butterfly latency.
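
   For reference, one radix-2 butterfly in the all-integer FFT amounts to a
   modular twiddle multiply plus an add/subtract pair (illustrative Python using
   the 2^64 - 2^32 + 1 candidate prime from the earlier table, not the actual
   CoreGen pipeline):

          P = (1 << 64) - (1 << 32) + 1

          def butterfly(a, b, w):
              t = (b * w) % P                # twiddle-factor multiply
              return (a + t) % P, (a - t) % P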




         Complete Design
 • 8-point FFTs lower the required cache throughput.
 • Multiple caches allow for overlapping computation with memory
   reads and writes.




         Performance Estimates
 • XC2VP100-6ff1696
 • ISE version 6.2i
 • Iteration time: 34 milliseconds
 • FFT Engine frequency: 80 MHz
 • 2VP100 utilization: 70% slices, 24% BRAMs, 86% multipliers

      Iteration Stage              Time (us)
      Weighted sequence creation*  250
      Forward FFT                  11,500
      DFT coefficient squaring     250
      Inverse FFT                  11,500
      Weight removal*              250
      Carry releasing*             5,000
      Mersenne mod reduction*      5,000
      * Not implemented
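
   The stage times above sum to the quoted iteration time:

          stages_us = [250, 11_500, 250, 11_500, 250, 5_000, 5_000]
          print(sum(stages_us) / 1000, "ms")    # -> 33.75 ms, ~34 ms per iteration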


         Performance Comparison
 • Pentium 4 Performance:
         • Non-SIMD (64-bit multiplies)
         • 6.4 GFLOPs
 • All-Integer transform leverages FPGA strengths:
         • 1.9 billion integer multiplies /sec
         • Transform performance exceeds P4.
 • FPGA vs. Pentium 4:
         • 34 ms vs. 60 ms          => 1.76x speed-up!
         • $10,000 vs. $500         => 20x more costly.
         • 600 sq mm* vs. 146 sq mm => 4.1x more die area.†
            FPGAs would likely be less costly if volume equaled the P4.
         † The P4 area estimate does not include the area required by all of the support chips.
         * 2VP100 die area extrapolated from 2VP20 data supplied by Semiconductor Insights
            (www.semiconductor.com).
         Improvements & Future Work
 • Pentium assembly code is highly optimized, while the HW accelerator
   is a first draft.
 • Algorithm exploration
         • Nussbaumer’s method using 17-bit primes
         • Utilize “nice” form of prime to implement shift-only multiply for first
           two FFT stages.
 • Cluster Implementation
         • Configurable Computing Lab constructing a 16-node 2VP cluster
           with gigabit transceivers as interconnect.
 • Alternative reduced-multiplier butterfly structures
 • Floorplanning



         Conclusions
 • All-integer FFTs attractive for hardware
   implementations of filters / convolutions.
 • GIMPS accelerator designed:
         • Operates at 80 MHz
          • 1.76x faster than a 3.2 GHz Pentium 4
 • Cost of accelerator outweighs benefit in this
   application.





				