# 4_d200_craven_s

```
                    Super-Sized Multiplies:
                   How Do FPGAs Fare in
                Extended Digit Multipliers?
Stephen Craven
Cameron Patterson
Peter Athanas

Configurable Computing Lab
Virginia Tech

Craven                  1                200/MAPLD 2004
Outline
• Background
• Large Integer Multiplication
• GIMPS
• Algorithm Comparison
  • Floating-point FFT
  • All-integer FFT
  • Fast Galois Transform
• Accelerator Design
  • System Design
  • Operation
  • Performance
• Improvements & Future Work

Large Integer Multiplication
• Complexity
  • Fourier-transform based: ~O(N log N)
• Efficient FFT-based multiplication
  • Divide integers into sequences of smaller digits:
      867530924601 → 86, 75, 30, 92, 46, 01
  • Convolution of two sequences is equivalent to multiplication:
      z[n] = Σ_{i+j=n} x[i]·y[j]
  • Element-wise multiplication in the frequency domain ↔ convolution
    in the time domain.

GIMPS
• Why multiply big numbers?
• Great Internet Mersenne Prime Search (GIMPS)
• Primality testing algorithm for Mersenne numbers (2^q − 1) requires
  squaring of multi-million-digit numbers.
• Mersenne primes are the largest known primes – used in cryptography.
• Large integer convolution
• Performance comparison of Pentiums and FPGAs in traditional
  floating-point domains.
• Lucas-Lehmer Primality Test
    M_q = 2^q − 1; v = 4;
    for i = 1:q-2,
        v = v^2 − 2 (mod M_q);
    if v == 0, M_q is prime
    else, M_q is composite

Discrete Weighted Transform
•   Discrete Weighted Transform (DWT)
• Variable base – each sequence digit can contain differing numbers of bits.
• Creates power-of-two sequence needed by FFT.
• Eliminates need to zero pad to convert cyclic, FFT-based convolution into
acyclic convolution needed for squaring.
•   Steps:
• Number to be multiplied divided into variable-length digits.
• Sequence multiplied by a weight sequence.
• FFT performed on new, power-of-two length weighted sequence.
•   Example for M_q = 2^37 − 1 with FFT length of 4:
• Bits / digit = { 10, 9, 9, 9 }
• To square 78,314,567,209 (mod M_q), our sequence would be:
{ 553, 93, 381, 291 }
• 553 + 93·2^10 + 381·2^19 + 291·2^28 = 78,314,567,209
• Multiply sequence by weights then FFT.

Objective
• Compare performance of Pentium processors to
FPGAs.
• GIMPS chosen because highly optimized code exists.
• GIMPS utilizes fast floating-point performance of Pentiums.
• Xilinx Virtex-II Pro 100 (2VP100) chosen as target device.
  • Largest available 2VP device.
  • Contains 444 17×17 unsigned multipliers.
  • 888 kB of embedded Block RAM.
• Target: 12-million-digit numbers.
  • A reward is offered for the first prime above 10 million digits.

Floating-point FFT
• GIMPS implementation uses floating-point – requires
round-off error checks.
• Using near double-precision floating-point (51-bit
mantissa):
• 49 real multipliers can be placed on 2VP100
• 12 complex multipliers
• 12 million digit number -> 2 million point FFT
• 44 million complex multiplies -> 3.7 million cycles

All-integer FFT
• Perform FFT modulo special prime.
• Prime must have nice roots of one & two.
• Reductions modulo prime should be simple.
• Primes of the form 2^k − 2^m + 1 meet these requirements.

Prime            # Multipliers   FFT Length   Iteration Time
2^47−2^24+1      49              4M           1.9M cycles
2^64−2^32+1      26              2M           1.7M cycles
2^73−2^37+1      17              2M           2.6M cycles
2^113−2^57+1     9               1M           2.3M cycles

Fast Galois Transform
• All-integer transform using complex numbers modulo a
Mersenne prime: a + b·i (mod M_p)
• Real input sequence folded into a complex input of half the
length.
• Modular reductions by Mersenne primes need only simple additions.

Prime       # Multipliers   FFT Length   Iteration Time
2^61 − 1    6 (complex)     1M           3.5M cycles
2^89 − 1    3 (complex)     512K         3.3M cycles

Algorithm Selection
• Considered algorithms:
•   Floating-point FFT             3.7M cycles / iteration
•   All-integer FFT                1.7M cycles / iteration
•   Galois Transform               3.3M cycles / iteration
•   Winograd Transform – no acceptable run lengths
•   Chinese Remainder Theorem – added complexity

FFT Design
• Multipliers and adder generated by CoreGen.
• 10 cycle butterfly latency.

Complete Design
• 8-point FFTs lower cache throughput requirements.
• Multiple caches allow computation to overlap with memory accesses.

Performance Estimates
• XC2VP100-6ff1696
• ISE version 6.2i
• Iteration time: 34 milliseconds
• FFT engine frequency: 80 MHz
• 2VP100 utilization:
  • 70% slices
  • 24% BRAMs
  • 86% multipliers

Iteration Stage               Time (µs)
Weighted sequence creation*   250
Forward FFT                   11,500
DFT coefficient squaring      250
Inverse FFT                   11,500
Weight removal*               250
Carry releasing*              5,000
Mersenne mod reduction*       5,000
* Not implemented

Performance Comparison
• Pentium 4 Performance:
• Non-SIMD (64-bit multiplies)
• 6.4 GFLOPs
• All-Integer transform leverages FPGA strengths:
• 1.9 billion integer multiplies /sec
• Transform performance exceeds P4.
• FPGA vs. Pentium 4:
  • 34 ms vs. 60 ms           => 1.76× speed-up!
  • $10,000 vs. $500          => 20× more costly.‡
  • 600 sq mm* vs. 146 sq mm  => 4.1× more die area.†
‡ FPGAs would likely be less costly if produced in P4 volumes.
† The P4 area estimate does not include the area required by all of the support chips.
* 2VP100 die area extrapolated from 2VP20 data supplied by Semiconductor Insights
(www.semiconductor.com).
Improvements & Future Work
• Pentium assembly code is highly optimized, while the HW accelerator
is a first draft.
• Algorithm exploration
• Nussbaumer’s method using 17-bit primes
• Utilize “nice” form of prime to implement shift-only multiply for first
two FFT stages.
• Cluster Implementation
• Configurable Computing Lab constructing a 16-node 2VP cluster
with gigabit transceivers as interconnect.
• Alternative reduced-multiplier butterfly structures
• Floorplanning

Conclusions
• All-integer FFTs attractive for hardware
implementations of filters / convolutions.
• GIMPS accelerator designed:
• Operates at 80 MHz
• 1.76× the speed of a 3.2 GHz Pentium 4
• Cost of accelerator outweighs benefit in this
application.


```
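The digit-convolution identity on the "Large Integer Multiplication" slide can be sketched in a few lines. The direct double loop below stands in for the FFT that makes the method fast in practice, and base-100 digits mirror the 86, 75, 30, … example; function names are illustrative, not from the talk.

```python
def digits(n, base=100):
    """Split a non-negative integer into little-endian base-`base` digits."""
    ds = []
    while n:
        ds.append(n % base)
        n //= base
    return ds or [0]

def conv_multiply(x, y, base=100):
    """Multiply via acyclic convolution of digit sequences.

    z[n] = sum over i+j=n of x[i]*y[j]; evaluating the resulting
    polynomial at `base` releases the carries and recovers the product.
    """
    a, b = digits(x, base), digits(y, base)
    z = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            z[i + j] += ai * bj
    return sum(zn * base ** n for n, zn in enumerate(z))
```

An FFT-based version would transform `a` and `b`, multiply element-wise, and inverse-transform, which is exactly the frequency-domain equivalence the slide states.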
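The Lucas-Lehmer pseudocode on the GIMPS slide translates directly to Python; each loop iteration performs the v² − 2 (mod M_q) squaring that the accelerator is built around. This is a sketch using native big integers, whereas real GIMPS code performs the squaring with the weighted transform.

```python
def lucas_lehmer(q):
    """Lucas-Lehmer test for M_q = 2**q - 1 (q an odd prime).

    M_q is prime iff v reaches 0 after q-2 squarings of the
    sequence v = 4; v = v*v - 2 (mod M_q).
    """
    m = (1 << q) - 1
    v = 4
    for _ in range(q - 2):
        v = (v * v - 2) % m
    return v == 0
```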
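The variable-base example on the DWT slide can be checked mechanically: the digit widths {10, 9, 9, 9} place the digits at cumulative bit offsets 0, 10, 19, 28, and reassembling them recovers the original number.

```python
# Digit widths and sequence from the DWT example (M_q = 2**37 - 1, length-4 FFT)
bits = [10, 9, 9, 9]
seq = [553, 93, 381, 291]

# Each digit sits at the cumulative bit offset of the widths before it
offsets = [sum(bits[:i]) for i in range(len(bits))]   # [0, 10, 19, 28]

# Reassembling the variable-base digits recovers the original number
value = sum(d << o for d, o in zip(seq, offsets))
```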
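The "simple reduction" claims for both prime families follow from their shapes: for p = 2^64 − 2^32 + 1 the identity 2^64 ≡ 2^32 − 1 (mod p) folds a wide product down with shifts and adds, and for a Mersenne prime 2^p − 1 the identity 2^p ≡ 1 does the same. A sketch under those identities (function names are mine, not from the talk):

```python
P = (1 << 64) - (1 << 32) + 1   # prime of the form 2**k - 2**m + 1

def reduce_special(x):
    """Reduce a wide product mod P using 2**64 ≡ 2**32 - 1 (mod P)."""
    while x >= 1 << 64:
        hi, lo = x >> 64, x & ((1 << 64) - 1)
        x = lo + hi * ((1 << 32) - 1)
    return x if x < P else x - P     # x < 2**64 here, so one subtract suffices

def reduce_mersenne(x, p):
    """Reduce mod 2**p - 1 by folding high bits down: 2**p ≡ 1."""
    m = (1 << p) - 1
    while x > m:
        x = (x >> p) + (x & m)
    return 0 if x == m else x
```

In hardware both loops become a couple of adder stages, with no division anywhere, which is why these prime shapes suit the all-integer and Galois transforms.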