# An FPGA Implementation of the Ewald Direct Space and Lennard-Jones Compute Engines

By: David Chui
Supervisor: Professor P. Chow
Computer Engineering Research Group
## Overview

- Introduction and Motivation
- Background and Previous Work
- Hardware Compute Engines
- Results and Performance
- Conclusions and Future Work
## 1. Introduction and Motivation

### What is Molecular Dynamics (MD) Simulation?

- Biomolecular simulations: the structure and behavior of biological systems
- Uses classical mechanics to model a molecular system
- Newtonian equations of motion (F = ma)
- Compute forces and integrate acceleration through time to move atoms
- A large-scale MD system takes years to simulate
### Why Is This an Interesting Computational Problem?

| Quantity                                          | Value   |
|---------------------------------------------------|---------|
| Physical time for simulation                      | 1e-4 s  |
| Time-step size                                    | 1e-15 s |
| Number of time-steps                              | 1e11    |
| Number of atoms in a protein system               | 32,000  |
| Number of interactions                            | 1e9     |
| Number of instructions per force calculation      | 1e3     |
| Total number of machine instructions              | 1e23    |
| Estimated simulation time on a petaflop/s machine | 3 years |
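The last two rows follow from the ones above; as a sanity check, a few lines of Python (illustrative, not part of the original slides) reproduce the arithmetic:

```python
# Reproduce the cost estimate from the table above.
physical_time = 1e-4      # seconds of simulated physical time
timestep = 1e-15          # seconds per time-step
n_steps = physical_time / timestep                 # 1e11 time-steps

interactions_per_step = 1e9                        # pairwise interactions
instr_per_force = 1e3                              # instructions per force
total_instr = n_steps * interactions_per_step * instr_per_force  # 1e23

petaflop = 1e15                                    # instructions per second
years = total_instr / petaflop / (3600 * 24 * 365)
print(f"total instructions: {total_instr:.0e}, runtime: {years:.1f} years")
# -> total instructions: 1e+23, runtime: 3.2 years
```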
### Motivation

- Special-purpose computers for MD simulation have become an interesting application
- FPGA technology offers:
  - Reconfigurability
  - Low cost for system prototyping
  - A short turnaround time and development cycle
  - The latest technology
  - Design portability
### Objectives

- Implement the compute engines on an FPGA
- Calculate the non-bonded interactions in an MD simulation (Lennard-Jones and Ewald Direct Space)
- Explore the hardware resource usage
- Study the trade-off between hardware resources and computational precision
- Analyze the hardware pipeline performance
- Serve as components of a larger project in the future
## 2. Background and Previous Work

### Lennard-Jones Potential

- Attraction due to the instantaneous dipoles of molecules
- Pair-wise non-bonded interactions: O(N²)
- Short-range force
- A cut-off radius reduces the computation
- Complexity reduced to close to O(N)
### Lennard-Jones Potential of Argon Gas

[Plot: Lennard-Jones potential v(r)/k_B (K) of argon versus separation r (nm), showing steep short-range repulsion and a shallow attractive well.]

$$ U_{LJ} = 4\varepsilon \left[ \left( \frac{\sigma}{r} \right)^{12} - \left( \frac{\sigma}{r} \right)^{6} \right] $$
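A minimal software sketch of the per-pair computation follows; the argon-like parameters (σ = 0.34 nm, ε/k_B = 120 K) and the cut-off radius are illustrative assumptions, not values from the slides:

```python
# Lennard-Jones pair potential with a cut-off radius (illustrative values).
SIGMA = 0.34     # nm, assumed argon-like parameter
EPSILON = 120.0  # well depth in units of k_B (Kelvin), assumed
R_CUT = 1.0      # nm, cut-off radius: distant pairs are skipped entirely

def lj_potential(r: float) -> float:
    """U(r)/k_B for one pair; returns 0 beyond the cut-off."""
    if r >= R_CUT:
        return 0.0
    sr6 = (SIGMA / r) ** 6          # (sigma/r)^6, reused for the ^12 term
    return 4.0 * EPSILON * (sr6 * sr6 - sr6)

# The well minimum sits at r = 2^(1/6) * sigma with depth -epsilon.
r_min = 2 ** (1.0 / 6.0) * SIGMA
print(f"U(r_min)/k_B = {lj_potential(r_min):.1f} K")   # about -120 K
```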
### Electrostatic Potential

- Attraction and repulsion due to the electrostatic charges of particles (a long-range force)
- Reformulated using the Ewald summation
- Decomposed into Direct Space and Reciprocal Space
- The Direct Space computation is similar to Lennard-Jones
- Direct Space complexity is close to O(N)
### Ewald Summation - Direct Space

$$ U_{dir} = \frac{1}{2} \sum_{\mathbf{n}}' \sum_{i,j}^{N} q_i q_j \, \frac{\operatorname{erfc}(\alpha\, r_{ij,\mathbf{n}})}{r_{ij,\mathbf{n}}} $$

(α is the Ewald splitting parameter; the primed sum omits the i = j term for n = 0.)

[Plot: erfc(x) for x from 0 to 7; erfc falls from 1 toward 0, so direct-space terms decay rapidly with distance.]
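Below is a minimal Python sketch of the direct-space sum, restricted to the n = 0 image (no periodic replicas); the positions, charges, splitting parameter, and cut-off are made-up illustrative values:

```python
import math

ALPHA = 3.0   # Ewald splitting parameter (1/nm), illustrative value
R_CUT = 1.0   # nm, direct-space cut-off; erfc decays fast beyond it

def ewald_direct_energy(positions, charges):
    """Direct-space energy for the n = 0 image only.

    Summing over i < j is equivalent to the (1/2) * sum over i != j
    in the formula above.
    """
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            dz = positions[i][2] - positions[j][2]
            r = math.sqrt(dx * dx + dy * dy + dz * dz)
            if r < R_CUT:
                energy += charges[i] * charges[j] * math.erfc(ALPHA * r) / r
    return energy

# Two opposite unit charges 0.3 nm apart.
print(ewald_direct_energy([(0.0, 0.0, 0.0), (0.3, 0.0, 0.0)], [1.0, -1.0]))
```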
### Previous Hardware Developments

| Project    | Technology | Year |
|------------|------------|------|
| MD-GRAPE   | 0.6 µm     | 1996 |
| MD-Engine  | 0.8 µm     | 1997 |
| BlueGene/L | 0.13 µm    | 2003 |
| MD-GRAPE3  | 0.13 µm    | 2004 |
### Recent Work: FPGA-Based MD Simulators

Transmogrifier-3 FPGA system
- University of Toronto (2003)
- Estimated speedup of over 20x over software, assuming better hardware resources
- Fixed-point arithmetic, function table lookup, and interpolation

Xilinx Virtex-II Pro XC2VP70 FPGA
- Boston University (2005)
- Achieved a speedup of over 88x over software
- Fixed-point arithmetic, function table lookup, and interpolation
### MD Simulation Software: NAMD

- Parallel runtime system (Charm++/Converse)
- Highly scalable
- Largest system simulated has over 300,000 atoms on 1,000 processors
- Spatial decomposition
- Double-precision floating point
### NAMD: Spatial Decomposition

[Diagram: the simulation box is divided into cells; each cell holds the atoms in its own region of space.]
## 3. Hardware Compute Engines

### Purpose and Design Approach

- Implement the functionality of the software compute object
- Calculate the non-bonded interactions given the particle information
- Fixed-point arithmetic, function table lookup, and interpolation
- Pipelined architecture
### Compute Engine Block Diagram

[Block diagram: particle coordinates i(x, y, z) and j(x, y, z) (fixed-point format {7.25}) feed a stage that computes |Δr|² = |Δx|² + |Δy|² + |Δz|²; |Δr|² addresses a ZBT memory lookup with linear interpolation, and a constant-multiplication stage produces the force F(x, y, z).]
### Function Lookup Table

- The function to be looked up is a function of |r|² (the square of the separation distance between a pair of atoms)
- Block floating-point lookup
- The function is partitioned into regions of different precision
[Diagram: each table entry in the ZBT memory bank stores a value and a slope; the domain of r is split into partitions, each with its own table region.]
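A software sketch of the value-and-slope scheme (table size, domain, and target function are illustrative assumptions; the block floating-point partitioning the real engine uses is omitted for brevity):

```python
import math

# Build a 1K-entry table of (value, slope) pairs over |r|^2, analogous to
# what the hardware stores in the ZBT memory bank. Function and domain assumed.
TABLE_SIZE = 1024
R2_MIN, R2_MAX = 0.01, 1.0
STEP = (R2_MAX - R2_MIN) / TABLE_SIZE

def f(r2: float) -> float:
    """Example target: erfc(r)/r expressed as a function of r^2."""
    r = math.sqrt(r2)
    return math.erfc(r) / r

table = []
for i in range(TABLE_SIZE):
    x0 = R2_MIN + i * STEP
    v0, v1 = f(x0), f(x0 + STEP)
    table.append((v0, (v1 - v0) / STEP))     # stored value and slope

def lookup(r2: float) -> float:
    """Lookup with linear interpolation: value + slope * offset."""
    i = min(int((r2 - R2_MIN) / STEP), TABLE_SIZE - 1)
    value, slope = table[i]
    return value + slope * (r2 - (R2_MIN + i * STEP))

print(lookup(0.5), f(0.5))   # interpolated vs. exact
```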
### Hardware Testing Configuration

[Diagram: NAMD's main() calls two compute objects, Ewald() and Lennard_Jones(); each communicates over a bus with its hardware engine (the Ewald hardware engine and the Lennard-Jones hardware engine).]
## 4. Results and Performance

### Simulation Overview

- Software model of the compute engines
- Different coordinate precisions and lookup table sizes
- Error is measured against a double-precision computation
### Total Energy Fluctuation

[Plot: log(relative RMS fluctuation) of the total energy for Ewald Direct Space, at time-steps of 1.0 fs and 0.1 fs, across precision configurations (10^5x down to 10^1x, table sizes of 1K, 4K, and 16K, and double-precision floating point "FP"); values range from 0 down to about -7.]
### Average Total Energy

[Plot: average total energy |&lt;E&gt;| for Ewald Direct Space, at time-steps of 1.0 fs and 0.1 fs, across the same precision configurations (10^5x down to 10^1x, 1K, 4K, 16K, FP); values range from about 268 to 282.]
### Operating Frequency

| Engine             | Compute Engine | Arithmetic Core |
|--------------------|----------------|-----------------|
| Lennard-Jones      | 43.6 MHz       | 80.0 MHz        |
| Ewald Direct Space | 47.5 MHz       | 82.2 MHz        |
### Latency and Throughput

| Engine             | Latency   | Throughput                       |
|--------------------|-----------|----------------------------------|
| Lennard-Jones      | 59 clocks | 33.33% (one result per 3 clocks) |
| Ewald Direct Space | 44 clocks | 100% (one result per clock)      |
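Taking the arithmetic-core frequencies from the previous table and reading throughput as results per clock (both interpretations are assumptions), the implied pair rates are:

```python
# Effective pair-computation rates implied by the two tables above,
# assuming "throughput" means results per clock at the arithmetic-core rate.
engines = {
    "Lennard-Jones":      {"freq_mhz": 80.0, "throughput": 1 / 3},  # 33.33%
    "Ewald Direct Space": {"freq_mhz": 82.2, "throughput": 1.0},    # 100%
}
for name, e in engines.items():
    pairs_per_sec = e["freq_mhz"] * 1e6 * e["throughput"]
    print(f"{name}: {pairs_per_sec / 1e6:.1f} M pairs/s")
# -> Lennard-Jones: 26.7 M pairs/s; Ewald Direct Space: 82.2 M pairs/s
```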
### Hardware Improvement

Operating frequency:
- Place-and-route constraints
- More pipeline stages

Throughput:
- More hardware resources
- Avoid sharing of multipliers
### Comparison with Previous Work

| Lennard-Jones System | Latency (clocks) | Operating Frequency (MHz) |
|----------------------|------------------|---------------------------|
| Transmogrifier-3     | 11               | 26.0                      |
| Xilinx Virtex-II     | 59               | 80.0                      |

- Pipelined adders and multipliers
- Block floating-point memory lookup
- Support for different types of atoms
## 5. Conclusions and Future Work

### Hardware Precision

- A combination of fixed-point arithmetic, function table lookup, and interpolation can achieve high precision
- Results similar to double precision in RMS energy fluctuation and average energy, with:
  - a coordinate precision of {7.41}
  - a table lookup size of 1K
- Block floating-point memory:
  - maximizes data precision
  - accommodates different types of functions
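To make the {7.41} notation concrete, here is a hypothetical sketch of quantizing a coordinate to fixed point, assuming {7.41} means 7 integer bits and 41 fraction bits in a signed 48-bit word (this reading of the notation is an assumption):

```python
# Assumed reading of {7.41}: 7 integer bits + 41 fraction bits (48-bit word).
INT_BITS = 7
FRAC_BITS = 41
SCALE = 1 << FRAC_BITS
Q_MAX = (1 << (INT_BITS + FRAC_BITS - 1)) - 1   # signed saturation limits
Q_MIN = -(1 << (INT_BITS + FRAC_BITS - 1))

def to_fixed(x: float) -> int:
    """Quantize x to the fixed-point format, rounding and saturating."""
    return max(Q_MIN, min(Q_MAX, round(x * SCALE)))

def to_float(q: int) -> float:
    return q / SCALE

x = 3.141592653589793
err = x - to_float(to_fixed(x))
print(f"quantization error: {err:.2e}")   # on the order of 2^-42
```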
### Hardware Performance

- Compute engine operating frequencies:
  - Ewald Direct Space: 82.2 MHz
  - Lennard-Jones: 80.0 MHz
- Achieving 100 MHz is feasible with newer FPGAs
### Future Work

- Study different types of MD systems
- Simulate the computation error with different table lookup sizes and interpolation orders
- Reduce hardware usage by storing data in block RAMs instead of external ZBT memory