An FPGA Implementation of the
Ewald Direct Space and Lennard-Jones
Compute Engines

By: David Chui

Supervisor: Professor P. Chow
Overview

 Introduction and Motivation
 Background and Previous Work
 Hardware Compute Engines
 Results and Performance
 Conclusions and Future Work
1. Introduction and Motivation
What is Molecular Dynamics (MD)
simulation?
 Biomolecular simulations
 Structure and behavior of biological systems
 Uses classical mechanics to model a molecular system
 Newtonian equations of motion (F = ma)
 Compute forces and integrate the acceleration through
  time to move the atoms (sketched below)
 A large-scale MD system takes years to simulate
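
A minimal Python sketch of this force-then-integrate loop; the names forces_fn and dt are illustrative, and a real MD code would use a symplectic integrator such as velocity Verlet with femtosecond time-steps rather than this simple Euler-style update:

    def md_step(positions, velocities, masses, forces_fn, dt):
        """One Euler-style MD time-step: compute forces (F = m*a), then
        integrate acceleration into velocity and velocity into position."""
        forces = forces_fn(positions)
        velocities = [v + (f / m) * dt for v, f, m in zip(velocities, forces, masses)]
        positions = [x + v * dt for x, v in zip(positions, velocities)]
        return positions, velocities
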
Why is this an interesting computational
problem?
Physical time for simulation                              1e-4 sec
Time-step size                                            1e-15 sec
Number of time-steps                                      1e11
Number of atoms in a protein system                       32,000
Number of interactions                                    1e9
Number of instructions/force calculation                  1e3
Total number of machine instructions                      1e23
Estimated simulation time on a petaflop/sec-capacity      3 years
machine
Motivation

 Special-purpose computers for MD simulation have
  become an interesting application area
 FPGA technology
    Reconfigurable
    Low cost for system prototype
    Short turnaround time and development cycle
    Latest technology
    Design portability
Objectives

 Implement the compute engines on FPGA
 Calculate the non-bonded interactions in an MD
  simulation (Lennard-Jones and Ewald Direct Space)
 Explore the hardware resource usage
 Study the trade-off between hardware resources and
  computational precision
 Analyze the hardware pipeline performance
 Serve as components of a larger project in the future
2. Background and Previous Work
Lennard-Jones Potential

   Attraction due to the instantaneous dipoles of molecules
   Pair-wise non-bonded interactions: O(N²)
   Short-range force
   Use a cut-off radius to reduce computations
   Complexity reduced to close to O(N)
Lennard-Jones Potential of Argon gas
[Plot: v(r)/kb (K) versus r (nm)]
                                                  12    6 
                                    U LJ    4      
                                                 r 
                                                        r    
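
As a software reference for this formula, a short Python sketch of the Lennard-Jones pair energy with a cut-off radius; the parameter names epsilon, sigma, and r_cut are illustrative:

    def lj_pair_energy(r2, epsilon, sigma, r_cut):
        """Lennard-Jones energy for one pair, given the squared separation r2.

        Pairs beyond the cut-off radius contribute nothing, which is what
        reduces the pair-wise O(N^2) cost toward O(N) in practice."""
        if r2 > r_cut * r_cut:
            return 0.0
        sr2 = (sigma * sigma) / r2      # (sigma/r)^2
        sr6 = sr2 ** 3                  # (sigma/r)^6
        sr12 = sr6 * sr6                # (sigma/r)^12
        return 4.0 * epsilon * (sr12 - sr6)
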
Electrostatic Potential

 Attraction and repulsion due to the electrostatic charges
  of particles (a long-range force)
 Reformulate using Ewald Summation
 Decompose to Direct Space and Reciprocal Space
 Direct Space computation similar to Lennard-Jones
 Direct Space complexity close to O(N)
Ewald Summation - Direct Space
U = \frac{1}{2} \sum_{n}' \sum_{i,j}^{N} q_i q_j \, \frac{\operatorname{erfc}(\alpha\, r_{ij,n})}{r_{ij,n}}

[Plot: erfc(x) versus x, for x from 0 to 7; erfc(x) decays from 1 toward 0 as x increases]
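
A corresponding Python sketch of the direct-space pair term, assuming an Ewald splitting parameter alpha and using the standard library's math.erfc:

    import math

    def ewald_direct_pair(r, qi, qj, alpha):
        """Direct-space Ewald energy contribution of one charge pair at distance r.

        erfc(alpha * r) decays rapidly with r, which is why the direct-space
        sum can be truncated with a cut-off radius, as in the Lennard-Jones case."""
        return qi * qj * math.erfc(alpha * r) / r
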
Previous Hardware Developments

Project      Technology   Year

MD-GRAPE     0.6um        1996

MD-Engine    0.8um        1997

BlueGene/L   0.13um       2003

MD-GRAPE3    0.13um       2004
Recent work - FPGA-based MD simulators

Transmogrifier-3 FPGA system
 University of Toronto (2003)
    Estimated speedup of over 20 times over software, assuming
     better hardware resources
    Fixed-point arithmetic, function table lookup, and interpolation


Xilinx Virtex-II Pro XC2VP70 FPGA
 Boston University (2005)
    Achieved a speedup of over 88 times over software
    Fixed-point arithmetic, function table lookup, and interpolation
MD Simulation software - NAMD

 Parallel runtime system (Charm++/Converse)
 Highly scalable
 Largest system simulated has over 300,000 atoms on
  1000 processors
 Spatial decomposition
 Double precision floating point
NAMD - Spatial Decomposition


[Diagram: the simulation box is divided into cells, with the cut-off radius shown relative to the cell size]
3. Hardware Compute Engines
Purpose and Design Approach

 Implement the functionality of the software compute
  object
 Calculate the non-bonded interactions given the particle
  information
 Fixed-point arithmetic, function table lookup, and
  interpolation
 Pipelined architecture
Compute Engine Block Diagram


[Block diagram: particle coordinates i(x, y, z) and j(x, y, z) enter the pipeline; the first stage computes |Δr|² = |Δx|² + |Δy|² + |Δz|², with coordinates in the {7.25} fixed-point format; |Δr|² indexes a ZBT memory lookup with linear interpolation; a final multiplication/addition stage combines the result with a constant to produce the force F(x, y, z) and the energy E]
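
If the {7.25} label is read as a fixed-point format with 7 integer and 25 fractional bits (an assumption; the diagram does not spell it out), the squared-separation stage can be sketched in software as:

    FRAC_BITS = 25   # assuming {7.25} means 7 integer and 25 fractional bits

    def to_fixed(x, frac_bits=FRAC_BITS):
        """Quantize a real coordinate to a fixed-point integer."""
        return int(round(x * (1 << frac_bits)))

    def fixed_mul(a, b, frac_bits=FRAC_BITS):
        """Multiply two fixed-point values and rescale back to frac_bits."""
        return (a * b) >> frac_bits

    def delta_r_squared(xi, yi, zi, xj, yj, zj):
        """|dr|^2 = |dx|^2 + |dy|^2 + |dz|^2 on fixed-point coordinates."""
        dx, dy, dz = xi - xj, yi - yj, zi - zj
        return fixed_mul(dx, dx) + fixed_mul(dy, dy) + fixed_mul(dz, dz)
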
Function Lookup Table

 The function to be looked up is a function of |r|² (the
  squared separation distance between a pair of atoms)
 Block floating point lookup
 Partition function based on different precision
Function Lookup Table


[Diagram: each table entry in the ZBT memory bank stores a function value and a slope; the bank is partitioned by the range of r, and a lookup returns the value and slope used for linear interpolation]
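
A Python sketch of the value-and-slope lookup with linear interpolation; a uniform table is used here for simplicity, whereas the hardware additionally partitions the table (block floating point) to keep relative precision across the range of |r|²:

    import math

    def build_table(f, x_min, x_max, n):
        """Tabulate f over [x_min, x_max) as n (value, slope) pairs."""
        step = (x_max - x_min) / n
        table = []
        for i in range(n):
            x0 = x_min + i * step
            v0, v1 = f(x0), f(x0 + step)
            table.append((v0, (v1 - v0) / step))
        return table, x_min, step

    def lookup(table, x_min, step, x):
        """Approximate f(x) as value + slope * (x - x_i): one read, one multiply-add."""
        i = int((x - x_min) / step)
        value, slope = table[i]
        return value + slope * (x - (x_min + i * step))

    # Example: a 1K-entry table for erfc, the function used by the Ewald engine
    table, x0, h = build_table(math.erfc, 0.0, 8.0, 1024)
    print(lookup(table, x0, h, 2.5), math.erfc(2.5))
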
Hardware Testing Configuration



[Diagram: NAMD main( ) calls the Ewald( ) and Lennard_Jones( ) compute objects; each compute object off-loads its calculation over a communication bus to the corresponding Ewald or Lennard-Jones hardware engine]
4. Results and Performance
Simulation Overview

 Software model
 Different coordinate precisions and lookup table sizes
 Obtain the error compared to computation using double
  precision
Total Energy Fluctuation

[Plot: Total Energy Fluctuation, Ewald Direct Space; log(relative RMS fluctuation) versus precision configuration (10^5x, 10^4x, 10^3x, 10^2x, 10^1x, 1K, 4K, 16K, FP) for time-steps of 1.0 fs and 0.1 fs]
Average Total Energy

[Plot: Average Total Energy, Ewald Direct Space; |<E>| versus precision configuration (10^5x, 10^4x, 10^3x, 10^2x, 10^1x, 1K, 4K, 16K, FP) for time-steps of 1.0 fs and 0.1 fs]
Operating Frequency

                     Compute Engine   Arithmetic Core

Lennard-Jones        43.6 MHz         80.0 MHz

Ewald Direct Space   47.5 MHz         82.2 MHz
Latency and Throughput

                     Latency      Throughput

Lennard-Jones        59 clocks    33.33%

Ewald Direct Space   44 clocks    100%
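
Assuming 100% throughput means one result per clock and 33.33% means one result every three clocks, the sustained rates at the compute-engine frequencies above work out to roughly:

    \text{Ewald Direct Space: } 47.5\ \text{MHz} \times 1 \approx 47.5 \times 10^{6}\ \text{interactions/sec}
    \text{Lennard-Jones: } 43.6\ \text{MHz} \times \tfrac{1}{3} \approx 14.5 \times 10^{6}\ \text{interactions/sec}
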
Hardware Improvement

Operating frequency:
 Place-and-route constraints
 More pipeline stages

Throughput:
 More hardware resources
 Avoid sharing of multipliers
Compared with previous work

Lennard-Jones        Latency      Operating
System               (clocks)     Frequency (MHz)

Transmogrifier-3     11           26.0

Xilinx Virtex-II     59           80.0


 Pipelined adders and multipliers
 Block floating point memory lookup
 Support different types of atoms
5. Conclusions and Future Work
Hardware Precision

 A combination of fixed-point arithmetic, function table
  lookup, and interpolation can achieve high precision
 Similar results in RMS energy fluctuation and average
  energy with:
    Coordinate precision of {7.41}
    Table lookup size of 1K
 Block floating point memory
    Data precision maximized
    Supports different types of functions
Hardware Performance

 Arithmetic core operating frequencies:
    Ewald Direct Space 82.2 MHz
    Lennard-Jones 80.0 MHz
 Achieving 100 MHz is feasible with newer FPGAs
Future Work

 Study different types of MD systems
 Simulate computation error with different table lookup
  sizes and interpolation orders
 Hardware usage: storing data in block RAMs instead of
  external ZBT memory
