

        Protein Explorer: A Petaflops
          Special Purpose Computer
        System for Molecular Dynamics

              David Gobaud
        Computational Drug Discovery
            Stanford University
               7 March 2006
   Overview
   Background
   Delft Molecular Dynamics Processor
   Protein Explorer Summary
   MDGRAPE-3 Chip
       Force Calculation Pipeline
       J-Particle Memory and Control Units
   System Architecture
   Software
   Cost
   Questions
   Protein Explorer
       Petaflop special-purpose computer system for
        molecular dynamics simulations
            High-precision screening for drug design
            Large-scale simulations of huge proteins/complexes
       PC cluster with special-purpose engines to perform
        the most time-consuming calculations
       Dedicated LSI MDGRAPE-3 chip performs force
        calculations at 165 Gflops or higher
       ETA 2006
Background
   PCs are universal machines
       Various applications
       Hardware can be designed independent of
   Obstacles to high-performance
       Memory bandwidth bottleneck
       Heat dissipation problem
       Can be overcome by developing specialized
Delft Molecular Dynamics
Processor (DMDP)
   Pioneered high-performance special-purpose
       Not able to achieve effective cost-performance
            Demanded too much time and money in development
            Speed of development is a crucial factor affecting cost-
             performance because electronic device technology
             continues to develop rapidly
            Almost all calculations performed by DMDP making
             hardware very complex
GRAPE (GRAvity PipE)
   One of the most successful attempts to
    develop high-performance special-purpose
   Specialized for simulations of classical
   Most time spent on calculation of long-range
    forces (gravitational, Coulomb, and van der Waals)
       Thus special hardware only performs these
       Hardware very simple and cost-effective
GRAPE (GRAvity PipE)
   In 1995 first machine to break teraflops
    barrier in nominal peak performance
   Since 2001 leader in performance has been
    Molecular Dynamics Machine at RIKEN at 78-
   In 2002 a 64-Tflops GRAPE-6 was completed at the
    University of Tokyo
   Protein Explorer launched based on 2002
    University of Tokyo success
Protein Explorer Summary
   Host PC cluster with special purpose boards attached
   Boards calculate only non-bonded forces
       Very simple hardware and software
       No detailed knowledge of hardware needed to write
   Communication time between host and boards is
    proportional to number of particles
   Calculation time proportional to
       N^2 for direct summation of long-range forces
       N*Nc for short range forces where Nc is the average number
        of particles within the cutoff radius
   0.25 byte/1000 operations
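The scaling claims above can be illustrated with a small host-side sketch. It is not the Protein Explorer software: the function names (coulomb_forces_direct, interaction_counts), the unit system (Coulomb constant set to 1), and the absence of periodic images are all assumptions made for illustration. The first function is a plain O(N^2) direct summation; the second counts how many pair evaluations a cutoff scheme would need, N*Nc.

```python
import numpy as np

def coulomb_forces_direct(pos, charge):
    """Direct O(N^2) summation of Coulomb forces (assumed units: k = 1)."""
    diff = pos[:, None, :] - pos[None, :, :]   # r_i - r_j, shape (N, N, 3)
    r2 = np.sum(diff * diff, axis=-1)          # squared pair distances
    np.fill_diagonal(r2, np.inf)               # exclude self-interaction
    inv_r3 = r2 ** -1.5                        # 1 / r^3 (zero on the diagonal)
    qq = charge[:, None] * charge[None, :]
    # Force on i is the sum over j of q_i q_j (r_i - r_j) / r_ij^3
    return np.sum((qq * inv_r3)[:, :, None] * diff, axis=1)

def interaction_counts(pos, cutoff):
    """Return (direct pair evaluations, cutoff pair evaluations N * Nc)."""
    n = len(pos)
    diff = pos[:, None, :] - pos[None, :, :]
    r2 = np.sum(diff * diff, axis=-1)
    within = (r2 < cutoff ** 2) & ~np.eye(n, dtype=bool)
    n_c = within.sum() / n                     # average neighbours inside the cutoff
    return n * (n - 1), n * n_c

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 10.0, size=(200, 3))    # hypothetical 200-particle box
charge = rng.choice([-1.0, 1.0], size=200)
forces = coulomb_forces_direct(pos, charge)
print(forces.shape, interaction_counts(pos, cutoff=2.0))
```

Only positions and charges go in and only forces come out, so the data moved grows with N while the work grows with N^2 or N*Nc. That asymmetry is what lets the boards do the heavy lifting behind a comparatively slow link.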
MDGRAPE-3 Chip - Force
Calculation Pipeline
   3 subtractor units
   6 adder units
   8 multiplier units
   1 function-evaluation unit
   Can perform ~33 equivalent
    operations per clock cycle when it calculates the
    Coulomb force
MDGRAPE-3 Chip - Force
Calculation Pipeline
   Most operations done in 32-bit single
    precision floating point format
   Force accumulation is 80-bit fixed point
       Can be converted to 64-bit double precision
        floating point
   Coordinates stored in 40-bit fixed-point
       Makes implementation of periodic boundary
        condition easy
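Why fixed-point coordinates help with periodic boundaries can be shown in a few lines. The sketch below assumes that the box length is mapped onto the full 40-bit range, which the slides do not state; the helper names (to_fixed, min_image_diff, to_length) are hypothetical. Under that assumption, integer wraparound gives both the periodic wrap and the minimum-image separation with no branches on the box size.

```python
BITS = 40
MOD = 1 << BITS          # number of representable positions across the box
HALF = 1 << (BITS - 1)   # half the box, in fixed-point units

def to_fixed(x, box):
    """Map a coordinate onto a 40-bit unsigned integer; wrapping is automatic."""
    return int(round((x / box) * MOD)) % MOD

def min_image_diff(a, b):
    """Signed separation a - b, read as a 40-bit two's complement number."""
    d = (a - b) % MOD
    return d - MOD if d >= HALF else d

def to_length(d, box):
    """Convert a fixed-point separation back to length units."""
    return d * box / MOD

box = 8.0
a = to_fixed(7.5, box)
b = to_fixed(0.5, box)
# The particles sit 7.0 apart inside the box, but the nearest periodic
# image is only 1.0 away on the other side.
print(to_length(min_image_diff(a, b), box))
```

In hardware the modulo and the sign interpretation come for free from truncating to 40 bits, which is presumably what the slide means by "easy".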
MDGRAPE-3 Chip - Force
Calculation Pipeline
   Function Evaluator
       Most important part of pipeline
       Allows calculation of arbitrary smooth function
       Has memory unit which contains a table for
        polynomial coefficients and exponents and a
        hardwired pipeline for fourth-order polynomial
       Interpolates an arbitrary smooth function g(x)
        using segmented fourth-order polynomials by
        Horner's method
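A software analogue of the function evaluator is sketched below: a table of per-segment fourth-order coefficients, evaluated with Horner's method. The uniform segmentation, the least-squares fit used to fill the table, and the names build_table, horner and evaluate are assumptions for illustration; the slides do not describe how the chip's table is segmented or generated.

```python
import numpy as np

def build_table(g, x_min, x_max, n_segments, order=4):
    """Fit one polynomial per segment in the local variable t = x - segment start."""
    edges = np.linspace(x_min, x_max, n_segments + 1)
    table = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        xs = np.linspace(lo, hi, 4 * (order + 1))
        # np.polyfit returns coefficients with the highest power first
        table.append(np.polyfit(xs - lo, g(xs), order))
    return edges, np.array(table)

def horner(coeffs, t):
    """Evaluate c0*t^4 + c1*t^3 + ... + c4 with one multiply-add per coefficient."""
    acc = 0.0
    for c in coeffs:
        acc = acc * t + c
    return acc

def evaluate(edges, table, x):
    """Look up the segment containing x, then run Horner on its coefficients."""
    k = int(np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(table) - 1))
    return horner(table[k], x - edges[k])

# Example: g(x) = x^(-3/2), the factor that turns r^2 into 1/r^3 for a Coulomb force
g = lambda x: x ** -1.5
edges, table = build_table(g, 1.0, 4.0, n_segments=64)
print(evaluate(edges, table, 2.5), g(2.5))
```

Horner's form needs only four multiply-add steps for a fourth-order polynomial, which maps naturally onto a short hardwired pipeline. Swapping the table contents changes the force law without changing the hardware.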
MDGRAPE-3 Chip - J-Particle
Memory and Control Units
   20 Force Calculation Pipelines
   j-Particle Memory Unit
       32,768 bodies
       “Main Memory”
       6.6 Mbits constructed by static RAM
   Cell-Index Controller
       Controls j-Particle memory – generates addresses
   Force Summation Unit
   Master Controller
       Manages timings and inputs/outputs of the chip
   2 virtual pipelines/physical pipeline
   Physical bandwidth of j-particle unit 2.5
    Gbytes/sec but virtual bandwidth will
    reach 100 Gbytes/sec
   340 arithmetic units
   20 function-evaluator units which work
   165 Gflops at 250MHz
   Chip made by Hitachi
   6M gates
   10M bits of memory
   Chip size is ~220 mm^2
   Dissipate 20 watts at core voltage of
   0.12 W/Gflops, much better than a 3 GHz Pentium 4,
    which is 14 W/Gflops
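The chip figures above hang together arithmetically, and a short check makes the relationships explicit. Two readings are assumptions rather than statements from the slides: that the ~33 equivalent operations are counted per pipeline per clock cycle, and that the virtual bandwidth comes from each j-particle word being shared by all 20 pipelines times 2 virtual pipelines each.

```python
PIPELINES = 20
UNITS_PER_PIPELINE = 3 + 6 + 8      # subtractors + adders + multipliers
OPS_PER_PIPELINE_PER_CYCLE = 33     # assumed: equivalent operations per cycle (Coulomb)
CLOCK_HZ = 250e6
VIRTUAL_PER_PHYSICAL = 2
PHYSICAL_BW_GBYTES = 2.5
POWER_WATTS = 20.0

# Arithmetic units across all pipelines (function evaluators counted separately)
arithmetic_units = PIPELINES * UNITS_PER_PIPELINE

# Peak rate = pipelines x operations per cycle x clock frequency
peak_gflops = PIPELINES * OPS_PER_PIPELINE_PER_CYCLE * CLOCK_HZ / 1e9

# Assumed: every j-particle word read once is reused by all virtual pipelines
virtual_bw_gbytes = PHYSICAL_BW_GBYTES * PIPELINES * VIRTUAL_PER_PHYSICAL

watts_per_gflops = POWER_WATTS / peak_gflops

print(f"arithmetic units : {arithmetic_units}")
print(f"peak performance : {peak_gflops:.0f} Gflops")
print(f"virtual bandwidth: {virtual_bw_gbytes:.0f} Gbytes/sec")
print(f"power efficiency : {watts_per_gflops:.2f} W/Gflops")
```

Under these assumptions the results match the slide's 340 arithmetic units, 165 Gflops, 100 Gbytes/sec and roughly 0.12 W/Gflops.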
System Architecture
   Host PC cluster will use Itanium or Opteron CPU
   256 nodes with 512 CPUs in total
   Performance of node is 3.96 Tflops
       Total reaches a petaflop
   Require 10G-bit/sec network
       InfiniBand, 10G Ethernet, or future Myrinet
   Network topology will be a 2D hyper-crossbar
   Each node has 24 MDGRAPE-3 chips
   MDGRAPE-3 chips connected via 2 PCI-X busses at 133 MHz
   19” rack can house 6 nodes
       43 racks total
   Power dissipation ~150 kW
   Occupy 100 m^2
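The system-level numbers follow from the per-chip figure by simple multiplication. The check below uses only values quoted on the slides (165 Gflops per chip, 24 chips per node, 256 nodes, 6 nodes per rack); the variable names are illustrative.

```python
import math

CHIP_GFLOPS = 165
CHIPS_PER_NODE = 24
NODES = 256
NODES_PER_RACK = 6

node_tflops = CHIP_GFLOPS * CHIPS_PER_NODE / 1000   # per-node peak
system_pflops = node_tflops * NODES / 1000          # whole-system peak
racks = math.ceil(NODES / NODES_PER_RACK)           # last rack is partly filled
total_chips = CHIPS_PER_NODE * NODES

print(f"node peak  : {node_tflops:.2f} Tflops")
print(f"system peak: {system_pflops:.3f} Pflops")
print(f"racks      : {racks}")
print(f"chips      : {total_chips}")
```

This gives 3.96 Tflops per node, just over one petaflops in total, and 43 racks, matching the slide.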
System Architecture
Protein Explorer Board
Software
   Very easy to create programs for
   All computational abilities provided in a
       No special knowledge of device needed
Cost
   $20 million including labor
   Less than $10/Gflop
       At least ten times better than general-
        purpose computers even when compared
        with relatively cheap BlueGene/L
Questions
   What is Myrinet?
   What is a two-dimensional hyper-
    crossbar network topology?
   How does this compare to massive
    distributed computing such as
       Advantages?
       Disadvantages?
