SDSC Summer Institute, July 17 2006




Overview of HPC – Eye Towards Petascale Computing



                       Amit Majumdar


     Scientific Computing Applications Group
        San Diego Supercomputer Center
        University of California San Diego







                          Topics

1. Supercomputing in General

2. Supercomputers at SDSC

3. Eye Towards Petascale Computing







      DOE, DOD, NASA, NSF Centers in US
• DOE National Labs – LANL, LLNL, Sandia
• DOE Office of Science Labs – ORNL, NERSC
• DOD, NASA Supercomputer Centers

• National Science Foundation supercomputer centers
  for academic users
  •   San Diego Supercomputer Center (UCSD)
  •   National Center for Supercomputing Applications (UIUC)
  •   Pittsburgh Supercomputing Center (Pittsburgh)
  •   Others at Texas, Indiana-Purdue, ANL-Chicago



       TeraGrid: Integrating NSF Cyberinfrastructure



[Map of TeraGrid sites across the US: SDSC, Caltech, USC-ISI, TACC, UC/ANL, NCSA, PU, IU, ORNL, PSC, NCAR, Utah, Iowa, Wisc, Buffalo, Cornell, and UNC-RENCI.]

TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer
Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center
for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh
Supercomputing Center, and the National Center for Atmospheric Research.




            Measure of Supercomputers

• Top 500 list (HPL code performance)
   • Is one of the measures, but not the measure
   • Japan’s Earth Simulator (NEC) was on top for 3 years
• In Nov 2005 the IBM BlueGene/L at LLNL reached the top spot:
  ~65,000 nodes, 280 TFLOP/s on HPL, 367 TFLOP/s peak
   • First 100 TFLOP sustained on a real application last year
   • Very recently 200+ TFLOP sustained on a real application
• New HPCC benchmarks
• Many others – NAS, NERSC, NSF, DOD TI06 etc.
• Ultimate measure is usefulness of a center for you –
  enabling better or new science through simulations on
  balanced machines




                       Top500 Benchmarks

• 27th Top 500 – June 2006

• NSF Supercomputer Centers in Top500
        Procs    Rmax (GFLOP)    Rpeak (GFLOP)    Nmax
 #37 NCSA, PowerEdge 1750, P4 Xeon, 3.06 GHz, Myrinet
        2500     9819            15300            630000
 #44 SDSC, IBM Power4, p655/p690, 1.5/1.7 GHz, Federation
        2464     9121            15628            605000
 #55 PSC, Cray XT3, 2.4 GHz AMD x86, XT3 internal interconnect
        2060     7935.82         9888






         Historical Trends in Top500




• 1000× increase in top machine performance in 10 years



                     Other Benchmarks

• HPCC – High Performance Computing Challenge
  benchmarks – no rankings

• NSF benchmarks – HPCC, SPIO, and applications: WRF, OOCORE,
  GAMESS, MILC, PARATEC, HOMME (these are changing; new ones
  are being considered)


• DoD HPCMP – TI06 benchmarks



               Kiviat diagrams







Capability Computing
• The full power of a machine is used for a given scientific problem, utilizing its CPUs, memory, interconnect, and I/O performance
• Enables the solution of problems that cannot otherwise be solved in a reasonable period of time – the figure of merit is time to solution
• E.g., moving from a two-dimensional to a three-dimensional simulation, using finer grids, or using more realistic models

Capacity Computing
• Modest problems are tackled, often simultaneously, on a machine, each with less demanding requirements
• Smaller or cheaper systems are used for capacity computing, where smaller problems are solved
• Parametric studies or exploring design alternatives
• The main figure of merit is sustained performance per unit cost



Strong Scaling
• For a fixed problem size, how does the time to solution vary with the number of processors?
• Run a fixed-size problem and plot the speedup
• When scaling of parallel codes is discussed, it is normally strong scaling that is being referred to

Weak Scaling
• How the time to solution varies with processor count, with a fixed problem size per processor
• Interesting for O(N) algorithms, where perfect weak scaling is a constant time to solution, independent of processor count
• Deviations from this indicate that either
   • the algorithm is not truly O(N), or
   • the overhead due to parallelism is increasing, or both
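As a concrete illustration of the two definitions, here is a minimal Python sketch (the timings are made up, not measurements) that turns raw wall-clock times into the usual figures of merit:

```python
# Minimal sketch with made-up timings: turn wall-clock times from strong- and
# weak-scaling runs into speedup and efficiency figures.

strong_times = {1: 1000.0, 64: 18.2, 256: 5.1}  # procs -> seconds, fixed total problem
weak_times = {1: 0.60, 1024: 0.70}              # procs -> seconds/step, fixed size per proc

def strong_scaling(times):
    t1 = times[1]
    for p, t in sorted(times.items()):
        speedup = t1 / t
        print(f"{p:6d} procs: speedup {speedup:7.1f}, parallel efficiency {speedup / p:6.1%}")

def weak_scaling(times):
    t1 = times[1]
    for p, t in sorted(times.items()):
        # Perfect weak scaling keeps the time constant, so efficiency is t(1)/t(P).
        print(f"{p:6d} procs: weak-scaling efficiency {t1 / t:6.1%}")

strong_scaling(strong_times)
weak_scaling(weak_times)
```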





      Weak Vs Strong Scaling Examples

• The linked cell algorithm employed in DL_POLY 3
  [1] for the short ranged forces should be strictly O(N)
  in time.
• We study the weak scaling of three model systems (two shown
  next); the times are reported for HPCx, a large IBM p690+
  cluster sited at Daresbury.
• http://www.cse.clrc.ac.uk/arc/dlpoly_scale.shtml
• I.J.Bush and W.Smith, CCLRC Daresbury
  Laboratory






Weak scaling for Argon is shown. The smallest system size is 32,000 atoms, the
largest 32,768,000. The scaling is very good, the time per step increasing from
0.6 s to 0.7 s in going from 1 processor to 1024. This simulation is a direct test
of the linked-cell algorithm, as it requires only short-ranged forces, and the
results show it is behaving as expected.






Weak scaling for water. The time per step increases from 1.9 s on 1 processor,
where the system size is 20,736 particles, to 3.9 s on 1024 processors (system
size 21,233,664). In this case Ewald terms must also be calculated, and constraint
forces must be evaluated as well. The constraint forces are short ranged and should
scale as O(N), but their calculation requires a large number of short messages to
be sent, so latency effects become appreciable.
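Expressed as a weak-scaling efficiency t(1)/t(1024), the times quoted in these two examples give:

\[
E_{\text{Ar}} = \frac{0.6}{0.7} \approx 0.86, \qquad E_{\text{water}} = \frac{1.9}{3.9} \approx 0.49
\]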




     Next Leap in Supercomputer Power

• PetaFLOP: 10^15 floating point operations/sec

• Multiple PFLOP/s machines are expected in the US during 2008–2011

• NSF and DOE (ORNL, LANL, NNSA) are considering this

• Similar initiatives are under way in Japan and Europe





                           Topic

1. Supercomputing in General

2. Supercomputers at SDSC

3. Eye Towards Petascale Computing






SDSC's focus: Apps in top two quadrants

[Chart: applications placed by Data (increasing I/O and storage, vertical axis) versus Compute (increasing FLOPS, horizontal axis). Quadrants: Data Storage/Preservation Environment (upper left), Extreme I/O Environment (upper right), Campus/Departmental/Desktop Computing (lower left), Traditional HEC Environment (lower right). The SDSC Data Science Environment spans the top quadrants, with applications such as SCEC post-processing and simulation, ENZO simulation and post-processing, climate, turbulence-field CFD, EOL, NVO, and Cypres; compute-intensive applications such as Gaussian, CHARMM, CPMD, QCD, protein folding, and turbulence reattachment-length studies sit toward the traditional HEC quadrant. Annotations: 1. time variation of field-variable simulation; 2. out-of-core.]


SDSC Production Computing Environment
(25 TF compute, 1.4 PB disk, 6 PB tape)

• TeraGrid Linux Cluster: IBM/Intel IA-64, 4.4 TFlops
• DataStar: IBM Power4+, 15.6 TFlops
• Blue Gene Data: IBM PowerPC, 5.7 TFlops
• Storage Area Network disk: 1400 TB, served by a Sun F15K disk server
• Archival systems: 6 PB capacity (~3 PB used)





DataStar is a powerful compute resource well-suited to “extreme I/O” applications

• Peak speed 15.6 TFlops
• #44 in June 2006 Top500 list
• IBM Power4+ processors (2,528 total)
• Hybrid of 2 node types, all on a single switch
   • 272 8-way p655 nodes:
      • 176 with 1.5 GHz processors, 16 GB/node (2 GB/proc)
      • 96 with 1.7 GHz processors, 32 GB/node (4 GB/proc)
   • 11 32-way p690 nodes: 1.3 and 1.7 GHz, 64–256 GB/node (2–8 GB/proc)
• Federation switch: ~6 μsec latency, ~1.4 GB/sec point-to-point bandwidth
• At 283 nodes, ours is one of the largest IBM Federation switches
• All nodes are direct-attached to high-performance SAN disk: 3.8 GB/sec write, 2.0 GB/sec read to GPFS
• GPFS now has 125 TB capacity
• 226 TB of gpfs-wan across NCSA, ANL

Due to consistently high demand, in FY05 we added 96 1.7 GHz/32 GB p655 nodes and increased GPFS storage from 60 to 125 TB:
   - Enables 2048-processor capability jobs
   - ~50% more throughput capacity
   - More GPFS capacity and bandwidth



          BG System Overview:
 Novel, massively parallel system from IBM
• Full system installed at LLNL from 4Q04 to 3Q05
   •   65,000+ compute nodes in 64 racks
   •   Each node has two low-power PowerPC processors plus memory
   •   Compact footprint with very high processor density
   •   Slow processors & modest memory per processor
   •   Very high peak speed of 367 Tflop/s
   •   #1 Linpack speed of 280 Tflop/s
• 1024 compute nodes in single rack installed at SDSC in 4Q04
   • Maximum I/O-configuration with 128 I/O nodes for data-intensive computing
• Systems at 14 sites outside IBM & 4 within IBM as of 2Q06
• Need to select apps carefully
   • Must scale (at least weakly) to many processors (because they’re slow)
   • Must fit in limited memory






   SDSC was first academic institution with
        an IBM Blue Gene system
SDSC procured a 1-rack system in 12/04, used initially for code evaluation and
benchmarking; production began 10/05. (The LLNL system is 64 racks.)

Blue Gene packaging hierarchy (peak speed / memory at each level):
• Chip (2 processors): 2.8/5.6 GF/s, 4 MB
• Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
• Node Board (16 compute cards = 32 chips, 4x4x2): 90/180 GF/s, 8 GB DDR
• Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
• System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR

The SDSC rack has the maximum ratio of I/O to compute nodes, 1:8 (LLNL's is 1:64).
Each of the 128 I/O nodes in the rack has a 1 Gbps Ethernet connection => 16 GB/s
per rack potential.
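The per-rack figure above is just the aggregate of the Gigabit Ethernet links (note the bit-to-byte conversion):

\[
128\ \text{I/O nodes} \times 1\ \tfrac{\text{Gbit}}{\text{s}} = 128\ \tfrac{\text{Gbit}}{\text{s}} = 16\ \tfrac{\text{GByte}}{\text{s}}\ \text{per rack}
\]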





           SDSC Blue Gene - a new resource
In Dec '04, SDSC brought in a single-rack Blue Gene system
 - Initially an experimental system to evaluate NSF applications on this unique architecture
 - Tailored to high-I/O applications
 - Entered production as an allocated resource in October 2005

•    First academic installation of this novel architecture

•    Configured for data-intensive computing
      •   1,024 compute nodes, 128 I/O nodes
      •   Peak compute performance of 5.7 TFLOPS
      •   Two 700-MHz PowerPC 440 CPUs, 512 MB per node
      •   IBM network: 4 μs latency, 0.16 GB/sec pp-bandwidth
      •   I/O rates of 3.4 GB/s for writes and 2.7 GB/s for reads
          achieved on GPFS-WAN
      •   Has own GPFS of 20 TB and gpfs-wan

•    System targets runs of 512 CPUs or more

•    Production in October 2005
      •   Multiple 1 million-SU awards at LRAC and several smaller
          awards for physics, engineering, biochemistry






BG System Overview: Processor Chip (1)
[Block diagram of the Blue Gene compute ASIC: two PowerPC 440 CPUs, each with a "double FPU" and 32k/32k L1 caches (one CPU also acts as the I/O processor); small shared L2 units and a shared SRAM buffer with snoop logic; a shared L3 directory for the 4 MB EDRAM L3 cache; and a DDR controller with ECC driving a 144-bit-wide interface to 512 MB of external DDR memory. On-chip network interfaces: 3-D torus (6 links out and 6 in at 1.4 Gbit/s each), tree (3 out and 3 in at 2.8 Gbit/s each), global interrupt, Gbit Ethernet, and JTAG access.]

  BG System Overview: Processor Chip (2)
          (= System-on-a-chip)
• Two 700-MHz PowerPC 440 processors
   •   Each with two floating-point units
   •   Each with 32-kB L1 data caches that are not coherent
   •   4 flops/proc-clock peak (=2.8 Gflop/s-proc); see the worked numbers after this list
   •   2 8-B loads or stores / proc-clock peak in L1 (=11.2 GB/s-proc)
• Shared 2-kB L2 cache (or prefetch buffer)
• Shared 4-MB L3 cache
• Five network controllers (though not all wired to each node)
   •   3-D torus (for point-to-point MPI operations: 175 MB/s nom x 6 links x 2 ways)
   •   Tree (for most collective MPI operations: 350 MB/s nom x 3 links x 2 ways)
   •   Global interrupt (for MPI_Barrier: low latency)
   •   Gigabit Ethernet (for I/O)
   •   JTAG (for machine control)
• Memory controller for 512 MB of off-chip, shared memory
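As a quick consistency check, the peak figures quoted in this deck follow directly from the numbers above (4 flops per processor-clock at 700 MHz, 2 processors per node, 64 racks x 1,024 nodes = 65,536 nodes):

\[
4\ \tfrac{\text{flops}}{\text{clock}} \times 0.7\ \text{GHz} = 2.8\ \tfrac{\text{Gflop}}{\text{s}}\ \text{per processor},\qquad
2 \times 2.8 = 5.6\ \tfrac{\text{Gflop}}{\text{s}}\ \text{per node},\qquad
5.6 \times 65{,}536 \approx 367\ \tfrac{\text{Tflop}}{\text{s}}\ \text{peak}
\]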





DataStar p655 Usage, by Node Size







SDSC Academic Use, by Directorate







            Strategic Applications Collaborations

•   Cellulose to Ethanol : Biochemistry (J. Brady, Cornell)
•   LES Turbulence :       Mechanics (M. Krishnan, U. Minnesota)
•   NEES :                 Earthquake Engr (Ahmed Elgamal, UCSD)
•   ENZO :                 Astronomy (M. Norman, UCSD)
•   EM Tomography :        Neuroscience (M. Ellisman, UCSD)
•   DNS Turbulence :       Aerospace Engr (PK Yeung, Georgia
                           Tech)
•   NVO Mosaicking :       Astronomy (R. Williams, Caltech, Alex
                           Szalay, Johns Hopkins)
•   Understanding Pronouns: Linguistics (A. Kehler, UCSD)
•   Climate :              Atmospheric Sc. (C. Wunsch, MIT)
•   Protein Structure :    Biochemistry (D. Baker, Univ. of
                           Washington)
•   SCEC, TeraShake :      Geological Science (T. Jordan and C.
                           Kesselman USC, K. Olsen UCSB,
                           B. Minster, SIO)







                           Topic

1. Supercomputing in General

2. Supercomputers at SDSC

3. Eye Towards Petascale Computing
    3.1 Petascale Hardware
    3.2 Petascale Software








    3.1 Petascale Hardware







NERSC director Horst Simon (a few days ago):

When I talk about petaflop computing, what I have in mind is
the longer-term perspective, the time when the HPC
community enters the age of petascale computing.

What I mean is the time when you must achieve petaflop
Rmax performance to make the TOP500 list. An intriguing
question is, when will this happen?

If you do a straight-line extrapolation from today's TOP500
list, you come up with the year 2016. In any case, it's eight to
10 years from now, and we will have to master several
challenges to reach the age of petascale computing.




                         The Memory Wall




Source: “Getting up to speed: The Future of Supercomputing”, NRC, 2004


 Number of processors in the most
highly parallel system in the TOP500
[Chart: number of processors in the most highly parallel TOP500 system, 1993–2005 (y-axis 0–70,000). The curve runs from the Intel Paragon XP in the early 1990s, through ASCI Red in the late 1990s, up to the IBM BG/L near the top of the axis in 2005.]






    Petascale Power Problem (Horst Simon)
• Power consumption is really pushing the limits of the facilities most computing centers have
   • A peak-petaflop Cray XT3 or cluster would need 8–9 megawatts for the computer alone
   • The 2011 HPCS sustained-petaflop systems would require about 20 megawatts
   • Efficient power solutions are needed; Blue Gene is better, but it still requires many megawatts for a petaflop system

• At 10 cents per kilowatt-hour, a 20-megawatt system would cost $12 million or more a year just for electricity

• Need to exploit different processor curves, such as the low-cost processors used in embedded technology. The Cell processor comes from low-end, embedded game technology and has great potential, but there is a huge step from an initial assessment to a production solution

• Don't forget the space problem and MTBF




                                Applications and HPCC
                    (next 7 slides from Rolf Rabenseifner, U. of Stuttgart)


[Chart: HPCC kernels and applications placed by spatial locality (vertical axis) versus temporal locality (horizontal axis). PTRANS and STREAM have high spatial but low temporal locality; HPL and DGEMM have high spatial and high temporal locality; RandomAccess has low spatial and low temporal locality; FFT has low spatial but high temporal locality. Applications such as CFD, radar cross section, TSP, and DSP fall in between.]





      Balance Analysis of Machines with HPCC

• Balance expressed as a set of ratios
  • Normalized by CPU speed (HPL Tflop/s rate); a small example follows this list
• Basis
  •   Linpack (HPL): Computational Speed
  •   Parallel STREAM Copy or Triad: Memory bandwidth
  •   Random Ring Bandwidth: Inter-node communication
  •   FFT: low spatial and high temporal locality
  •   PTRANS: total communication capacity of network
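A minimal sketch of how such ratios could be computed from a set of HPCC results; the numbers below are made up for illustration, not measured values:

```python
# Minimal sketch of HPCC-style balance ratios; the inputs are made-up values
# for a hypothetical machine, NOT measured HPCC results.

hpcc = {
    "hpl_tflops": 9.1,         # HPL (Linpack) rate, Tflop/s
    "stream_triad_gbs": 95.0,  # aggregate STREAM Triad bandwidth, GB/s
    "random_ring_gbs": 180.0,  # aggregate Random Ring bandwidth, GB/s
    "fft_gflops": 210.0,       # global FFT rate, Gflop/s
    "ptrans_gbs": 270.0,       # PTRANS rate, GB/s
}

def balance_ratios(r):
    """Express each measurement per HPL Tflop/s, as in the ratios above."""
    hpl = r["hpl_tflops"]
    return {
        "memory bandwidth / HPL-Tflop/s": r["stream_triad_gbs"] / hpl,
        "inter-node comm. / HPL-Tflop/s": r["random_ring_gbs"] / hpl,
        "FFT / HPL-Tflop/s": r["fft_gflops"] / hpl,
        "PTRANS / HPL-Tflop/s": r["ptrans_gbs"] / hpl,
    }

for name, value in balance_ratios(hpcc).items():
    print(f"{name:32s} {value:6.1f}")
```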







Balance between memory and CPU speed





Balance between Random Ring BW (network BW)
               and CPU speed





Balance between Fast Fourier Transform (FFTE)
              and CPU Speed





Balance between Matrix Transpose (PTRANS) and
                 CPU Speed







Balance of Today's Machines

• Today, balance factors are in a range of

  •   20   inter-node communication / HPL-TFlop/s
  •   10   memory speed / HPL-TFlop/s
  •   20   FFTE / HPL-TFlop/s
  •   30   PTRANS / HPL-TFlop/s







A Petascale Machine
• 10 GFLOP/s processors × 100,000 procs (worked out below)
  • A higher-GFLOP/s machine – fewer processors (< 100K)
  • A lower-GFLOP/s machine – more processors (> 1000K)
• Commodity processors
  • Heterogeneous processors
     • ClearSpeed cards, graphics cards, FPGAs
     • Cray Adaptive Supercomputing
         – combines standard microprocessors (scalar processing), vector
           processing, multithreading, and hardware accelerators in one
           high-performance computing platform
     • Sony–Toshiba–IBM Cell processors
• Memory, interconnect, and I/O performance should scale in a
  balanced way with CPU speed
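For reference, the arithmetic behind the first bullet (an illustrative combination of per-processor speed and processor count, not a specific machine design):

\[
10\ \tfrac{\text{Gflop}}{\text{s}} \times 100{,}000\ \text{processors} = 10^{15}\ \tfrac{\text{flop}}{\text{s}} = 1\ \text{Pflop/s peak}
\]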




            ORNL Petascale Roadmap

• ORNL will reach peak petaflops performance in
  stages, now through 2008:
     • 2006: upgrade the 25-teraflops Cray XT3 (5,294 nodes, each with a
       2.4-GHz AMD Opteron processor and 2 GB of memory) system
       to 50 teraflops via dual-core AMD Opteron™ processors
     • Late 2006: move to 100 teraflops with a system codenamed "Hood"
     • Late 2007: upgrade "Hood" to 250 teraflops
     • Late 2008: move to peak petaflops with a new architecture codenamed "Baker"
  • Cray Adaptive Supercomputing - Powerful compilers and
    other software will automatically match an application to
    the processor blade that is best suited for it.






    3.2 Petascale Software







                  Parallel Applications

• Higher level domain decomposition of some sort or
  embarrassingly parallel types (astro/physics, engr,
  chemistry, CFD, MD, climate, materials)

• Mid level parallel math libraries (linear system
  solvers, FFT, random# generators etc.)

• Lower level search/sort algorithms and other computer science algorithms





                  Application Scaling
1. Performance characterization and prediction of
   apps – computer science approach

2. Scaling current apps for petascale machines –
   computational science approach

3. Developing petascale apps/algorithms –
   numerical methods approach

4. New languages for petascale applications


   1. Performance Characterization/Prediction –
            computer science approach
• Characterizing and understanding current applications'
  performance (next talk by Pfeiffer)
• How much time is spent in memory/cache access, and what the access pattern is
• How much time is spent in communication, and what kind of communication pattern
  is involved, i.e., processor-to-processor communication, global communication
  operations in which all the processors participate, or both
• How much time is spent in I/O, and what the I/O pattern is
• Understanding these will allow us to figure out the importance/effect of the
  various parts of a supercomputer on application performance




 Performance Modeling & Characterization

• PMaC lab at SDSC (www.sdsc.edu/PMaC)

Machine Profile:
Rate at which a machine can perform different operations – collecting rate op1, op2, op3

Application Signature:
Operations needed to be carried out by the application – collecting number of op1, op2, and op3

Convolution:
Mapping of a machine's performance (rates) to the application's needed operations:

   Execution time = (operation1 / rate op1) • (operation2 / rate op2) • (operation3 / rate op3)

where the • operator can be + or MAX, depending on operation overlap
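A minimal sketch of this convolution, with made-up operation counts and rates (hypothetical values, not actual PMaC machine profiles or application signatures):

```python
# Minimal sketch of the convolution above; counts and rates are made up.

app_signature = {"flops": 4.0e12, "mem_refs": 1.0e12, "msg_bytes": 2.0e11}    # operation counts
machine_profile = {"flops": 5.0e12, "mem_refs": 5.0e11, "msg_bytes": 1.0e11}  # rates per second

def predicted_time(signature, profile, overlap=False):
    """Combine per-operation times with + (no overlap) or MAX (full overlap)."""
    times = [signature[op] / profile[op] for op in signature]
    return max(times) if overlap else sum(times)

print("no overlap  :", predicted_time(app_signature, machine_profile), "s")
print("full overlap:", predicted_time(app_signature, machine_profile, overlap=True), "s")
```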





 2. Scaling Current Apps – computational science
                     approach
• At another level we need to understand what kinds of algorithms and numerical
  methods are used, how they will scale, and whether we need to go to a different
  approach to improve scaling
• P.K. Yeung's DNS code example is on the next slide (detailed talk on this
  Wednesday morning: Yeung, Pekurovsky)
• An example of an algorithm-level modification of the domain decomposition for
  scaling toward a petaflop machine
• One can do this type of analysis for all the computational science fields
  (molecular dynamics, climate/atmospheric models, CFD turbulence, astrophysics,
  QCD, fusion, etc.)
• Maybe some field already has the optimal algorithm, which will scale to a
  petascale machine (this is being very optimistic) and will now be able to solve
  a bigger, higher-resolution problem




                        DNS Problem
• DNS study of turbulence and turbulent mixing
• 3D space is decomposed in 1 dimension among
  processors
• 90% of time spent in 3D FFT routines
• Limited scaling up to 2048 processors, solving problems up to 2048^3
  (N = 2048 grids)
• Number of processors is limited by the linear problem size (N) due to the
  1D decomposition
• Would like to scale to many more processors to study problems of 4096^3 and
  larger, using IBM Blue Gene and future (sub)petascale architectures
• Solution: decompose in 2D – maximum processors N^2 (see the sketch below)
• For 4096^3 one can use a maximum of 16,777,216 processors
• There has to be both the need and the scaling
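A minimal sketch of the processor-count limits implied by the two decompositions (the helper names are illustrative):

```python
# Maximum MPI task counts allowed by 1-D (slab) vs 2-D (pencil) decomposition
# of an N^3 grid, as discussed above.

def max_procs_1d(n):
    # Slab decomposition: at most one x-y plane per task.
    return n

def max_procs_2d(n):
    # Pencil decomposition: the N x N set of pencils can be split in both directions.
    return n * n

for n in (2048, 4096):
    print(f"N = {n}: 1-D max {max_procs_1d(n):>10,}   2-D max {max_procs_2d(n):>12,}")
# N = 4096 gives 16,777,216 tasks in the 2-D case, matching the slide.
```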



3. Develop Petascale Apps/Algorithms – numerical
               methods approach
• Develop new algorithms/numerics from scratch for a particular field, keeping in
  mind that we will now have (say) a 100,000-processor machine
• When the original algorithm/code was implemented, researchers were thinking of
  a few hundred or a thousand processors
• Climate models using spectral element methods provide much higher scalability
  due to the lower communication overhead, better cache performance, etc.,
  associated with the fundamental characteristics of spectral element numerical
  methods (Friday morning talk: Amik St-Cyr from NCAR)
• So for the last few years climate researchers have been developing parallel
  codes using these kinds of numerical methods, expecting that petaflop-class
  machines will have a large number of processors



                             Spectral Elements
•   SPECTRAL ELEMENTS: The best approach consists of using quadrilaterals. You can write
    1-, 2-, and 3-D operators on each element as matrix-matrix operations, O(N^3) (2D) for
    tensor forms and O(N^4) (2D) for non-tensor forms (e.g., triangles with a high-order
    basis). It is possible to use a tuned BLAS level-3 call. There is no assembly of the
    matrix; instead, the action of the assembled matrix on a vector is coded. The quadrature
    rules are used in such a way that the mass matrix is diagonal and therefore trivially invertible.

•   FEM: In low order finite-elements, the nice MxM operations are not there, assembly of the
    global matrix is necessary and leads to issues of load balancing: the parallel matrix might be
    distributed differently than the actual element data. Also, the mass matrix is not diagonal
    and its inversion is necessary even in the case of explicit time-stepping

•   (Pseudo) SPECTRAL: The problem with the global spectral transform is that the
    transposition of a (huge) array of data is necessary (an enormous all-to-all type of
    communication). The spectral-element approach uses only nearest-neighbor communication
    (locally); eventually, network bandwidth/contention will limit the scaling. Global
    spectral methods are also constrained to a certain type of domain: periodic domains are
    good candidates. For ocean modelers, it is not possible to use a global pseudo-spectral method.

•   Suppose N unknowns on a sphere (2D) – compared numerically below:
     • Spherical harmonics (the natural global spectral basis on the sphere) cost O(N^(3/2))
       per time step
     • Discrete Fourier transform costs O(N log N) per time step
     • Spectral elements: O(N)
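A minimal sketch of how these asymptotic costs compare for a hypothetical number of unknowns N, ignoring the constant factors that matter in practice:

```python
# Compare the per-time-step cost scalings quoted above for a hypothetical N.
import math

N = 10**6
costs = {
    "spherical harmonics, O(N^(3/2))": N ** 1.5,
    "discrete Fourier,    O(N log N)": N * math.log(N),
    "spectral elements,   O(N)": float(N),
}

base = costs["spectral elements,   O(N)"]
for name, cost in costs.items():
    print(f"{name:32s} ~{cost / base:8.1f}x the spectral-element cost")
```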






   4. New Languages for Petascale Applications

• Can we write codes just as we do today, in Fortran, C, or C++ with MPI, and
  effectively use petascale machines?

• New languages provide the means to write codes as if the machine has a
  shared-memory appearance, to write codes at a much higher level, and to let
  these languages and libraries do the lower-level MPI-type work

• PGAS (Partitioned Global Address Space) languages and compilers are Co-Array
  Fortran, UPC (Unified Parallel C), and Titanium (ask Harkness – Tuesday morning talk)








Summary: Scaling, Scaling, and Scaling

• Balanced scalability in hardware (memory
  performance, interconnect performance, I/O
  performance, CPU performance) – vendors and
  centers’ problem

• Scalability in software – mostly your problem



                              Thank you


				