R2 - Linderman Emerging HPC Arch_ENS_v2

Emerging HPC Architectures: Impact on Applications and …

Dr. Richard W. Linderman
AF Senior Scientist for ACA
Voice: (315) 330-2208


• As the server market drove price-performance improvements that the HPC
  community leveraged over the past decade, the gaming marketplace may now
  deliver 10X-45X improvements.
   – $3,000 3.2 GHz dual Xeon® (25.6 GFLOPS) (baseline system)
   – $399 3.2 GHz PS3® with Cell Broadband Engine® (153 GFLOPS)
   – 6X FLOPS/board, 7.5X cheaper (~45X overall)
• Compared to a 3 GHz dual quad-core server (192 GFLOPS, ~$5K):
   – 0.8X FLOPS/board, 12.5X cheaper implies 10X overall
• Nvidia Tesla 1 TFLOPS cards at $1,500 are 5X the performance and 3.3X
  cheaper, for 17X overall
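The ratios above follow from simple GFLOPS-per-dollar arithmetic. A quick check (prices and peak figures taken from the bullets above; rounding matches the slide):

```python
# Price-performance (GFLOPS per dollar) for the systems quoted above.
def gflops_per_dollar(gflops, price_usd):
    return gflops / price_usd

xeon_dual = gflops_per_dollar(25.6, 3000)   # baseline dual Xeon
xeon_quad = gflops_per_dollar(192, 5000)    # dual quad-core server
ps3       = gflops_per_dollar(153, 399)     # PS3 with Cell BE
tesla     = gflops_per_dollar(1000, 1500)   # Nvidia Tesla 1 TFLOPS card

print(round(ps3 / xeon_dual))    # 45: PS3 vs the original baseline
print(round(ps3 / xeon_quad))    # 10: PS3 vs a dual-quad server
print(round(tesla / xeon_quad))  # 17: Tesla vs a dual-quad server
```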

               Cell Cluster Architecture

• The Cell Cluster has a peak performance of 51.5 TFLOPS from 336 PS3s,
  plus an additional 1.4 TFLOPS from the headnodes of its 14 subclusters.
• Cost: $361K ($257K from HPCMP)
   – PS3s are 37% of the cost
• Price-performance: 147 TFLOPS/$M
• The 24 PS3s in each subcluster contain, in aggregate, 6 GB of memory and
  960 GB of disk. Each dual quad-core Xeon headnode has 32 GB of DRAM and
  4 TB of disk.
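The quoted price-performance figure can be verified directly from the peak and cost numbers above:

```python
# Sanity check of the Cell Cluster price-performance figure above.
peak_tflops = 51.5 + 1.4      # 336 PS3s plus the 14 subcluster headnodes
cost_millions_usd = 0.361     # $361K total system cost
print(round(peak_tflops / cost_millions_usd))  # 147 TFLOPS/$M
```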

         PlayStation3 Fundamentals

• Cell BE® processor (6 of 8 SPEs available)
• 256 MB RDRAM (only), 25.6 GB/sec to RDRAM
• 40 GB hard drive
• Gigabit Ethernet (only)
• 153 GFLOPS single-precision peak (380 TFLOPS/$M)
• ~140 Watts
• Sony Hypervisor, Fedora Core 7 Linux, IBM Cell SDK 3.0

                Key Questions

• Which codes could scale given these constraints?
• Can a hybrid mixture of PS3s and traditional servers mitigate the
  weaknesses of the PS3s alone and still deliver outstanding
  price-performance?
• What level of effort is required to deliver a reasonable percentage of
  the enormous peak performance?
• A case study approach is being taken to explore these questions.

HPCMP OIPT/CSB/HPCAP 2009

Cell Cluster: Early Access to Commodity Multicore

Dr. Richard Linderman, AFRL/RI, Rome, NY

This project provides the HPCMP community with early access to HPC-scale
commodity multicore through a 336-node cluster of PS3 gaming consoles
(53 TF).

Applications leveraging the >10X price-performance advantage:
• Large-scale simulations of neuromorphic computing models (example shown:
  robust recognition of occluded text)
• GOTCHA radar video SAR for wide-area persistent surveillance
• Real-time PCID image enhancement for space situational awareness

10 March 2009
           Image Enhancement Example

•   The Physically-Constrained Iterative Deconvolution (PCID) algorithm is
    a multi-frame blind deconvolution algorithm developed for removing
    image blur caused by atmospheric turbulence.
•   PCID processes sets (tens to hundreds) of blurred image frames into
    highly resolved reconstructed images through iterative multi-frame
    blind deconvolution (MFBD).

         Multicore Optimization Approach

• The focus is primarily on Intel Xeon® processors and the IBM Cell
  Broadband Engine® as found in Playstation3® gaming consoles.
• The optimization and porting work has been conducted on two HPCs: the
  1,280-node JAWS cluster at the Maui High Performance Computing Center
  and the 336-node Cell Cluster at the AFRL HPC center in Rome, NY.
• JAWS features 1,280 Dell nodes, each with two Woodcrest 3 GHz dual-core
  Xeons and 4 GBytes of memory, connected by an InfiniBand fabric.
• The Cell Cluster delivers outstanding price-performance, approaching
  200 teraflops per million dollars.

                     FFT Optimizations

• We were able to reuse and retune a high performance corner-turning
  code and combine it with the 1D FFTs from the Intel IPP library.

• The baseline code used double precision (64 bit IEEE format)
  throughout. However, single precision may suffice.

• The change to single precision yields roughly a 2X performance
  improvement on the Xeon cores, but a tremendous 14X improvement on the
  PS3 cores (which do not fully pipeline double-precision operations).
• More recent results under the SPIRAL system seem likely to further
  push 1D FFT performance toward achieving 50% of peak performance.
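The corner-turn decomposition mentioned above can be sketched as follows. NumPy's 1D FFT stands in for the Intel IPP routines, and the 64x64 frame size is purely illustrative:

```python
import numpy as np

# Sketch of the row-FFT / corner-turn / row-FFT decomposition of a 2D FFT:
# every 1D transform runs along contiguous memory, which is what makes the
# approach fast on cache- and DMA-oriented hardware like the Cell SPEs.
def fft2_via_corner_turn(frame):
    rows = np.fft.fft(frame, axis=1)        # 1D FFTs along contiguous rows
    turned = np.ascontiguousarray(rows.T)   # corner turn (transpose)
    cols = np.fft.fft(turned, axis=1)       # 1D FFTs along the new rows
    return cols.T

frame = np.random.rand(64, 64).astype(np.complex64)  # single precision
# The decomposition agrees with a direct 2D FFT.
assert np.allclose(fft2_via_corner_turn(frame), np.fft.fft2(frame), atol=1e-2)
```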


• The PCID code was successfully ported from a 72-node Cray XD1 computer
  to JAWS and the dual quad-core Xeon headnodes of the Cell Cluster.

• Code profiling revealed that approximately 80% of runtime was spent in
  the FFTW calls.

• The baseline code running FFTW 3.1 was achieving approximately 700
  MFLOPS on the Xeon cores.
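The 80% profile figure bounds what FFT tuning alone can buy, via Amdahl's law. A minimal sketch (the 4X local speedup is an illustrative value, not a measured one):

```python
# Amdahl's-law bound implied by the profile above: with ~80% of runtime in
# FFT calls, speeding up only the FFTs caps the overall application gain.
def amdahl_speedup(fraction, local_speedup):
    return 1.0 / ((1 - fraction) + fraction / local_speedup)

print(round(amdahl_speedup(0.80, 4), 2))    # 2.5: 4x faster FFTs -> 2.5x overall
print(round(amdahl_speedup(0.80, 1e9), 2))  # 5.0: even infinitely fast FFTs cap at 5x
```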

               Information Management
• The need for information management services more robust and flexible
  than point-to-point intercommunications arose far from the HPC realm.

• The work on the PCID algorithm leveraged work at AFRL to greatly
  accelerate the Joint Battlespace Infosphere (JBI) so that it could be
  useful within large HPCs.

• In one test case, multicasting to 4,096 clients took just 278
  microseconds over 10 Gigabit Ethernet.
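A toy in-process publish-subscribe broker illustrates the fan-out pattern that distinguishes this style from point-to-point sends. The real JBI infrastructure is far richer (typed information objects, predicate queries, reliable multicast transport); all names here are illustrative:

```python
from collections import defaultdict

# Minimal in-process publish-subscribe sketch. One publish fans out to all
# subscribers; the broker, not the publisher, knows who is listening.
class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, payload):
        for callback in self.subscribers[topic]:
            callback(payload)

broker = Broker()
received = []
for _ in range(3):                            # three subscribing clients
    broker.subscribe("sar/frames", received.append)
broker.publish("sar/frames", b"frame-0001")   # one publish, three deliveries
print(len(received))  # 3
```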

PCID Multicore Flow of Information

Neuromorphic Computing Architecture Simulation Case

• The driving application behind developing a 50 TF-class cluster was to
  support basic research into alternative neuromorphic computing
  architectures.
• The first of these to be optimized for the PS3 was the
  "Brain-State-in-a-Box" (BSB) model, with a goal of 1M BSBs simulating in
  real time.
• The optimized BSB achieved 18 GFLOPS on each core of the PS3 [6]; across
  the 6 cores, 108 GFLOPS/PS3 was sustained, over 70% of peak.
   – 12 staff-weeks of effort for the first PS3 optimization experience
• Constructing hybrid simulations with BSBs and "Confabulation" models
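A BSB iteration is essentially a dense matrix-vector product followed by a piecewise-linear squash, which is why it maps so well onto the SPEs' SIMD fused multiply-adds. A minimal NumPy sketch (the feedback coefficients and state size are illustrative, not taken from the AFRL code):

```python
import numpy as np

# One Brain-State-in-a-Box (BSB) attractor step: feedback through a dense
# weight matrix, then clamp the state to the hypercube [-1, 1]^n.
def bsb_step(A, x, alpha=0.5, lam=1.0):
    return np.clip(lam * x + alpha * (A @ x), -1.0, 1.0)

rng = np.random.default_rng(0)
n = 256
A = rng.standard_normal((n, n), dtype=np.float32) / np.sqrt(n)  # weights
x = rng.standard_normal(n).astype(np.float32)                   # initial state
for _ in range(50):
    x = bsb_step(A, x)  # the n^2 matmul dominates, all single precision
print(x.min() >= -1.0 and x.max() <= 1.0)  # True: state stays in the box
```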


• The Cell BE Cluster networking infrastructure is more than up to the
  challenge of handling the I/O from even the most I/O-intensive models
  under examination.

                       BlueGene/L & P

•   BlueGene/L is number 2 on the TOP500 list of supercomputers
     – Cores: 212,992
     – Rmax: 478.20 TFLOPS
     – Rpeak: 596.38 TFLOPS
     – Power: 2,329.60 kW

•   BlueGene/P is number 3 on the TOP500 list of supercomputers
     – Cores: 163,840
     – Rmax: 450.30 TFLOPS
     – Rpeak: 557.06 TFLOPS
     – Power: 1,260.00 kW
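A few figures can be derived from the TOP500 numbers above, assuming Rmax/Rpeak are in TFLOPS and power in kW as is conventional for that list:

```python
# Derived efficiency and power-efficiency figures for the two systems above.
def efficiency(rmax_tflops, rpeak_tflops):
    return rmax_tflops / rpeak_tflops

def mflops_per_watt(rmax_tflops, power_kw):
    return rmax_tflops * 1e6 / (power_kw * 1e3)

print(round(efficiency(478.20, 596.38), 2))     # 0.8: BlueGene/L runs ~80% of peak
print(round(mflops_per_watt(478.20, 2329.60)))  # 205 MFLOPS/W for BlueGene/L
print(round(mflops_per_watt(450.30, 1260.00)))  # 357 MFLOPS/W for BlueGene/P
```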

     Cray XT5 Performance: TLM Network Simulation for Mobile Ad-hoc Networks

Problem: highly accurate physical-layer modeling for large-scale networks
in complex environments, much faster than real time.

Current state of the art: 4X slower than real time for 1K radios using
ray tracing.

Solution: parallel discrete event simulation of the Transmission Line
Matrix (TLM) method.

(Example terrain: California foothills, 100 km².)

Approach: use 1,000 compute nodes on the XT5, one CPU core per node due
to memory requirements.

   IMPACT: 4X faster than real time for 10K radios, an improvement of
   several orders of magnitude, AND highly accurate for complex terrain
   (urban and mountainous areas).
              Blue Gene Performance: PHOLD, TLM

• Dramatic improvement for models that communicate large amounts of data
  over the network: the TLM model on Blue Gene computes 10K radios in
  < 1% of real time. (Figure: TLM weak scaling.)
• Why? The Blue Gene hardware design is well balanced to handle
  network-bound models.
• For large remote-event rates, these models simply do not run in a
  reasonable amount of time on Linux cluster computers. (Figure: PHOLD.)
• Cray XT5 performance? Better than Linux clusters, but still inefficient:
  hardware imbalance means much of the CPU horsepower and memory capacity
  is wasted, even for low remote-event rates (25%).
• 10 billion events processed every second.
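For reference, PHOLD is a synthetic PDES benchmark in which every processed event schedules one successor at a randomly chosen logical process. A minimal sequential sketch (the runs above use optimistic *parallel* simulation across many nodes; all parameters here are illustrative):

```python
import heapq
import random

# Sequential PHOLD-style discrete event loop: pop the lowest-timestamp
# event, then schedule one successor at a random logical process (LP).
# A successor at a different LP is a "remote event" in the parallel case.
def phold(num_lps=16, init_events_per_lp=4, horizon=100.0, seed=1):
    rng = random.Random(seed)
    heap = []
    for lp in range(num_lps):
        for _ in range(init_events_per_lp):
            heapq.heappush(heap, (rng.expovariate(1.0), lp))
    processed = 0
    while heap:
        now, lp = heapq.heappop(heap)
        if now >= horizon:          # heap minimum past horizon: done
            break
        processed += 1
        dest = rng.randrange(num_lps)  # random destination LP
        heapq.heappush(heap, (now + rng.expovariate(1.0), dest))
    return processed

print(phold() > 0)  # True: thousands of events processed before the horizon
```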
             ENS/C4I Scalability Comments

Recent work on Blue Gene has proven the effectiveness of hardware support
for "reduce" operations to greatly accelerate Parallel Discrete Event
Simulation (PDES); this looks to be a significant breakthrough!
•       Huge impact on ENS and FMS
•       Motivates the need for Blue Gene for DoD

Reliable multicasting is key for accelerating many codes of interest on
large systems:
•        PGM or other similar capability is lacking
•        Multicasting support is lacking under MPI
•        AFRL pub-sub dissemination is beating MPI in this area

Publish-subscribe-query is being pursued by AFRL as an alternative to MPI
and shared file systems, to better deal with mounting system-complexity
issues.

HPC-backed web services is an emerging concept with great prospects for
speeding technology transition to the Services.
