

									High Performance Computing
        - The Future

         Dr M. Probert
       Autumn Term 2010
• Big Computing
  – Beowulf vs BlueGene
  – RoadRunner and Tianhe-1
  – The Grid
• HPC Languages
  – Java and Fortress
  – Fortran 2003 and Co-Array Fortran
  – UPC
• GPU programming
  – CUDA, OpenCL and HMPP
• New CPUs
  – Low Power HPC, AMD and Intel, Tile64
Big Computing
• Beowulf designs are cheap and popular
   – Likely to become more so
   – Rapid growth in recent years – large part of Top500
   – Enabled by powerful and cheap CPUs and developments
     in network technology (Myrinet, InfiniBand, Quadrics,
     SCALI, etc.)
• Still typically a “fast compute, slow interconnect”
  class machine
   – Challenges to large-scale parallelism
   – Need lots of latency hiding to get good scaling
   – Also a problem in many cluster-based solutions
                 ASCI Projects
• ASCI has produced some startling successes
   – E.g. ASCI BlueGene / L
   – Based on slower (recall 1/S dependence) low heat CPUs
      • 1.5 MW for 65,536 2.8 GFLOPS CPUs
      • c.f. 3.5 MW for 65,536 Equivalent Pentium 3s
      • or 6.4 MW for 65,536 3 GHz Pentium 4s
   – Lower heat hence tighter packing and lower latency.
      • Maybe Intel Atom CPUs in future? Now dual-core and HT
      • 3.2 GFLOPS for 8W => 0.5 MW for same total TFLOPS
   – The IBM BlueGene technology is also being sold in
     smaller units so maybe ASCI is a loss-leader?
   – Do we need a new programming paradigm to handle these
     machines?
N.B. GiB = 1 Gibibyte = 2^30 Bytes
c.f. GB = 1 Gigabyte = 10^9 Bytes
          Early Blue-Gene Results
• Domain decomposed MD with dynamic non-regular domains
                  RoadRunner
• A hybrid architecture
   – dual-core AMD Opteron “front end” with PowerXCell
     “back end” – like an accelerator
   – A big challenge to program
      • XCell has only 256 kB “local store” shared between 8 SPEs
      • Very low-level – coder must take responsibility for the data in
        the local store – must be explicitly loaded/stored from/to main
        memory – SPEs do not talk directly to main memory!
      • Can program with a master-slave task farming model.
      • Can also “chain the SPEs” for stream processing
      • IBM developing tools to make it easier …
      • http://www.lanl.gov/orgs/hpc/roadrunner/pdfs/Woodward.pdf
                   The Cell Processor
    • Designed by IBM/Sony/Toshiba, initially for the
      PlayStation 3.
    • Only had single precision until US Govt got interested ...

Each chip contains
• 1 Power Processor
    Element (PPE)
• 8 Synergistic Processor
    Elements (SPEs)
• Element Interconnect
    Bus (EIB)

Plus substantial built-in
    networking for linking
    to other cells.
               Task Farming on a Chip
• PPE (master)
   – PPE is a 64-bit IBM core with 512 kB on-chip cache (similar to Power Mac
     G5 and Xbox 360 cores)
   – “AltiVec” vector instructions
   – 2 Floating Point units with simultaneous multithreading.
• SPE (slave)
   –   128 128-bit registers
   –   4 Floating Point Units with fused multiply-add
   –   256 kB “Local store” instead of a cache
   –   4 GHz plus – theoretical peak of 32 GFLOPS per SPE
• Floating Point Hardware
   – Original versions only single precision in hardware
        • double precision = 14 GFLOPS Cell total
        • Jack Dongarra et al. got a 3.2 GHz Cell with 8 SPEs to do a 4096
          LINPACK solve using “iterative refinement” to get the equivalent of 100 GFLOPS!
   – PowerXCell (released in 2008) has double precision hardware
        • now get 102 GFLOPS in 8 SPE
                     Tianhe-1
• Another hybrid architecture
  – CPU + GPU
  – Another challenging machine to program
  – Is this headline grabbing or capable of generating
    serious science?
• Many of the Top500 (particularly Top10) are
  hybrid machines
  – Not a trend that is going away soon
  – But we desperately need some common open
    standards to get portability and longevity of codes
          ExaScale Computing
• Currently plans are being drawn up as to
  how to get to ExaScale
  –   10^18 FLOP/s by 2018, extrapolating the exponential growth trend
  –   Power/cooling limitations
  –   Programming methods
  –   Component reliability - MTBF
  –   Parallel scaling challenges
• What science will become possible? How to
  manage the data generated?
                      The “Grid”
• Grid computing has its origins in a 1998 book by Carl
  Kesselman and Ian Foster called “The Grid: Blueprint
  for a New Computing Infrastructure”
• Basic idea – to make the provision of HPC resources
  as ubiquitous as the electrical grid
   – When you turn on an electrical switch you don’t know and
     don’t care where the power has come from – so why not
     do the same with computing power?
   – Hence with so many wasted clock-cycles on modern PCs
     running screen savers, why not harness them for more useful
     work?
                 The Grid Idea
• So anyone with spare computing power could donate
  it (or sell it?) to the Grid
• And anyone needing extra computing power could
  access (or buy?) it from the Grid
• So the Grid is essentially middleware – a broker
  service between supplier and customer
• Currently fashionable – a lot of e-science money is
  being spent on developing Grid technology
  – But who will use it? Does it work? What are the
    advantages? What are the risks?
                Grid Advantages
• The next generation particle accelerators at CERN
  will generate huge datasets – TB/day – which need to
  be stored and analysed
   – Hence CERN is at the forefront of implementing Grid
     technology – cannot store & process such large amounts of
     data – needs to be able to distribute it around the world to
     get local storage and analysis
   – Will put intense strain on networks – need massive upgrade
     in bandwidth to handle this data
   – Sounds like the only practical way of managing the
     volumes of data to be generated – a likely success
   – But is it really “Grid” as originally envisaged?
                    Grid Users
• So particle physicists (inventors of the Web) will be
  big users of the Grid. Who else?
   – Other users with large data sets requiring distributed
     storage and analysis
   – But only where sites are trusted and powerful enough – not
     standard PC types.
• The White Rose Grid?
   – A “distributed computer” – nodes in Sheffield, Leeds and
     York – a mixture of Sun and Linux PCs.
   – But inter-node usage heavily restricted by 10 MB/s inter-
     site Ethernet interconnect!
• National Grid Service
   – Has standard suite of software available for all users
• Distributed computing projects, such as SETI@home,
  etc. are already “Grid-like”
• The TeraGyroid Project
  – Demonstrated at SC2003
  – Joined the USA TeraGrid (supercomputers at Pittsburgh and San
    Diego) with UK supercomputers (HPCx and CSAR)
  – Dedicated trans-Atlantic fibreoptics to make an interactive
    supercomputer to study 10^9 particles.
• Folding@Home
  –   1st computing project ever to sustain 1 PFLOPS (September 2007)
  –   Now running at 5 PFLOPS with 400,000 active machines
  –   Data generated has produced 69 papers
  –   Code comes in x86, GPU and Cell (PS3) variants
  –   MPI parallel for multi-core/cpu since 2006, threads since 2010
HPC Languages
• In mid-2000s there was a lot of effort to repackage
  Java as a HPC language (Java Grande) but:
   –   No IEEE 754 support
   –   Not in MPI or OpenMP standards
   –   Lack of intrinsic mathematical functions
   –   Slow unless converted to native code
   –   Java Grande Forum seems to have lost its way
   –   Initial hopes of a renewal when Java went GPL (end of
       2006) have not materialised …
• SUN therefore created a new HPC language – Fortress
   – Intended to “do for Fortran what Java has done for C”
   – Initially DARPA funded – part of the HPCS project that
     created Chapel (Cray) and X10 (IBM) – now open-
     source …
From a recent presentation by SUN……
           More Cool Fortress Stuff
• Meaning of superscript is context sensitive
   – B=A^T means A to the power of T if A is scalar and T a variable
   – B=A^T means transpose if A is a matrix and T is undeclared.
   – B=A^† means Hermitian transpose if A is a matrix.
• Variables can have metadata, e.g. all variables must have
  specified dimensions (length, time, mass etc)
   – Throws an exception if expression not dimensionally correct.
   – Like Java, exceptions MUST be handled – cannot ignore them
     – enforces good practice.
• ‘for’ loop is intrinsically parallel
   – actually a library function so implementation can be changed!
• v1.0 interpreter (runs in JVM) now available
• Very active despite Oracle takeover …
                    Fortran 2003
•   IEEE exception handling
•   Allocatables in derived types
•   Interoperability with C
•   More OOP:
    – procedure pointers and structure components, structure
      finalization, type extension and inheritance,
•   Access to environment (similar to argc etc)
•   Asynchronous I/O
•   And more …
•   NB Many features are already available in ifort,
    gfortran, etc. …
               Co-Array Fortran
• An extension that allows SPMD within Fortran
   – easier to use than MPI
   – designed for data decomposition
• Example
   REAL, DIMENSION(N)[*] :: X, Y
   X(:) = Y(:)[Q]
   – Additional [] shows that this item is a co-array, with one
     copy on each “image” (process)
   – Second line shows how to copy values from one “memory
     image” to another (c.f. MPI_Send/Recv)
• Available on Cray systems already, adopted by
  ISO Fortran committee in 2005, should appear in
  Fortran 2008, standard ratified Sept 2010
   – available in g95 since October 2008, due in gfortran v4.6
     UPC (Unified Parallel C)
• Based upon C99 with SPMD model
• Can handle either shared or distributed
  memory machines
  – An explicitly parallel execution model
  – Appears as shared address space to programmer
     • any variable can be r/w from any processor but
       physically associated with a single processor
  – Synchronization primitives and a memory
    consistency model
  – Memory management primitives
     GPU Programming - CUDA
• GPU has many inherently parallel features
• nVidia has released (v1.0 Feb 2007) CUDA
   – a standard API for high level languages to access their GPU
     hardware but no double precision hardware available
   – SDK supports PathScale Open64 C compiler
   – third-party wrappers available for Python, .Net and Java
   – v2 (Feb 08) includes support for Windows, Mac and Linux
   – Hardware now has double precision support but not full
     IEEE 754 floating point standard
    – v3 (Jan 2010), targeting the “Fermi” architecture – supports more
      languages and full IEEE 754 with significantly improved DP performance
   – CUDA Fortran now available from Portland Group as well
     as an “Accelerator” model – vendor specific!
                       OpenCL
• A language for data and task parallel computing
  using CPUs and GPUs
   –   Created by Apple and based on C99
   –   released as open standard in June 2008
   –   v1.0 spec in Dec 08
   –   Now built into MacOS 10.6 (“Snow Leopard”)
   –   Works on NVidia, AMD, IBM, S3, etc
   –   Platform independent cf. OpenGL
• Low level – even more so than CUDA – but
  device independent and an open standard ...
            Other GPU Approaches
• Microsoft has released DirectCompute as a set of
  DirectX APIs to enable GPU usage in Windows
• CUDA and OpenCL have steep learning curve
  – Need to know about device memory etc.
• HMPP is a compiler directive based approach
  – Hence much higher level, more like OpenMP
  – Minimal change to existing codes
  – Supports Fortran, C/C++, etc
  – Backend generates CUDA or OpenCL as required!
  – Proprietary – CAPS – but now linking up with
    Pathscale, with a desire to make it an open standard
  – Similar ideas in the Portland Group ‘accelerator model’...
New CPUs
                Low Power HPC
• PowerXCell in 3 of the top 10 machines in Green500
   – Green500 ranks machines by performance per watt
   – Top machine (Blue Gene/Q) has 1684 MFLOPS/W
   – But now 6 machines use GPUs ...
• Intel Atom
   –   low complexity and low power design
   –   no instruction reordering or speculation
   –   16-stage pipeline
   –   64kB L1 and 512kB L2 cache
   –   available in HT and single/dual core versions
• See also nVidia Tegra, Via Nano, AMD Bobcat, etc.
                      Tile64
• Contains 64 tiles
• Each tile is a CPU + L1 + L2 cache + switch
• Can snoop cache on other tiles
• 4 memory controllers
• Version 2 (700 MHz) takes 20 W total
• Each core can run separate O/S and be separately
  powered down, or can run SMP on all cores at once.
• v1 launched August 2007, v2 (bigger caches and better
  comms) launched September 2008.
• 100 core variant launched in early 2010
                       AMD
• Interlagos
  – 16 core, based on Bulldozer core,
    HyperTransport 3.1, due mid-2011
• Fusion
  – Combination of CPU + GPU + memory
    controller in a single chip
  – Flat memory model programmable via OpenCL
    or DirectCompute
  – First device (Zacate) due early 2011
                       Intel
• More multi-core processors
  – Experimental 48-core “Single Chip Cloud
    Computer” released mid-2010 – not on roadmap
  – Design “can be scaled to 1000 cores” claim
     • Limited by on-chip network
     • Issues with cache coherency
     • 6x4 array of Pentium tiles
• Larrabee
  – CPU+GPU fusion but with x86 instruction set
     • Full cache coherency
     • Much more flexible and easier to program
     • v1 was due early 2010 but not released as it was uncompetitive
       with GPUs – project still in active development
                       Further Reading
•   Grid at http://www.epcc.ed.ac.uk/services/grid-computing/
•   White Rose Grid at http://www.wrg.york.ac.uk
•   National Grid Service at http://www.ngs.ac.uk
•   Folding@Home at http://folding.stanford.edu/

•   Fortress at http://projectfortress.sun.com
•   Fortran2003 at http://www-users.york.ac.uk/~mijp1/COL/fortran_2003.pdf
•   Co-Array Fortran at http://www.co-array.org
•   UPC at http://upc.gwu.edu/

•   Green500 at http://www.green500.org
•   Tile64 at http://www.tilera.com

•   CUDA at http://www.nvidia.com/object/cuda_home.html#
•   OpenCL at http://www.khronos.org/opencl
•   HMPP at http://www.caps-enterprise.com
