					         Cluster Computing

• Introduction: quick history, trends
• Markets for High Performance Computing
• Cluster Anatomy
   – Hardware: compute nodes, networks
   – Software: Schedulers and MPI software
• Programming for HPC platforms
   – Introduction to Parallel Programming
            Back then…


• Cray, Fujitsu: vector and MPP
  machines
• Highly specialised hardware designed
  for high-performance applications
• Proprietary software such as Cray
  SHMEM
               …in the olden days…


Note: money left over
  after building the
  computer was spent on
  attaching lights, a
  lick of paint and
  some additional
  seating.
Things were good
                Moore's Law

• #transistors on a chip doubles every 18 months
• Keeping revenues unchanged requires
  getting customers to want twice the power
  every 18 months, so that they'll keep
  paying the same old price.
• If they want the same old power, there
  have to be twice as many customers --
  because they'll only be paying half the
  price.
               Beowulf Cluster


• Donald Becker at NASA assembled 16
  DX4 PCs with 10 Mb/s Ethernet
  – Rewrote the Ethernet drivers to get channel
    bonding
• Beowulf Project spun off
Experimental days: Furbywulf
                Runs Furby Linux!
                Overclocked Furbies!
                Hot-swappable Furby units!
                     Today

• Not so glamorous
  but…
   • Cheap!
   • Fast!
   • Portable!
   • Open!
            Cluster
         Installations

• Used to be workstations for
  research plus a national
  compute resource
• Now, most UK universities
  have at least one cluster
  installation, usually several
• Average cluster size ~10-32
  processors
What is a Commodity Cluster?

• A machine built from commodity
  components for high performance:
   – Beowulf Cluster
   – Job Farm
   – NOW – Network of Workstations
• One master node and many slaves
• Strong take-up of open-source software
   – Linux, OpenMosix, PVM
       Commodity Clusters

• Network of Workstations
   – Layer of software on workstations. At night,
     batch jobs can run.
• Job Farm
   – Dedicated compute nodes for serial batch work
     only
• Beowulf Cluster
   – Specialised interconnect between nodes for
     efficient parallel communications
       Markets and Applications

• Commercial and academic use of large-
  scale simulations consumes the major part of
  HPC compute resource
• New applications for HPC are continually
  being developed
   –   Finance
   –   Bioinformatics
   –   Protein Unfolding
   –   Brain Modelling
                    ASCI Project
• Comprehensive test ban treaty is
  good news for computer vendors
• “Of all the remarkable things that
  supercomputers will be able to
  accomplish, none will be more
  important than helping to make sure
  that the world is safe from the threat
  of nuclear weapons” –Bill Clinton
       Materials Modelling


• Very rigorous calculations
  on large numbers of
  atoms now possible
• Invent new materials
• Analyse electronic,
  structural, magnetic,
  optical properties
Weather/Global Modelling
Oil Exploration

• Seismic survey analysis
• Oil Drilling modelling for decision
  support
Beowulf Cluster Anatomy
         Typical Cluster Today
• Master Node
  – Dual 2.6 GHz Opteron, 8GB
    RAM
  – 4U
  – Mirrored hot-swap SCSI
  – 4x PCI-X, 2x PCI
  – Triply redundant power and
    cooling
  – Hot-swap power supplies
     Typical Cluster Today


• Compute nodes
  – 2x Dual Core 2.2GHz
    Opteron
  – 4GB RAM
  – 40GB SATA disk
  – Dual Gbit ethernet
  – 1x PCI-X
      Typical Cluster Today

• A serial network for
  management
• A Gbit network for data
• A high-performance
  network (e.g. Myrinet)
  for parallel
  communications
       High Performance
         Interconnects

• Myrinet
  – Market leader
• Infiniband
  – Rapidly expanding market share
• Quadrics
• Gigabit Ethernet??
• 10Gb Ethernet
               Infiniband

• Open Interconnect Standard
• Fairly recent entrant to HPC
• Multi vendor support
• Not just an HPC interconnect!
• Backplane, and external interconnect
• 2.5 or 10 Gb/s (or 30Gb/s)
              Quadrics

• QsNet II
• Fat-Tree network topology
• Higher-End Solution
• Elan4 processor optimised for
  short-message latency (3-4 µs)
• 900MB/s Bandwidth
• Cost?
                 GbE


• Can use different drivers for HPC
  – Latency reduced
  – Bandwidth Increased
• Ultimate commodity interconnect
• Bandwidth limited but Myricom now
  ship GbE
                 Myrinet
        • 2.6 microsec Latency
        • Very High Bandwidth
        • Low host CPU utilisation




[Diagram: network topology connecting 16 groups of 8 hosts]
         Cluster Software

• General requirements
  – We need free or low cost software
• Specifics
  – Fast OS
  – Support parallel programming
  – Organise jobs (load balancing, file
    staging)
  – Monitoring and management tools
Cluster Software Stack
Software stack example
Job Management Systems
    Job Scheduling Systems

●   or Distributed Resource Managers
●   Interface between users and the HPC
    system
●   Take requests for resources (wall clock
    time, #CPUs, memory requirements, licence requirements, ...)
●   Schedule onto resources based on
     ●  Load
     ●  Requirements
     ●  Access
     ●  Priority
            DRM choices

• OpenPBS
• Sun Grid Engine
• LSF
• Alternative “high throughput”
  software based on SETI@home
  principle
              OpenPBS


• Used to be the scheduler of choice for
  academia
• Unsupported, no official development
• Can get patches from variety of
  sources
  – http://bellatrix.pcl.ox.ac.uk/~ben/pbs/
  – http://www-unix.mcs.anl.gov/openpbs/
              OpenPBS

• Three daemons:
  – pbs_server server daemon on master
  – pbs_sched scheduler daemon on master
  – pbs_mom executor & resource monitor on
    compute nodes
• Server accepts commands and
  communicates with the daemons
  – qsub submits a job script (example below)
  – qstat views queues and monitors job status
  – qdel deletes a job
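
As an illustration, a minimal PBS job script might look like the following; the job name, resource requests and program name are placeholders, not part of any particular installation:

#!/bin/sh
# Request two nodes with two processors each and ten minutes of wall clock time
#PBS -N hello_mpi
#PBS -l nodes=2:ppn=2
#PBS -l walltime=00:10:00
# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
# Launch a (hypothetical) MPI program on the allocated processors
mpirun -np 4 ./hello

Submit with "qsub job.sh", check progress with "qstat", and remove with "qdel <job id>".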
             OpenPBS


• Jobs sent to queues
• Each queue comprises a number of
  slots distributed across cluster
• When job finishes queueing, it
  executes!
                         Maui

• Reservations for running jobs – all running jobs have a
  reservation on their resources (processors and memory).
  This prevents oversubscription
• Reservations for pending jobs – top priority jobs
  automatically get reservations. This solves the job-lockout
  problem
• Advanced reservations – reserve processors and
  memory for a specified duration in the future for users,
  groups, and hosts.
            Maui – Backfill

Lower-priority pending jobs can bypass higher-
  priority jobs provided their dispatch would not
  cause the reservations for the bypassed jobs
  to be violated.
      Maui-Simulation/analysis


Workloads can be analyzed to help determine the
 effect of scheduling parameters.
Simulations can be run and analyzed to help tune
  scheduling parameters.
           Grid Engine


• (Codine) bought out by Sun ~2000
• Open source and free, or Sun support for $$
• Strong roadmap
• Good support
• Multi-OS support
Parallel Computing Paradigms


• OpenMP (Open Multi-Processing)
• PVM (Parallel Virtual Machine)
• MPI (Message Passing Interface)
                OpenMP
• Used on shared-memory architectures
• Relatively simple to program
    – (all parallelisation performed by the
      compiler)
•   Directives given by the user (see the sketch below)
•   Incremental parallelisation
•   Poor scalability?
•   Restricted to SMP servers
    (expensive)
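
A minimal sketch of such a directive in C; the array size and the work done in the loop are arbitrary choices here:

#include <omp.h>
#include <stdio.h>

int main(void) {
    double a[1000];
    /* The parallel for directive splits the iterations of this loop
       across the threads in the team; no explicit message passing. */
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        a[i] = 2.0 * i;
    printf("ran with up to %d threads, a[999]=%g\n",
           omp_get_max_threads(), a[999]);
    return 0;
}

Compiled with an OpenMP-aware compiler (e.g. the -fopenmp flag for gcc), the same source also builds and runs serially if the directive is ignored.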
PVM – parallel virtual machine

• Free (Legacy) parallel open-source software
• Suitable for distributed memory architectures
• Supports dynamic process groups
• A little unwieldy, processes tend to get lost
• PVM largely superseded by MPI
                        MPI
• Message Passing Interface
   – A published specification for a set of library
     interface standards (1.2, now 2.0)
   – Can be used from Fortran, C, C++ and others
     (even Java)
   – Many implementations of MPI available from
     competing companies, open-source sites
   – Good scalability
   – But – checkpointing difficult
   – Integration with Job schedulers varies
   – Debugging can be difficult
  You need to rewrite your
program if you want to run in
          parallel!
Serial Program:
#include <stdio.h>
int main(int argc, char** argv) {
    printf("Hello-world\n");
    return 0;
}
                   Hello World

      PROGRAM simple
      include 'mpif.h'
      integer errcode
C Initialise MPI
      call MPI_INIT (errcode)
C Main part of program ....
C Terminate MPI
      call MPI_FINALIZE (errcode)


      end
           Communicator

• A process in the communicator can query
  the communicator with:
• MPI_COMM_RANK (comm, rank): returns in
  rank the rank of the calling process in the
  group associated with the communicator
  comm.
• MPI_COMM_SIZE (comm, size): returns in
  size the number of processes in the
  group associated with the communicator
  comm.
                Parallel Program
#include <stdio.h>
#include "mpi.h"
int main(int argc, char **argv) {
    int rank;
    int size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello-world, I'm rank %d; Size is %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
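
With most MPI implementations this kind of program is compiled through the mpicc wrapper and launched with mpirun, e.g. mpirun -np 4 ./hello to start four processes; the exact launcher and its options vary between implementations and schedulers.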
              Single Transfer

     Pingpong
• Single message
  transferred between
  hosts (sketch below)
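
A minimal sketch of the pingpong pattern between ranks 0 and 1 (run with at least two processes); the message size and the use of MPI_Wtime for timing are illustrative choices:

#include <mpi.h>
#include <stdio.h>

#define N 1024  /* message size in bytes (arbitrary) */

int main(int argc, char **argv) {
    char buf[N] = {0};
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* rank 0 sends the message and waits for the echo */
        double t0 = MPI_Wtime();
        MPI_Send(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("round trip for %d bytes: %g s\n", N, MPI_Wtime() - t0);
    } else if (rank == 1) {
        /* rank 1 receives and echoes the message back */
        MPI_Recv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}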
           Single Transfer

Pingping
            Parallel Transfer

     Sendrecv
• Periodic communication
  chain (sketch below)
• Same as pingping
  for 2 procs
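
A sketch of that chain with MPI_Sendrecv, each process passing its rank to its right-hand neighbour and receiving from its left; the payload is arbitrary:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, recvd;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;          /* neighbour to the right in the ring */
    int left  = (rank - 1 + size) % size;   /* neighbour to the left */
    /* Send our rank to the right, receive our left neighbour's rank */
    MPI_Sendrecv(&rank, 1, MPI_INT, right, 0,
                 &recvd, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d received %d\n", rank, recvd);
    MPI_Finalize();
    return 0;
}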
              Parallel Transfer


  Exchange
• Boundary
  exchanges
        Collective Transfer


• Allgather
   – Each process inputs x bytes and receives
     x*(# processes) bytes (sketch below)
• Alltoall
   – Each process sends x bytes to all others
     and receives x bytes from all others
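
A minimal sketch of MPI_Allgather where each process contributes a single int and every process receives the full set; the contributed values are arbitrary:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int mine = rank * rank;                 /* this process's contribution */
    int *all = malloc(size * sizeof(int));  /* one slot per process */
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);
    if (rank == 0)
        for (int i = 0; i < size; i++)
            printf("from rank %d: %d\n", i, all[i]);
    free(all);
    MPI_Finalize();
    return 0;
}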
Collective Operations
      Alternatives in Parallelising a task


  Serial



Pipelined



Partitioned


  –   Not always possible to equally distribute work
  –   Communication required between workers
  –   Often one worker must wait for another worker…
     Key Concepts for data parallel
             applications
• Load Balancing
   – Each processor should undertake a similar amount
     of computation !
• Communication Bandwidth
   – Communication between processors should be
     minimised
• Communication Latency
   – Small messages should be avoided where possible
• Synchronisation
   – Equal amounts of work must be allocated to each
     processor between synchronisation points
• Amdahl's law (see the sketch below)
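
As a rough illustration of load balancing, work is often block-decomposed so that each rank gets a near-equal share; Amdahl's law (speedup <= 1 / ((1 - P) + P/n) for a parallel fraction P on n processors) then bounds the overall gain. A minimal sketch, with the amount of work chosen arbitrarily:

#include <mpi.h>
#include <stdio.h>

#define N 1000000  /* total work items (arbitrary) */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Block decomposition: spread any remainder over the first ranks
       so no processor gets more than one extra item. */
    int base = N / size, extra = N % size;
    int lo = rank * base + (rank < extra ? rank : extra);
    int hi = lo + base + (rank < extra ? 1 : 0);
    double local = 0.0;
    for (int i = lo; i < hi; i++)
        local += (double)i;   /* stand-in for real per-item work */
    double total;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %g\n", total);
    MPI_Finalize();
    return 0;
}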
       MPI Implementations

• MPICH (MPICH-GM) – the default
   – Just files – libraries
   – High startup latency
• LAM MPI
   – Well performing MPI freeware
   – Uses daemons to speed up startup
• SCore
   – Japanese-developed open-source software
   – Advanced…
               SCore

• Gigabit Drivers (low latency)
• Gang Scheduling for interactive use
• Compiler wrappers for easy multi-
  compiler usage
• Network transparency
• MPI job checkpointing support
        Commercial MPIs


• Versions of MPI are ported to use
  communication devices which are
  written for specialised interconnect
  hardware
• E.g. Myrinet: MPICH-MX
                 Summary

• Standards continue to develop for MPI
  and schedulers
• Compute hardware has become highly
  commoditised and pretty cheap
• Network hardware less specialised
• Still very tricky to write good parallel
  software
       Cluster Computing

• Little question of future dominance
  of commodity computing
• Issues remain in management,
  scalability, parallel program
  development

				