


Bull’s HPC solutions
   Page 1
 © Bull 2003
   HPC: a strategic
direction for Bull

A long-term strategic choice for Bull
 Development of Intel® Itanium® based servers
Intel IA-32 and the Itanium® Processor Family (IPF) both
provide a long-term roadmap with predictable performance
Intel-based servers expected to reach 50%
of the server market in 2005
Bull launched its FAME program in early 1998
o   Large-scale enterprise servers built on
    standard technologies
    n  Standard technologies deployable up to the back-office
o   Convergence of commercial and HPC architectures
o   Ultimately single processor source for Bull server products
o   Multi-operating systems
    n  Windows, Linux
    n  Foundation for long-term GCOS 7 and GCOS 8 evolutions
o   Highly flexible servers

FAME: Flexible Architecture for Multiple Environments
HPC: a strategic direction for Bull
 New market conditions with Itanium® 2
 Scientific computing: a strategic challenge for Europe
 o   European research needs powerful computing infrastructure
 o   Partnership is key to making the most of HPC solutions

 Bull’s range of powerful servers
 o   Long-term investment and time-to-market
 o   Technical differentiators
 Bull: the European computer maker
 o   High-end solutions expertise
 o   Direct and close contact with European customers

Bull HPC focus
 Optimized NUMA architecture
 o   Use of standard building blocks
 o   Design of efficient silicon chip to build larger SMPs
 o   Result: best performance/price ratio
 Efficient software stack
 o   Investment in Linux for high-end 64-bit servers
 o   Optimization of middleware like MPI (Message Passing Interface)
 o   Selection of Open Source and commercial software
     for comprehensive software environment
 Optimization capabilities
 o   Computing and storage infrastructure
 o   Applications development
 o   Applications porting and tuning

Bull investment: Competence Centers
 Hardware Development Center                            Les Clayes/France
 Windows Solutions Expert Center                        Les Clayes/France
 o   Porting of multiple operating systems onto FAME
 o   Application Solution Center (ASC) for assistance
     with application and software porting
 Linux Solution Expert Center                           Grenoble/France
 o   Clustering and administration
 o   Open Source
 o   ASC for application porting
 o   Customer assistance for application
     customization and deployment
 High Performance Computing Center                      Grenoble/France
 o   Supported by the Linux Solution expert center
 o   Dedicated benchmarking and tuning teams
 o   Dedicated platforms and clusters
 o   Project and Partner support

Bull NovaScale™

Bull NovaScale™ servers benefit from Bull’s
solid partnership with Intel
  Early access to Itanium® specifications
  Architecture reviews
  for Bull chipset validation
  Bull’s inputs for the E8870 chipset design
  and Intel Quad Building Block (QBB) fabric
  Two Application Solution Centers
  (Paris/Windows .NET and Grenoble/Linux)
  Pioneered introduction of large-scale Itanium®-based servers
  o   Demonstrating the first 16-way Itanium® 2 server at Intel Developer
      Forum in Munich (May 2002)

Bull’s contribution to Linux/Open Source
 Support the most important distributions on the market

 Facilitate the deployment of Linux/Open Source
 o   Open Source infrastructure packages (pre-set scenarios)
 o   Optimization of business solutions under Linux
 Contribute to the development of Open Source
 o   Member of ObjectWeb, ATLAS programs, LES, etc.
 Participation in the ATLAS program
 o   Joint Open Source development project with Bull, HP, Intel, IBM,
     NEC, SGI, SuSE enhancing critical features of Itanium® 2-based Linux

FAME technology
for Bull NovaScale servers

Itanium® 2

Bull NovaScale servers
   4-8-12-16-way servers to target HPC servers,
   Database Servers, and Application servers
   o   Single server
   o   Clustering
   o   Server value within all market segments
   o   Complete and homogeneous solution offering

Scale Up                                     Scale Out
- 4, 8, 12, 16 ways                          - 2 to 4 and 16 ways nodes
- SMP                                        - Clustering up to 256 nodes
- Shared memory                              - Fast interconnect
- Shared storage                             - Shared memory clustering
- High availability                          - Storage node


NovaScale HE servers fully leverage Bull’s
FAME architecture based on Itanium®
 Flexible Architecture for Multiple Environments
 (with the FAME Scalability Switch at its heart)
 o   Mainframe disciplines and reliability:
     no single point of failure
 o   Flexible and scalable architecture
     n   Multi-Operating System
     n   Physical Partitioning
     n   4 to 32 way servers
 o   Dense design for improved TCO
     n   Use of industry-standard building blocks
     n   Optimized floor space utilization
 o   Built-in management

  A unique architecture to build SMP servers with shared
 memory, low memory latency and well balanced throughput

  The Quad-Brick Block (NovaScale 4040,
  5080 and 5160 building block)
  A cost-effective 4-processor engine with memory included
  Physically, a compact design in a protective enclosure
  Up to 4 x Itanium® 2 processors on a CPU board
  o   1.3 GHz with 3 MB level-3 cache
  o   1.4 GHz with 4 MB level-3 cache
  o   1.5 GHz with 6 MB level-3 cache
  Up to 32 GB memory (→ 64 GB)
  6 GB/s memory bandwidth
  Planned upgradability
  to future Madison+ and Montecito

FSS: FAME Scalability Switch
designed by Bull
 The crux of a large multi-processor
 o   Ensures global memory
     and cache coherence
 o   Optimizes coherence traffic (“snoop filtering”)
 o   Synchronizes, orchestrates, and routes all multiprocessor
     communications (aggregate data rate in excess of 50 GB/s)
 o   Implemented in pairs for performance and availability
 A highly elaborate piece of silicon
 o   18.3 x 18.37 mm, 0.18µ Cu process, 60 M transistors,
     1520 input/output pins
 o   Simultaneous bi-directional interfaces in the GHz range
 o   FSS-to-FSS communication at 2.5 GHz

 NovaScale 5160 characteristics
Hardware
• 17U x 28" density-optimized rack-mountable module including: Platform
  Administration Processor, LCD console with its mouse and keyboard,
  8-port KVM switch
• 4-8-12-16 Intel® Itanium® 2 processors: 1.3 GHz 3MB/L3,
  1.4 GHz 4MB/L3, 1.5 GHz 6MB/L3
• E8870™ chipset with Scalability Port Switch
• 2 GB to 128 GB of DDR200 memory (32 DIMMs)
• I/O board with: 2x USB, 2x serial ports, 1x SVGA video port
  (LCD console), 1x Eth10/100 port, 1x DVD/CD-ROM, 1x LS240 FD
• 11 hot-plug PCI-X slots (5x 133 MHz & 6x 100 MHz),
  (add 11 hot-plug PCI-X slots as an option)
• 4x redundant hot-swap power supplies, 12 (6x 2) redundant hot-swap fans
• 8-slot RAID disks (extension to 16 disks) or 14-slot RAID disks
  (extension to 22 disks)
• System administration and management with built-in Platform
  Administration and Maintenance Software Package
• Operating system: Linux HPC
• Standard warranty: 1-year on-site
[Diagram: 4-8-12-16-way module in a 36U rack, with the Platform
Administration Processor, 15" LCD with keyboard and mouse, KVM
concentrator, and RAID disk shelves called out]
NovaScale 5160: key features for HPC

 Low memory latency
 => excellent NUMA factor

 Well-balanced internal throughput
 o QBB level
 o FSS level
 o I/O level

NovaScale: the “Commodity mainframe”
High-end system with mainframe-class RAS
  No single point of failure configurations
  N+1 redundant power and fans
  Redundant FSS
  Hot-pluggable everything: QBB, IOB, PCI, MP, power, fans …
  Memory protected by ECC and ChipKill mechanisms
  All data paths, including FSS, are ECC protected
  Integrity self test at power-on
  Full isolation between Domains, including power-off
  and power-on operations
  RAID protected system storage
  System ID card including BIOS and all firmware revisions
  Auto calls to support center on programmable thresholds
  … Managed from an independent Platform Management Server

System Management view
 Consistent handling of
 o   Domain partitioning (set of resources:
     processors, memory, I/O, ...)
 o   Storage Area Network
 o   Storage Subsystem
 Hardware Platform Monitoring
 o   Hardware and firmware identification
 o   Hardware status and threshold display
 o   Temperature, Voltage, Fans alarm detection
 o   Hardware fault detection and report
 o   Platform history management
 Maintenance Tools
 o   Test monitoring
 o   Firmware updating
 o   Error log collection
 o   Individual power control of resources

Performance and scalability

Performance and Scalability
 Processor for NovaScale servers: Itanium® 2 « Madison »
 o   SPECint: 1047
 o   SPECfp: 1980
 o   Linpack
     n    HPL: 5.2 Gflops
     n    100x100: 1107 Mflops

     64.1 NovaScale 4040 (4-way)
     209 NovaScale 5160 (16-way)            Scaling factor 3.3

     Linpack HPL: 19 Gflops NovaScale 4040
     Linpack HPL: 71 Gflops NovaScale 5160  Scaling factor 3.7
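The scaling factors quoted above are just the ratios of the 16-way to 4-way results (an ideal SMP would scale by 4.0 with four times the processors). A quick arithmetic check, using only the numbers on this slide:

```python
# Benchmark results quoted on this slide (the 64.1/209 figures are not
# labelled with a metric name on the slide, so they are used as-is).
specfp_4040 = 64.1     # NovaScale 4040, 4-way
specfp_5160 = 209.0    # NovaScale 5160, 16-way
hpl_4040 = 19.0        # Linpack HPL Gflops, NovaScale 4040
hpl_5160 = 71.0        # Linpack HPL Gflops, NovaScale 5160

# Scaling factor = 16-way result / 4-way result.
scale_rate = specfp_5160 / specfp_4040   # ~3.26, shown as 3.3 on the slide
scale_hpl = hpl_5160 / hpl_4040          # ~3.74, shown as 3.7 on the slide

print(round(scale_rate, 1), round(scale_hpl, 1))  # → 3.3 3.7
```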

  Bull’s HPC offer
Clustering Solutions

HPC clustering solutions
 Cluster architecture
 o   Architecture key points
       w Compute node selection
       w High speed interconnect and Message Passing
       w Data storage and Global File System
 o   Operating key points
       w Development environment
       w Resource management
       w Cluster management

 Building a 2 TB/TF configuration
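A rough sizing sketch of what a "2 TB per Tflop" target implies, assuming the per-node figures quoted elsewhere in this deck (about 71 Gflops Linpack HPL and up to 128 GB of memory for a 16-way NovaScale 5160); this is an illustration, not a stated Bull configuration:

```python
import math

# Hypothetical sizing: 1 Tflop of sustained Linpack with 2 TB of memory.
target_tflops = 1.0
target_memory_tb = 2.0 * target_tflops      # the "2 TB per TF" ratio

hpl_per_node_gflops = 71.0                  # NovaScale 5160 (16-way) HPL
max_memory_per_node_gb = 128.0              # NovaScale 5160 maximum memory

# Nodes needed to reach the compute target.
nodes = math.ceil(target_tflops * 1000.0 / hpl_per_node_gflops)

# Memory each node would have to carry to also meet the 2 TB target.
memory_per_node_gb = target_memory_tb * 1024.0 / nodes

print(nodes, round(memory_per_node_gb))  # → 15 137
```

Note that 137 GB per node exceeds the 128 GB maximum, so a real configuration would add a node or shift memory to I/O nodes; the point is that the memory target, not the flops target, can drive the node count.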

 HPC infrastructure schema
 [Diagram: compute nodes on a COMPUTE NET; compile & admin service nodes
 and a portal secure-access node on an ADMIN NET; I/O nodes and I/O
 metadata servers on an IO SAN; a library (LIB); and PAM with power
 control on a serial & admin network]

Architecture key points
 Node characteristics: processor performance,
 packaging density, cooling
 Master the complexity: run very large
 applications while limiting the number of systems
     w Limited concurrent access to global services
     w Limited cabling and switch connectivity
     w Limited administration costs
 Run a large set of applications
     w    Evaluate the trade-off between applications and resource needs
     w    Memory-bound or I/O-bound applications
     w    Fine-grained domain decomposition
     w    Critical performance latency dependencies
     w    Legacy applications (OpenMP, …)

 Integration into the existing center infrastructure

Parallel application types
 Fine-grained decomposition per application
 o   Well adapted to small nodes
 o   If node number > 100
      n        => complexity
      n        => expensive interconnect
 Complex decomposition
 o   High data exchange between domains
 o   Medium nodes, 8-16 way
 OpenMP applications or non-parallel applications
 o   A larger node is best
 o   BUT => SMP scalability ratio decreases
 o   BUT => traditionally expensive hardware
 [Diagram: per-node CPU/memory processing with I/O data load, coupled by
 message passing for data exchange and synchronization; the last case
 shows multi-threaded processing within a single node]
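The decomposition trade-off can be made concrete with a toy 1-D domain split (a generic sketch, not Bull code): finer decomposition means more, smaller subdomains, and each extra interior boundary adds a halo exchange per time step.

```python
def partition(n_cells, n_domains):
    """Split a 1-D domain of n_cells into n_domains contiguous subdomains,
    handing the remainder out one extra cell at a time."""
    base, extra = divmod(n_cells, n_domains)
    return [base + (1 if d < extra else 0) for d in range(n_domains)]

def halo_exchanges(n_domains):
    """Boundary (halo) exchanges per time step in a 1-D chain of domains:
    one per interior boundary."""
    return n_domains - 1

# Coarse decomposition: few large subdomains, little message passing.
print(partition(1000, 4), halo_exchanges(4))     # → [250, 250, 250, 250] 3
# Fine decomposition: many small subdomains, far more communication,
# hence the need for an expensive low-latency interconnect.
print(len(partition(1000, 128)), halo_exchanges(128))  # → 128 127
```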

Bull compute node selection
 Fine-grained applications + small configurations
   n Compute node = NovaScale 4040, 4-way Itanium®

However, for
 Heterogeneous parallel applications
 Large configurations
 OpenMP applications
   n Compute node = NovaScale 5160, 16-way

   n Economics: performance/price

        16-way server price = 4 x 4-way server price

   n    Limit the packaging needs and cabling complexity
        w fewer high-speed connections needed
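The cabling argument is easy to quantify (a simple count, assuming one high-speed interconnect adapter per node per rail; the function name is illustrative): for the same processor count, 16-way nodes need a quarter of the links and a far smaller switch than 4-way nodes.

```python
def cluster_ports(total_cpus, cpus_per_node, rails=1):
    """High-speed interconnect ports (and cables) needed for a cluster,
    assuming `rails` interconnect adapters per node."""
    nodes = total_cpus // cpus_per_node
    return nodes * rails

total_cpus = 256
print(cluster_ports(total_cpus, 4))    # 4-way NovaScale 4040 nodes → 64 links
print(cluster_ports(total_cpus, 16))   # 16-way NovaScale 5160 nodes → 16 links
print(cluster_ports(total_cpus, 16, rails=2))  # even dual-rail → 32 links
```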

High-speed interconnect
and Message passing libraries
 Ratio adapted to a powerful 16-way node
 Low latencies, high throughput
 Multi-rail technology
 Robust and scalable
 Associated with an optimized MPI library
 Good price/performance ratio

MPI on NovaScale
 MPI characterization: Bull MPI
 MPI objective: a regular throughput/latency ratio
   n    Based on MPICH 1.2.5
   n    Compliant with MPI 1.2, ROMIO (MPI-IO), MPE
   n    Compatible with Linux 2.5 & 2.6

 NUMA placement for processes
 Objective : optimization of memory bandwidth

Technical details
  NUMA placement of processes
 Spread the load over all the CPUs while keeping CPU/memory affinity

 Key issue of standard kernels: the default behaviour is to keep
 process locality => memory allocation mainly on QBB 1.

 Bull’s optimized NUMA architecture:
 NUMA API for CPU affinity; well-balanced memory allocation
 with regard to CPU placement

 [Diagram: 16 processes laid out over QBB 1-4 (CPUs 1-4 on QBB 1, 5-8 on
 QBB 2, 9-12 on QBB 3, 13-16 on QBB 4); the default placement concentrates
 memory on QBB 1, the optimized placement balances it across all QBBs]
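A minimal sketch of explicit CPU placement on Linux (assuming the Linux CPU-affinity interface exposed to Python as `os.sched_setaffinity`; the helper names are illustrative, not Bull's API). Pinning each process to its own CPU early means its memory pages are faulted in on its own QBB instead of piling up on QBB 1:

```python
import os

CPUS_PER_QBB = 4   # each Quad Building Block carries 4 Itanium 2 processors

def qbb_of_cpu(cpu, cpus_per_qbb=CPUS_PER_QBB):
    """NUMA node (QBB) a CPU belongs to: CPUs 0-3 on QBB 1, 4-7 on QBB 2, ..."""
    return cpu // cpus_per_qbb

def cpu_for_rank(rank, total_cpus=16):
    """One dedicated CPU per process rank on a 16-way node."""
    return rank % total_cpus

def pin_current_process(cpu):
    """Bind the calling process to a single CPU (Linux-only API)."""
    os.sched_setaffinity(0, {cpu})

if __name__ == "__main__":
    for rank in range(16):
        cpu = cpu_for_rank(rank)
        print(f"rank {rank:2d} -> CPU {cpu:2d} (QBB {qbb_of_cpu(cpu) + 1})")
```

With the affinity fixed, first-touch memory allocation lands on the QBB that owns the pinned CPU, which is the balanced placement the slide describes.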

   Software environment

Coverage of Bull’s HPC software stack
                         Solutions    Improvements
Applications                  ü      ISV porting
Cluster Mgt                   ü      Integration of multiple solutions
Parallel File System          ü      NFS parallel, PFS, Lustre Lite
Perf. Tools                   ü      MPI and HW counters extension
Libraries &                   ü      Tuning of libraries,
                                     MPI optimization for SMPs
Interconnect                  ü
Kernel                        ü      Moving to large SMPs and NUMA

 Intel C/C++ and Fortran 95 v7.1
 o   Support of standards
 o   OpenMP and CDIR$ directives support
 o   Optimization
     n    Floating point operation throughput
     n    Interprocedure calls
     n    Prefetch of data and instructions
     n    Speculation
     n    Software pipelining
 GNU compiler
 o   Gcc, g++, g77 or f77

Scientific libraries
  Intel MKL 6.0
  o   LAPACK, BLAS, discrete Fourier transforms, vector math
      functions, vector statistical functions
  o   Sparse implementation for BLAS 1
  o   Threading support
  o   Cache management optimization
  Other libraries
  o   IMSL: optimized version for Itanium® 2
  o   NAG: optimized version for Itanium® 2
  o   Open Source: PETSc, FFTW available
  o   Method developed by UVSQ (University of Versailles - St Quentin)
  o   Limited effort for a specific function
  o   Test on Linpack 100x100

Tools: profiling and analysis, debugging
 o   Principles
     n    Hardware counters (cycles, fp instructions) generate interrupts
     n    Collected by a kernel component
     n    Used by profiling tools
 o   Implementation: supports Itanium® 2 processor counters
 o   Extension: memory events, FSS data
 Other open source tools
 o   vprof and cprof
 o   pfmon
 o   Performeter
 Vampir + Vampir trace
 Debugging : GDB, Intel LDB, TotalView 6.2

Resource Management (1)
Open Source based
Fair share of the resources of the cluster among jobs
  o   Job Management (Sequential/Parallel)
  o   Default FIFO Job Scheduler (Pluggable)
  o   Resource Management
      n    Nodes, memory, processor, time reservation
      n    Synchronization between processes
  o   Advanced High Level Job Scheduler
      n    Better algorithms
      n    Extensive control (prioritization, planning, where to run)
  o   Plugs into PBS
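A default FIFO scheduler like the one described is conceptually simple. This toy sketch (a generic illustration, not the PBS implementation) dispatches jobs strictly in submission order whenever enough nodes are free:

```python
from collections import deque

class FifoScheduler:
    """Toy FIFO job scheduler: jobs are (name, nodes_needed) pairs,
    started strictly in submission order as soon as enough free nodes
    exist (no backfilling past a blocked job)."""
    def __init__(self, total_nodes):
        self.free_nodes = total_nodes
        self.queue = deque()
        self.started = []

    def submit(self, name, nodes_needed):
        self.queue.append((name, nodes_needed))
        self._dispatch()

    def finish(self, name, nodes_freed):
        self.free_nodes += nodes_freed
        self._dispatch()

    def _dispatch(self):
        # Strict FIFO: stop at the first job that does not fit.
        while self.queue and self.queue[0][1] <= self.free_nodes:
            name, nodes = self.queue.popleft()
            self.free_nodes -= nodes
            self.started.append(name)

sched = FifoScheduler(total_nodes=8)
sched.submit("a", 4)
sched.submit("b", 6)   # does not fit: only 4 nodes free, waits
sched.submit("c", 2)   # would fit, but strict FIFO keeps it behind "b"
print(sched.started)   # → ['a']
sched.finish("a", 4)   # 8 nodes free again: "b" starts, then "c" fits too
print(sched.started)   # → ['a', 'b', 'c']
```

The "better algorithms" bullet is precisely about replacing `_dispatch` here: a pluggable high-level scheduler can reorder or backfill instead of blocking on the head of the queue.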

Resource Management (2)
 LSF task manager (loader from Platform)
 o   User interface to job submission
 o   Queue, node-resource-CPU, partition declaration
 o   Check point and restart

 RMS (from Quadrics)
 o   Effective CPU resource allocation
 o   Node selection with regard to load, or exclusive reservation
 o   Well balanced system and interconnect activities

Cluster management
A comprehensive set of tools integrated by Bull
  Platform management
  o   PAM: configuration, error messages, diagnosis, remote
  Deployment: System Installer suite
  o   Automatic management of the node image
  o   Global and consistent database
  Cluster system monitoring: Nagios
  o   Visualization of the cluster
  o   Access to node information
  Performance monitoring: Ganglia
  o   CPU, memory, network


Bull HPC solutions’ advantages
 Itanium® Processor Family: the best HPC processor
   n    Itanium® 2 leads standard processors
   n    Clear roadmap from Intel®
 Flexible and powerful architecture: FAME
   n    Scalability of NovaScale® servers
   n    NovaScale: best price/performance ratio
   n    Large SMPs equivalent to small servers
 Balanced and efficient solutions
   n    Best clustering technologies
   n    Storage performance
 Open and complete software environment
   n    Based on standards
   n    Optimized for the NovaScale servers
   n    Integrated and supported by Bull
 European competence centers
   n    Architecture design
   n    Optimization
Comprehensive solutions
                           Mid & high-range SMPs
                           Mainframe-class RAS
                           Economical thanks to maximum
                           re-use of high-volume parts
                           Cluster solutions for mid
                           or high-end nodes
                           Complete software stack
                           Supported by highly skilled
                           experts and dedicated
                           competence centers
                 …ready to face
                 the challenges of HPC

