

                            An Innovative Approach to
                              Cluster Computing
                                     2004 LACSI Symposium
                            The Los Alamos Computer Science Institute


                             Tutorial Outline (Morning)
    Time            Module                  Outcomes                                 Presenter
     9:00 - 9:05    Tutorial Introduction   Introduction to tutorial process         Greg Watson
     9:05 - 9:30    Clustermatic            An understanding of the overall          Greg Watson
                                            Clustermatic architecture
     9:30 - 10:30   BProc & Beoboot         An understanding of BProc & Beoboot      Erik Hendriks
                                            software; how to configure and install
                                            on a cluster
    10:30 - 11:00   Break
    11:00 - 11:30   BProc & Beoboot         Continued…                               Erik Hendriks
    11:30 - 12:30   LinuxBIOS               Ability to build and install on a node   Ron Minnich
    12:30 - 2:00    Lunch Break


                           Tutorial Outline (Afternoon)
    Time           Module                Outcomes                                 Presenter
    2:00 - 3:00    Filesystems           Familiar with network filesystem         Ron Minnich
                                         options and how to configure for a
                                         cluster
    3:00 - 3:30    Supermon              An introduction to monitoring and how    Matt Sottile
                                         to use Supermon to effectively
                                         monitor a cluster
     3:30 - 4:00   Break
    4:00 - 4:30    BJS                   An understanding of the purpose of       Matt Sottile
                                         BJS and how to configure for a cluster
    4:30 - 5:25    MPI                   How to compile, run and debug an         Matt Sottile
                                         MPI program
    5:25 - 5:30    Feedback              Obtain feedback from participants


                                 Tutorial Introduction

       •   Tutorial is divided into modules
       •   Each module has clear objectives
       •   Modules comprise a short theory component, followed by hands-on exercises
       •   Icons on each slide indicate whether a segment is theory or hands-on

                              Please ask questions at any time!


              Module 1: Overview of Clustermatic
                                   Presenter: Greg Watson

    •   Objective
         •   To provide a brief overview of the Clustermatic architecture
    •   Contents
         •   What is Clustermatic?
         •   Why Use Clustermatic?
         •   Clustermatic Components
         •   Installing Clustermatic
    •   More Information


                            What is Clustermatic?

    •   Clustermatic is a suite of software that completely controls a cluster,
        from the BIOS to the high-level programming environment
    •   Clustermatic is modular
         •   Each component is responsible for a specific set of activities in the cluster
         •   Each component can be used independently of other components

        [Diagram: Clustermatic software stack: Compilers, BJS, MPI, BProc, v9fs]


                          Why Use Clustermatic?
    •   Clustermatic clusters are easy to build, manage and program
         •   A cluster can be installed and operational in a few minutes
     •   The architecture is designed for simplicity, performance and availability
          •   Utilization is maximized by ensuring the machine is always available
    •   Supports machines from 2 to 1024 nodes (and counting)
    •   System administration is no more onerous than for a single
        machine, regardless of the size of the cluster
         •   Upgrade O/S on entire machine with a single command
         •   No need to synchronize node software versions
    •   The entire software suite is GPL open-source


                       Clustermatic Components

    •   LinuxBIOS
         •   Replaces normal BIOS
         •   Improves boot performance and node startup times
          •   Eliminates reliance on proprietary BIOS
          •   No interaction required, important for 100s of nodes





                       Clustermatic Components
     •   Linux
         •   Mature O/S
         •   Demonstrated performance in HPC applications
         •   No proprietary O/S issues
         •   Extensive hardware and network device support





                       Clustermatic Components
     •   V9FS
         •   Avoids problems associated with global mounts
         •   Processes are provided with a private shared filesystem
         •   Namespace exists only for duration of process
         •   Nodes are returned to “pristine” state once process is complete





                       Clustermatic Components
     •   Beoboot
         •   Manages booting cluster nodes
         •   Employs a tree-based boot scheme for fast/scalable booting
         •   Responsible for configuring nodes once they have booted






                       Clustermatic Components
     •   BProc
         •   Manages a single process-space across machine
         •   Responsible for process startup and management
         •   Provides commands for starting processes, copying files to nodes, etc.






                       Clustermatic Components
     •   BJS
         •   BProc Job Scheduler
         •   Enforces policies for allocating jobs to nodes
         •   Nodes are allocated to “pools” which can have different policies







                       Clustermatic Components

     •   Supermon
         •   Provides a system monitoring infrastructure
         •   Provides kernel and hardware status information
         •   Low overhead on compute nodes and interconnect
         •   Extensible protocol based on s-expressions
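Supermon's monitoring data travels as s-expressions, and a tiny reader makes the format concrete. This is a minimal sketch: the tokenization and the sample datum (`cpu`, `user`, `sys`) are illustrative assumptions, not Supermon's actual schema.

```python
# Minimal s-expression reader; sample data below is illustrative,
# not Supermon's actual wire schema.
def read_sexpr(text):
    tokens = text.replace('(', ' ( ').replace(')', ' ) ').split()

    def parse():
        tok = tokens.pop(0)
        if tok == '(':
            node = []
            while tokens[0] != ')':
                node.append(parse())
            tokens.pop(0)  # consume ')'
            return node
        return tok

    return parse()

# read_sexpr("(cpu (user 12) (sys 3))") -> ['cpu', ['user', '12'], ['sys', '3']]
```

Because the format is self-describing nested lists, new sensors can add fields without breaking older readers, which is what makes such a protocol extensible.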







                         Clustermatic Components

     •   MPI
         •   Uses standard MPICH 1.2 (ANL) or LA-MPI (LANL)
         •   Supports Myrinet (GM) and Ethernet (P4) devices
         •   Supports debugging with TotalView






                         Clustermatic Components
     •   Compilers & Debuggers
         •   Commercial and non-commercial compilers available
               •   GNU, Intel, Absoft
         •   Commercial and non-commercial debuggers available
               •   Gdb, TotalView, DDT







                                      Linux Support

     •   Linux Variants
          •   For RedHat Linux
               •   Installed as a series of RPM’s
               •   Supports RH 9 + 2.4.22 kernel
           •   For other Linux distributions
               •   Must be compiled and installed from source


                              Tutorial CD Contents

     •   RPMs for all Clustermatic components
          •   Architectures included for x86, x86_64, athlon, ppc and alpha
           •   Full distribution available on the Clustermatic web site
     •   SRPMs for all Clustermatic components
     •   Miscellaneous RPMs
     •   Full source tree for LinuxBIOS (gzipped tar format)
     •   Source for MPI example programs
     •   Presentation handouts


                          Cluster Hardware Setup

     •   Laptop installed with RH9
          •   Will act as the master node
     •   Two slave nodes
          •   Preloaded with LinuxBIOS and a phase 1 kernel in flash
           •   iTuner M-100 VIA EPIA 533MHz 128MB
     •   8 port 100baseT switch
     •   Total cost (excluding laptop) ~$800


                          Clustermatic Installation

     •    Installation process for RedHat
          •    Log into laptop
               •   Username: root
               •   Password: lci2004
          •    Insert and mount CD-ROM
               # mount /mnt/cdrom
          •    Locate install script
               # cd /mnt/cdrom/LCI
          •    Install Clustermatic
               # ./install_clustermatic
          •    Reboot to load new kernel
               # reboot


                      Module 2: BProc & Beoboot
                                  Presenter: Erik Hendriks

     •   Objective
          •   To introduce BProc and gain a basic understanding of how it works
          •   To introduce Beoboot and understand how it fits together with BProc
          •   To configure and manage a BProc cluster
     •   Contents
          •   Overview of BProc
          •   Overview of Beoboot
          •   Configuring BProc For Your Cluster
          •   Bringing Up BProc
          •   Bringing Up The Nodes
          •   Using the Cluster
          •   Managing a Cluster
          •   Troubleshooting Techniques


                                  BProc Overview

     •   BProc = Beowulf Distributed Process Space
     •   BProc is a Linux kernel modification which provides
          •   A single system image for process control in a cluster
          •   Process migration for creating processes in a cluster
     •   BProc is the foundation for the rest of the Clustermatic software


                                         Process Space

     •     A process space is:
            •   A pool of process IDs
            •   A process tree
                  •   A set of parent/child relationships
            •   Every instance of the Linux kernel has a process space
     •     A distributed process space allows parts of one node’s process
           space to exist on other nodes


                            Distributed Process Space
      •   With a distributed process space, some processes will exist on other nodes
      •   Every remote process has a place holder in the process tree
           •   All remote processes remain visible
      •   Process related system calls (fork, wait, kill, etc.) work identically
          on local and remote processes
      •   Kill works on remote processes
           •   No runaway processes
      •   Ptrace works on remote processes
           •   Strace, gdb and TotalView transparently work on remote processes

          [Diagram: a node with two remote processes]

              Distributed Process Space Example
         Master                                      Slave


     •   The master starts processes on slave nodes
     •   These remote processes remain visible on the master node
     •   Not all processes on the slave are part of the master’s process space


                       Process Creation Example
         Master                                      Slave

                         A                                A

                         B                                 B

     •   Process A migrates to the slave node
     •   Process A calls fork() to create a child – process B
     •   A new place holder for B is created
     •   Once the place holder exists B is allowed to run
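The parent/child bookkeeping that the place holder preserves is ordinary fork()/wait() semantics. A local sketch (plain Linux fork, no BProc involved):

```python
import os

def fork_and_wait():
    # Parent plays the role of "A"; the forked child plays "B".
    pid = os.fork()
    if pid == 0:
        # Child: do its work, then exit with a known status.
        os._exit(7)
    # Parent: the child appears in its process tree, and wait() reaps it.
    # This is the relationship BProc's place holder maintains even when
    # "B" is actually running on a slave node.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```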


                                  BProc in a Cluster

     •   In a BProc cluster, there is a single master and many slaves
     •   Users (including root) only log into the master
     •   The master’s process space is the process space for the cluster
     •   All processes in the cluster are
          •   Created from the master
          •   Visible on the master
          •   Controlled from the master


                                  Process Migration

     •   BProc provides a process migration system to place processes on
         other nodes in the cluster
     •   Process migration on BProc is not
          •   Transparent
          •   Preemptive
               •   A process must call the migration system call in order to move
     •   Process migration on BProc is
          •   Very fast (1.9s to place a 16MB process on 1024 nodes)
          •   Scalable
               •   It can create many copies for the same process (e.g. MPI startup) very efficiently
               •   O(log #copies)
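The O(log #copies) claim follows from a doubling argument: each node that already holds the process image can send it onward, so the number of copies doubles every round. A sketch, assuming uniform links and no contention:

```python
def rounds_to_place(copies: int) -> int:
    # Each round, every node holding the image sends one more copy,
    # so the count doubles: 1, 2, 4, 8, ...
    have, rounds = 1, 0
    while have < copies:
        have, rounds = have * 2, rounds + 1
    return rounds

# 1024 copies take 10 rounds, versus 1023 sends done one at a time.
```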


                               Process Migration

     •   Process migration does preserve
         •   The contents of memory and memory related metadata
         •   CPU State (registers)
         •   Signal handler state
     •   Process migration does not preserve
         •   Shared memory regions
         •   Open files
         •   SysV IPC resources
         •   Just about anything else that isn’t “memory”


                        Running on a Slave Node

     •   BProc is a process management system
         •   All other system calls are handled locally on the slave node
     •   BProc does not impose any extra overhead on non-process related
         system calls
     •   File and Network I/O are always handled locally
         •   Calling open() will not cause contact with the master node
         •   This means network and file I/O are as fast as they can be


                                    BProc Summary

      •   All processes are started from the master with process migration
     •   All processes remain visible on the master
          •   No runaways
      •   Normal UNIX process control works for ALL processes in the cluster
           •   No need for direct interaction with the nodes
     •   There is no need to log into a node to control what is running there
     •   No software is required on the nodes except the BProc slave
     •   ZERO software maintenance on the nodes!
     •   Diskless nodes without NFS root
          •   Reliable nodes



                                  Beoboot Overview

      •   BProc does not provide any mechanism to get a node booted
     •   Beoboot fills this role
          •   Hardware detection and driver loading
          •   Configuration of network hardware
          •   Generic network boot using Linux
          •   Starts the BProc slave daemon
     •   Beoboot also provides the corresponding boot servers and utility
         programs on the front end


                             Booting a Slave Node
         Master                                                     Slave
                                        Request (who am I?)
                                                                         Phase 1
                                     Response (IPs, servers, etc)        Small kernel
                                       Request phase 2 Image         Minimal functionality
                                                  Phase 2 Image

                                                                                Load phase 2 Image
                                                                                   (Using magic)
                                     Request (who am I again?)

                                                                         Phase 2
                                                                      Operational kernel
                                        BProc Slave Connect            Full featured


                        Loading the Phase 2 Image

     •    Two Kernel Monte is a piece of software which will load a new
          Linux kernel replacing one that is already running
     •    This allows you to use Linux as your boot loader!
     •    Using Linux means you can use any network that Linux supports.
            •   There is no PXE BIOS or Etherboot support for Myrinet, Quadrics or Infiniband
           •   “Pink” network boots on Myrinet which allowed us to avoid buying a 1024
               port ethernet network
     •    Currently supports x86 (including AMD64) and Alpha


                              BProc Configuration

     •   Main configuration file
          •   /etc/clustermatic/config
     •   Edit with favorite text editor
          •   Lines consist of comments (starting with #)
          •   Rest are keyword followed by arguments
     •   Specify interface:
          •   interface eth0
          •   eth0 is interface connected to nodes
          •   IP of master node is
          •   Netmask of master node is
          •   Interface will be configured when BProc is started


                              BProc Configuration

     •   Specify range of IP addresses for nodes:
          •   iprange 0
          •   Start assigning IP addresses at node 0
          •   First address is, last is
          •   The size of this range determines the number of nodes in the cluster
     •   Next entries are default libraries to be installed on nodes
           •   Can explicitly specify libraries or extract library information from an
               executable
           •   Need to add entry to install extra libraries
               •   librariesfrombinary /bin/ls /usr/bin/gdb
          •   The bplib command can be used to see libraries that will be loaded


                            BProc Configuration

     •   Next line specifies the name of the phase 2 image
          •   bootfile /var/clustermatic/boot.img
          •   Should be no need to change this
     •   Need to add a line to specify kernel command line
          •   kernelcommandline apm=off console=ttyS0,19200
          •   Turn APM support off (since these nodes don’t have any)
          •   Set console to use ttyS0 and speed to 19200
          •   This is used by beoboot command when building phase 2 image


                            BProc Configuration

     •   Final lines specify ethernet addresses of nodes, examples given
          #node 0 00:50:56:00:00:00
          #node   00:50:56:00:00:01
          •   Needed so node can learn its IP address from master
          •   First 0 is optional, assign this address to node 0
     •   Can automatically determine and add ethernet addresses using the
         nodeadd command
     •   We will use this command later, so no need to change now
     •   Save file and exit from editor
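Pulling the preceding slides together, a complete /etc/clustermatic/config might look like the fragment below. All addresses and MACs are made-up illustrative values, and the exact arguments of the interface and iprange lines are an assumption based on the slides; check the shipped example config before copying.

```
# /etc/clustermatic/config (illustrative values only)
interface eth0 10.0.4.1 255.255.255.0     # interface connected to the nodes
iprange 0 10.0.4.10 10.0.4.41             # node IPs, numbering from node 0
librariesfrombinary /bin/ls /usr/bin/gdb  # extract libraries for the nodes
bootfile /var/clustermatic/boot.img       # phase 2 image
kernelcommandline apm=off console=ttyS0,19200
node 0 00:50:56:00:00:00
node 00:50:56:00:00:01
```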


                               BProc Configuration

     •   Other configuration files
          • Should not need to be changed for this configuration
     •   /etc/clustermatic/config.boot
          •   Specifies PCI devices that are going to be used by the nodes at boot time
          •   Modules are included in phase 1 and phase 2 boot images
          •   By default the node will try all network interfaces it can find
     •   /etc/clustermatic/node_up.conf
          •   Specifies actions to be taken in order to bring a node up
          •   Load modules
          •   Configure network interfaces
          •   Probe for PCI devices
          •   Copy files and special devices out to node


                                 Bringing Up BProc

     •   Check BProc will be started at boot time
          # chkconfig --list clustermatic
     •   Restart master daemon and boot server
          # service bjs stop
          # service clustermatic restart
          # service bjs start
          •   Load the new configuration
          •   BJS uses BProc, so needs to be stopped first
     •   Check interface has been configured correctly
          # ifconfig eth0
          •   Should have IP address we specified in config file


                       Build a Phase 2 Image
     •   Run the beoboot command on the master
         # beoboot -2 -n --plugin mon
         • -2        this is a phase 2 image
         • -n        image will boot over network
         • --plugin  add plugin to the boot image
     •   The following warning messages can be safely ignored
         •   WARNING: Didn’t find a kernel module called gmac.o
         •   WARNING: Didn’t find a kernel module called bmac.o

     •   Check phase 2 image is available
         # ls -l /var/clustermatic/boot.img


                   Bringing Up The First Node
     •   Ensure both nodes are powered off
     •   Run the nodeadd command on the master
         # /usr/lib/beoboot/bin/nodeadd -a -e -n 0 eth0
         • -a     automatically reload daemon
         • -e     write a node number for every node
         • -n 0   start node numbering at 0
         • eth0   interface to listen on for RARP requests
     •   Power on the first node
     •   Once the node boots, nodeadd will display a message
         New MAC: 00:30:48:23:ac:9c
         Sending SIGHUP to beoserv.


                    Bringing Up The Second Node
      •   Power on the second node
     •   In a few seconds you should see another message:
          New MAC: 00:30:48:23:ad:e1
          Sending SIGHUP to beoserv.
     •   Exit nodeadd when second node detected (^C)
     •   At this point, cluster is up and fully operational
     •   Check cluster status
          # bpstat -U

     Node(s)          Status             Mode       User                  Group
     0-1              up                 ---x------ root                  root


                                Using the Cluster
     •   bpsh
          •   Migrates a process to one or more nodes
          •   Process is started on front-end, but is immediately migrated onto nodes
           •   Effect similar to rsh command, but no login is performed and no shell is
               started
          •   I/O forwarding can be controlled
          •   Output can be prefixed with node number
          •   Run date command on all nodes which are up
                # bpsh -a -p date
          •   See other arguments that are available
               # bpsh -h


                                 Using the Cluster
     •   bpcp
         •   Copies files to a node
         •   Files can come from master node, or other nodes
         •   Note that a node only has a ram disk by default
         •   Copy /etc/hosts from master to /tmp/hosts on node 0
               # bpcp /etc/hosts 0:/tmp/hosts
               # bpsh 0 cat /tmp/hosts


                             Managing the Cluster
     •   bpstat
         •   Shows status of nodes
              •   up       node is up and available
              •   down     node is down or can’t be contacted by master
              •   boot     node is coming up (running node_up)
              •   error    an error occurred while the node was booting
         •   Shows owner and group of node
              •   Combined with permissions, determines who can start jobs on the node
         •   Shows permissions of the node
               • ---x------ execute permission for node owner
               • ------x--- execute permission for users in node group
               • ---------x execute permission for other users
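The triads bpstat prints are ordinary rwx bit groups, and the octal modes passed to bpctl -m map directly onto them. A small illustrative formatter (an assumption about the display, not bpstat's actual source):

```python
def render_mode(mode: int) -> str:
    # Render an octal mode as ls-style owner/group/other triads.
    out = []
    for shift in (6, 3, 0):  # owner, group, other
        bits = (mode >> shift) & 0o7
        out += ['r' if bits & 4 else '-',
                'w' if bits & 2 else '-',
                'x' if bits & 1 else '-']
    return ''.join(out)

# render_mode(0o100) -> '--x------'  (execute for node owner only)
# render_mode(0o111) -> '--x--x--x'  (anyone can execute, as set by bpctl -m 111)
```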


                             Managing the Cluster
     •   bpctl
          •   Control a node's status
         •   Reboot node 1 (takes about a minute)
               # bpctl -S 1 -R
         •   Set state of node 0
               # bpctl -S 0 -s groovy
               • Only up, down, boot and error have special meaning, everything else
                 means not down
         •   Set owner of node 0
               # bpctl -S 0 -u nobody
         •   Set permissions of node 0 so anyone can execute a job
               # bpctl -S 0 -m 111


                             Managing the Cluster
     •   bplib
         •   Manage libraries that are loaded on a node
         •   List libraries to be loaded
               # bplib -l
         •   Add a library to the list
               # bplib -a /lib/
         •   Remove a library from the list
               # bplib -d /lib/


                     Troubleshooting Techniques
     •   The tcpdump command can be used to check for node activity
         during and after a node has booted
     •   Connect a cable to serial port on node to check console output for
         errors in boot process
     •   Once node reaches node_up processing, messages will be logged
          in /var/log/clustermatic/node.N (where N is the node number)


                             Module 3: LinuxBIOS
                                  Presenter: Ron Minnich

     •   Objective
          •   To introduce LinuxBIOS
          •   Build and install LinuxBIOS on a cluster node
     •   Contents
          •   Overview
          •   Obtaining LinuxBIOS
          •   Source Tree
          •   Building LinuxBIOS
          •   Installing LinuxBIOS
          •   Booting a Cluster without LinuxBIOS
     •   More Information


                                   LinuxBIOS Overview

     •   Replacement for proprietary BIOS
     •   Based entirely on open source code
     •   Can boot from a variety of devices
     •   Supports a wide range of architectures
          •   Intel P3 & P4
          •   AMD K7 & K8 (Opteron)
          •   PPC
          •   Alpha
     •   Ports available for many systems
          compaq        ibm        lippert      rlx       tyan           advantech    dell         intel
          matsonic      sis        via          asus      digitallogic   irobot       motorola     stpc
          winfast6300   bcm        elitegroup   lanner    nano           supermicro   bitworks     leadtek
          pcchips       supertek   cocom        gigabit   lex            rcn          technoland


                                   Why Use LinuxBIOS?

      •   Proprietary BIOSes are inherently interactive
           •   Major problem when building clusters with 100s or 1000s of nodes
      •   Proprietary BIOSes misconfigure hardware
           •   Impossible to fix
           •   Examples that really happen
                 •   Put in faster memory, but it doesn’t run faster
                 •   Can misconfigure PCI address space - huge problem
      •   Proprietary BIOSes can’t boot over HPC networks
          •   No Myrinet or Quadrics drivers for Phoenix BIOS
     •   LinuxBIOS is FAST
          •   This is the least important thing about LinuxBIOS



                                     Terminology

      •   Bus
         •   Two or more wires used to connect two or more chips
     •   Bridge
         •   A chip that connects two or more busses of the same or different type
     •   Mainboard
         •   Aka motherboard/platform
         •   Carrier for chips that are interconnected via buses and bridges
     •   Target
         •   A particular instance of a mainboard, chips and LinuxBIOS configuration
     •   Payload
         •   Software loaded by LinuxBIOS from non volatile storage into RAM


                                Typical Mainboard
         [Diagram: two CPUs on the front-side bus; Northbridge connecting AGP
         video and DDR RAM; I/O buses (PCI) leading to keyboard, floppy and
         legacy devices]

                             What Is LinuxBIOS?
     •     That question has changed over time
     •     In 1999, at the start of the project, LinuxBIOS was literal
           •   Linux is the BIOS
           •   Hence the name
     •     The key questions are:
           •   Can you learn all about the hardware on the system by asking the hardware
               on the system?
           •   Does the OS know how to do that?
     •     The answer, in 1995 or so on PCs, was “NO” in both cases
     •     OS needed the BIOS to do significant work to get the machine
           ready to use


                What Does The BIOS Do Anyway?
     1.    Make the processor(s) sane
     2.    Make the chipsets sane
     3.    Make the memory work (HARD on newer systems)
     4.    Set up devices so you can talk to them
      5.    Set up interrupts so they go to the right place
     6.    Initialize memory even though you don’t want it to
     7.    Totally useless memory test
           •   I’ve never seen a useful BIOS memory test
     8.    Spin up the disks
     9.    Load primary bootstrap from the right place
     10.   Start up the bootstrap


         Is It Possible With Open-Source Software?

      •   1995: very hard - tightly coded assembly that barely fits into 32KB
      •   1999: pretty easy - the Flash is HUGE (256KB at least)
     •   So the key in 1999 was knowing how to do the startup
     •   Lots of secret knowledge which took a while to work out
     •   Vendors continue to make this hard, some help
          •   AMD is good example of a very helpful vendor
      •   LinuxBIOS community wrote the first-ever open-source code to
           •   Start up Intel and AMD SMPs
           •   Enable L2 cache on the PII
           •   Initialize SDRAM and DDRAM


         Only Really Became Possible In 1999

      •   Huge 512KB Flash parts could hold the huge kernel
          •   Almost 400KB
     •   PCI bus had self-identifying hardware
          •   Old ISA, EISA, etc. were DEAD thank goodness!
     •   SGI Visual Workstation showed you could build x86 systems
         without standard BIOS
     •   Linux learned how to do a lot of configuration, ignoring the BIOS
     •   In summary
          •   The hardware could do it (we thought)
          •   Linux could do it (we thought)


              LinuxBIOS Image In The 512KB Flash
      PC memory map (not to scale), with flash memory at top:
      •   0xffffffff: top of address space
           •   Top 16 bytes: jump to BIOS
           •   Top 64KB: startup code
           •   Rest of flash: Linux kernel
      •   0xfff80000: base of the flash
      •   PCI devices in the middle of the address space
      •   0x40000000: top of main memory
      •   0x00000000: main memory at bottom

               The Basic Load Sequence ca. 1999

     •   Top 16 bytes: jump to top 64K
     •   Top 64K:
          •   Set up hardware for Linux
          •   Copy Linux from FLASH to bottom of memory
          •   Jump to 0x100020 (start of Linux)
     •   Linux: do all the stuff you normally do
          •   2.2: not much, was a problem
          •   2.4: did almost everything
     •   In 1999, Linux did not do all we needed (2.2)
     •   In 2000, 2.4 could do almost as much as we want
     •   The 64K bootstrap ended up doing more than we planned


               What We Thought Linux Would Do

     •   Do ALL the PCI setup
     •   Do ALL the odd processor setup
     •   In fact, do everything: all the “64K” code had to do was copy Linux
         to RAM


              What We Changed (Due To Hardware)
     •    DRAM does not start life operational, like the old days
     •    Turn-on for DRAM is very complex
          •   The single hardest part of LinuxBIOS is DRAM support
     •    To turn on DRAM, you need to turn on chipsets
     •    To turn on chipsets, you need to set up PCI
     •    And, on AMD Athlon SMPs, we need to grab hold of all the CPUs
          (save one) and idle them
     •    So the “64K” chunk ended up doing more


                                 Getting To DRAM
         [Diagram: same mainboard as before; the DDR RAM sits behind the
         Northbridge, which must be configured before memory is usable]


                                 Another Problem

     •   IRQ wiring cannot be determined from hardware!
     •   Botch in PCI results in having to put tables in the BIOS
     •   This is true for all motherboards
     •   So, although PCI hardware is self-identifying, hardware interrupts
         are not
     •   So Linux can’t figure out what interrupt is for what card
     •   LinuxBIOS has to pick up this additional function


                           The PCI Interrupt Botch

                 •   [Diagram: two PCI slots, each with interrupt pins 1-4
                     wired to interrupt lines A-D; the wiring varies by
                     motherboard and cannot be probed from the hardware]
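Because the pin-to-line wiring cannot be probed, the BIOS must carry a routing table for the board. A hypothetical sketch of such a table and its lookup (the slot numbers and the rotate-by-one wiring are illustrative only, not any particular board's wiring):

```python
# Hypothetical IRQ routing table: PCI devices assert pins INTA..INTD,
# but board wiring decides which interrupt line each pin reaches.
# The rotate-by-one-per-slot scheme below is a common convention,
# shown only to illustrate why the OS needs a table from the BIOS.
PINS = ["A", "B", "C", "D"]

def routed_line(slot: int, device_pin: str) -> str:
    """Which interrupt line a device's pin actually reaches on this board."""
    return PINS[(PINS.index(device_pin) + slot) % 4]

# The same device pin lands on different lines in different slots:
print(routed_line(0, "A"))  # A
print(routed_line(1, "A"))  # B
```

Nothing on the bus reports this rotation, so without a BIOS-supplied table Linux cannot tell which interrupt belongs to which card.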


                 What We Changed (Due To Linux)

     •   Linux could not set up a totally empty PCI bus
          •   Needed some minimal configuration
     •   Linux couldn’t find the IRQs
          •   Not really its fault, but…
     •   Linux needed SMP hardware set up “as per BIOS”
     •   Linux needed per-CPU hardware set up “as per BIOS”
     •   Linux needed tables (IRQ, ACPI, etc.) set up “as per BIOS”
     •   Over time, this is changing
          •   Someone has a patent on the “SRAT” ACPI table
          •   SRAT describes hardware
          •   So Linux ignores SRAT, talks to hardware directly


                               As Of 2000/2001

     •   We could boot Linux from flash (quickly)
     •   Linux would find the hardware and the tables ready for it
     •   Linux would be up and running in 3-12 seconds
     •   Problem solved?



     •   Looking at trends, in 1999 we counted on motherboard flash sizes
         doubling every 2 years or so
     •   From 1999 to 2000 the average flash size either shrank or stayed
         the same
     •   Linux continued to grow in size though…
     •   Linux outgrew the existing flash parts, even as they were getting
         smaller
     •   Vendors went to a socket that couldn’t hold a larger replacement
     •   Why did vendors do this?
          •   Everyone wants cheap mainboards!


                           LinuxBIOS Was Too Big

     •   Enter the alternate bootstraps
          •    Etherboot
          •    FILO
          •    Built-in etherboot
          •    Built-in USB loader


                                     The New Picture
     •   Flash (256KB), top to bottom:
          •   Top 16 bytes
          •   Top 64K: LinuxBIOS
          •   Next 64K: Etherboot
          •   Rest: empty
     •   Compact Flash (32MB)
          •   Linux kernel, loaded over the IDE channel by the bootloader



                                 LinuxBIOS Now

     •   The aggregate of the “64K loader”, Etherboot (or FILO), and Linux
         from Compact Flash?
          •   Too confusing
     •   LinuxBIOS now means only the 64K piece, even though it’s not
         Linux any more
     •   On older systems, LinuxBIOS loads Etherboot which loads Linux
         from Compact Flash
          •   Compact Flash read as raw set of blocks
     •   On newer systems, LinuxBIOS loads FILO which loads Linux from
         Compact Flash
          •   Compact Flash treated as ext2 filesystem


                                  Final Question

     •   You’re reflashing 1024 nodes on a cluster and the power fails
     •   You’re now the proud owner of 1024 bricks, right?
     •   Wrong…
          •   Linux NetworX developed fallback BIOS technology


                                 Fallback BIOS
     •   Flash (256KB) layout, top to bottom: Jump to BIOS, Fallback BIOS,
         Normal BIOS, Fallback FILO, Normal FILO
     •   “Jump to BIOS” jumps to fallback BIOS
     •   Fallback BIOS checks conditions
          •   Was the last boot successful?
          •   Do we want to just use fallback anyway?
          •   Does “normal” BIOS look ok?
     •   If things are good, use normal
     •   If things are bad, use fallback
     •   Note there is also a fallback and normal FILO
          •   These load different files from CF
     •   So the normal kernel, FILO, and BIOS can all be hosed and you’re ok
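The selection policy above can be sketched as a small decision function. The check names are hypothetical; only the policy (prefer normal, fall back when anything looks wrong) is from the slide:

```python
# Sketch of the fallback-BIOS selection policy described above.
# Flag names are hypothetical; the policy is from the slide.
def choose_bios(last_boot_ok: bool, force_fallback: bool,
                normal_image_ok: bool) -> str:
    if force_fallback:        # "do we want to just use fallback anyway?"
        return "fallback"
    if not last_boot_ok:      # "was the last boot successful?"
        return "fallback"
    if not normal_image_ok:   # "does 'normal' BIOS look ok?"
        return "fallback"
    return "normal"

print(choose_bios(True, False, True))   # normal
print(choose_bios(False, False, True))  # fallback
```

The key design point is that every path out of the checks lands on a known-good image, which is why a power failure mid-reflash cannot brick the node.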

                     Rules For Upgrading Flash

     •    NEVER replace the fallback BIOS
     •    NEVER replace the fallback FILO
     •    NEVER replace the fallback kernel
     •    Mess up other images at will, because you can always fall back


                           A Last Word On Flash Size

     •   Flash size decreased to 256KB from 1999-2003
          •   Driven by packaging constraints
     •   Newer technology uses address-address multiplexing to pack lots
         of address bits onto 3 address lines - up to 128 MB!
          •   Driven by cell phone and MP3 player demand
     •   So the same small package can support 1,2,4,8… MB
          •   Will need them: kernel + initrd can be 4MB!
     •   This will allow us to realize our original vision
          •   Linux in flash
     •   Etherboot, FILO, etc., are really a hiccup


                                              Source Tree
     •   COPYING
     •   NEWS
     •   ChangeLog
     •   documentation
          •   Not enough!
     •   src
          •   /arch
               •   Architecture specific files, including initial startup code
          •   /boot
               •   Main LinuxBIOS entry code
          •   /config
               •   Configuration for a given platform
          •   /console
               •   Device independent console support
          •   /cpu
               •   Implementation specific files
          •   /devices
               •   Dynamic device allocation routines
          •   /include
               •   Header files
          •   /lib
               •   Generic library functions (atoi)


                                                  Source Tree
          •   /mainboard
               •   Mainboard specific code
          •   /northbridge
               •   Memory and bus interface routines
          •   /pc80
               •   Legacy crap
          •   /pmc
               •   Processor mezzanine cards
          •   /ram
               •   Generic RAM support
          •   /sdram
               •   Synchronous RAM support
          •   /southbridge
               •   Bridge to interface to legacy crap
          •   /stream
               •   Source of payload data
          •   /superio
               •   Chip to emulate legacy crap
     •   targets
          •   Instances of specific platforms
     •   utils
          •   Utility programs


                                        Building LinuxBIOS

     •       For this demonstration, untar source from CDROM
           # mount /mnt/cdrom
           # cd /tmp
           # tar zxvf /mnt/cdrom/LCI/linuxbios/freebios2.tgz
           # cd freebios2
     •       Find target that matches your hardware
              # cd targets/via/epia
     •       Edit configuration file and change any settings
             specific to your board
              •       Should not need to make any changes in this case


                            Building LinuxBIOS

     •   Build the target configuration files
          # cd ../..
          # ./buildtarget via/epia
     •   Now build the ROM image
          # cd via/epia/epia
          # make
     •   Should result in a single file
          •   linuxbios.rom
     •   Copy ROM image onto a node
          # bpcp linuxbios.rom 0:/tmp


                           Installing LinuxBIOS

     •   This will overwrite old BIOS with LinuxBIOS
          •   Prudent to keep a copy of the old BIOS chip
          •   Bad BIOS = useless junk
     •   Build flash utility
          # cd /tmp/freebios2/util/flash_and_burn
          # make
     •   Now flash the ROM image - please do not do this step
          # bpsh 0 ./flash_rom /tmp/linuxbios.rom
     •   Reboot node and make sure it comes up
          # bpctl -S 0 -R
          •   Use BProc troubleshooting techniques if not!


              Booting a Cluster Without LinuxBIOS
     •   Although an important part of Clustermatic, it’s not always possible
         to deploy LinuxBIOS
          •   Requires a detailed understanding of the hardware
          •   May not be available for a particular mainboard
     •   In this situation it is still possible to set up and boot a cluster using
         a combination of DHCP, TFTP and PXE
     •   Dynamic Host Configuration Protocol (DHCP)
          •   Used by node to obtain IP address and bootloader image name
     •   Trivial File Transfer Protocol (TFTP)
          •   Simple protocol to transfer files across an IP network
     •   Preboot Execution Environment (PXE)
          •   BIOS support for network booting


                               Configuring DHCP
     •   Copy configuration file
          •   cp /mnt/cdrom/LCI/pxe/dhcpd.conf /etc
     •   Contains the following entry (one “host” entry for each node)
           ddns-update-style ad-hoc;
           subnet ... netmask ... {
             host node1 {
               hardware ethernet xx:xx:xx:xx:xx:xx;
               filename "pxelinux.0";
             }
           }
     •   Replace xx:xx:xx:xx:xx:xx with MAC address of node
     •   Restart server to load new configuration
          # service dhcpd restart

                            Configuring TFTP
     •   Create directory to hold bootloader
          # mkdir -p /tftpboot
     •   Edit TFTP config file
          •   /etc/xinetd.d/tftp
     •   Enable TFTP
           •   Change
                •   disable = yes
           •   To
                •   disable = no
     •   Restart server
          # service xinetd restart


                             Configuring PXE
     •   Depends on BIOS, enabled through menu
     •   Create correct directories
          # mkdir -p /tftpboot/pxelinux.cfg
     •   Copy bootloader and config file
          # cd /mnt/cdrom/LCI/pxe
          # cp pxelinux.0 /tftpboot/
          # cp default /tftpboot/pxelinux.cfg/
     •   Generate a bootable phase 2 image
          # beoboot -2 -i -o /tftpboot/node --plugin mon
     •   Creates a kernel and initrd image
          •   /tftpboot/node
          •   /tftpboot/node.initrd


                               Booting The Cluster
     •   Run nodeadd to add node to config file
          # /usr/lib/beoboot/bin/nodeadd -a -e eth0
     •   Node can now be powered on
     •   BIOS uses DHCP to obtain IP address and filename
     •   pxelinux.0 will be loaded
     •   pxelinux.0 will in turn load phase 2 image and initrd
     •   Node should boot
          •   Check status using bpstat command
     •   Requires a monitor attached to the node to observe its boot behavior


                             Module 4: Filesystems
                                    Presenter: Ron Minnich

     •   Objective
          •   To show the different kinds of filesystems that can be used with a BProc
              cluster and demonstrate the advantages and disadvantages of each
     •   Contents
          •   Overview
          •   No Local Disk, No Network Filesystem
          •   Local Disks
          •   Global Network Filesystems
               •   NFS
               •   Third Party Filesystems
          •   Private Network Filesystems
               •   V9FS


                           Filesystems Overview

     •   Nodes in a Clustermatic cluster do not require any type of local or
         network filesystem to operate
     •   Jobs that operate with only local data need no other filesystems
     •   Clustermatic can provide a range of different filesystem options


              No Local Disk, No Network Filesystem

     •   Root filesystem is a tmpfs located in system RAM, so size is limited
         to RAM size of nodes
     •   Applications that need an input deck must copy necessary files to
         nodes prior to execution and from nodes after execution
          •   30K input deck can be copied to 1023 nodes in under 2.5 seconds
     •   This can be a very fast option for suitable applications
     •   Removes node dependency on potentially unreliable fileserver


                                      Local Disks

     •   Nodes can be provided with one or more local disks
     •   Disks are automatically mounted by creating an entry in
         /etc/clustermatic/fstab
     •   Solves local space problem, but filesystems are still not shared
     •   Also reduces reliability of nodes since they are now dependent on
         spinning hardware



                                            NFS

     •   Simplest solution to providing a shared filesystem on nodes
     •   Will work in most environments
     •   Nodes are now dependent on availability of NFS server
     •   Master can act as NFS server
          •   Adds extra load
          •   Master may already be loaded if there are a large number of nodes
     •   Better option is to provide a dedicated server
          •   Configuration can be more complex if server is on a different network
          •   May require multiple network adapters in master
     •   Performance is never going to be high


                Configuring Master as NFS Server

     •   Standard Linux NFS configuration on server
     •   Check NFS is enabled at boot time
          # chkconfig --list nfs
          # chkconfig nfs on
     •   Start NFS daemons
          # service nfs start
     •   Add exported filesystem to /etc/exports
          •   /home *(rw,sync,no_root_squash)
     •   Export filesystem
          # exportfs -a


                  Configuring Nodes To Use NFS
     •   Edit /etc/clustermatic/fstab to mount filesystem
         when node boots
          •   MASTER:/home /home nfs nolock 0 0
          •   MASTER will be replaced with IP address of front end
          •   nolock must be used unless portmap is run on each node
          •   /home will be automatically created on node at boot time
     •   Reboot nodes
          # bpctl -S allup -R
     •   When nodes have rebooted, check NFS mount is available
          # bpsh 0-1 df


                           Third Party Filesystems

      •   GPFS
      •   Panasas
      •   Lustre


                                           GPFS

      •   Supports up to 2.4.21 kernel (latest is 2.4.26 or 2.6.5)
     •   Data striping across multiple disks and multiple nodes
     •   Client-side data caching
     •   Large blocksize option for higher efficiencies
     •   Read-ahead and write-behind support
     •   Block level locking supports concurrent access to files
     •   Network Shared Disk Model
          •   Subset of nodes are allocated as storage nodes
          •   Software layer ships I/O requests from application node to storage nodes across
              cluster interconnect
     •   Direct Attached Model
          •   Each node must have direct connection to all disks
          •   Requires Fibre Channel Switch and Storage Area Network disk configuration


                                          Panasas

      •   Latest version supports 2.4.26 kernel
     •   Object Storage Device (OSD)
          •   Intelligent disk drive
          •   Can be directly accessed in parallel
     •   PanFS Client
          •   Object-based installable filesystem
          •   Handles all mounting, namespace operations, file I/O operations
          •   Parallel access to multiple object storage devices
     •   Metadata Director
          •   Separate control path for managing OSDs
          •   “mapping” of directories and files to data objects
          •   Authentication and secure access
     •   Metadata Director and OSD require dedicated proprietary hardware
     •   PanFS Client is open source


                                           Lustre

      •   Lustre Lite supports 2.4.24 kernel
      •   Full Lustre will support 2.6 kernel
      •   Lustre Lite = Lustre minus clustered metadata scalability
     •   All open source
     •   Meta Data Servers (MDSs)
          •   Supports all filesystem namespace operations
          •   Lock manager and concurrency support
          •   Transaction log of metadata operations
          •   Handles failover of metadata servers
     •   Object Storage Targets (OSTs)
          •   Handles actual file I/O operations
          •   Manages storage on Object-Based Disks (OBDs)
          •   Object-Based Disk drivers support normal Linux filesystems
     •   Arbitrary network support through Network Abstraction Layer
     •   MDSs and OSTs can be standard Linux hosts


                                           V9FS

      •   Provides a shared private network filesystem
     •   Shared
          •   All nodes running a parallel process can access the filesystem
     •   Private
           •   Only processes in a single process group can see or access files in the
               filesystem
     •   Mounts exist only for duration of process
          •   Node cleanup is automatic
          •   No “hanging mount” problems
     •   Protocol is lightweight
     •   Pluggable authentication services



     •   Experimental
      •   Can be mounted across a secure channel (e.g. ssh) for additional
          security
     •   1000+ concurrent mounts in 20 seconds
          •   Multiple servers will improve this
     •   Servers can run on cluster nodes or dedicated systems
     •   Filesystem can use cluster interconnect or dedicated network
     •   More information


                Configuring Master as V9FS Server

      •   Start server
           # v9fs_server
           •   Can be started at boot if desired
      •   Create mount point on nodes
           # bpsh 0-1 mkdir /private
           • Can add mkdir command to end of node_up script if desired


                           V9FS Server Commands

      •   Define filesystems to be mounted on the nodes
           # v9fs_addmount /private
      •   List filesystems to be mounted
           # v9fs_lsmount


                               V9FS On The Cluster

      •   Once filesystem mounts have been defined on the server,
          filesystems will be automatically mounted when a process is
          migrated to the node
           # cp /etc/hosts /home
           # bpsh 0-1 ls -l /private
           # bpsh 0 cat /private/hosts
      •   Remove filesystems to be mounted
           # v9fs_rmmount /private
           # bpsh 0-1 ls -l /private


                                           One Note

      •   Note that we ran the file server as root
      •   You can actually run the file server as you
      •   If run as you, there is added security
           •   The server can’t run amok
      •   And subtracted security
           •   We need a better authentication system
           •   Can use ssh, but something tailored to the cluster would be better
      •   Note that the server can chroot for even more safety
           •   Or be told to serve from a file, not a file system
      •   There is tremendous flexibility and capability in this approach



      •   Recall that on 2.4.19 and later there is a /proc entry for each
          process
           •   /proc/mounts
      •   It really is quite private
           •   There is a lot of potential capability here we have not started to use
      •   Still trying to determine need/demand


                                   Why Use V9FS?

      •   You’ve got some wacko library you need to use for one application
      •   You’ve got a giant file which you want to serve as a file system
      •   You’ve got data that you want visible to you only
           •   Original motivation: compartmentation in grids (1996)
      •   You want a mount point but it’s not possible for some reason
      •   You want an encrypted data file system


                                      Wacko Library

      •   Clustermatic systems (intentionally) limit the number of libraries on
          the nodes
           •   Current systems have about 2GB worth of libraries
           •   Putting all these on nodes would take 2GB of memory!
      •   Keeping node configuration consistent is a big task on 1000+ nodes
           •   Need to do rsync, or whatever
           •   Lots of work, lots of time for libraries you don’t need
      •   What if you want some special library available all the time
           •   Painful to ship it out, set up paths, etc., every time
      •   V9FS allows custom mounts to be served from your home directory


                           Giant File As File System

      •   V9FS is a user-level server
           •   i.e. an ordinary program
      •   On Plan 9, there are all sorts of nifty uses of this
           •   Servers for making a tar file look like a read-only file system
           •   Or cpio archive, or whatever…
      •   So, instead of trying to locate something in the middle of a huge tar
          file
           •   Run the server to serve the tar file
           •   Save disk blocks and time


                           Data Visible To You Only

      •   This usage is still very important
      •   Run your own personal server (assuming authentication is fixed) or
          use the global server
      •   Files that you see are not visible to anyone else at all
           •   Even root
      •   On Unix, if you can’t get to the mount point, you can’t see the files
      •   On Linux with private mounts, other people don’t even know the
          mount point exists


          You Want A Mount Point But Can’t Get One

      •   “Please Mr. Sysadmin, sir, can I have another mount point?”
      •   “NO!”
      •   System administrators have enough to do without having to
           •   Modify fstab on all nodes
           •   Modify permissions on a server
           •   And so on…
      •   Just to make your library available on the nodes?
           •   Doubtful
      •   V9FS gives a level of flexibility that you can’t get otherwise


                  Want Encrypted Data File System

      •   This one is really interesting
      •   Crypto file systems are out there in abundance
           •   But they always require lots of “root” involvement to set up
      •   Since V9FS is user-level, you can run one yourself
      •   Set up your own keys, crypto, all your own stuff
      •   Serve a file system out of one big encrypted file
      •   Copy the file elsewhere, leaving it encrypted
           •   Not easily done with existing file systems
      •   So you have a personal, portable, encrypted file system


                                So Why Use V9FS?

      •   Opens up a wealth of new ways to store, access and protect your
          data
      •   Don’t have to bother System Administrators all the time
      •   Can extend the file system name space of a node to suit your needs
      •   Can create a whole file system in one file, and easily move that file
          system around (cp, scp, etc.)
      •   Can do special per-user policy on the file system
           •   Tar or compressed file format
           •   Per-user crypto file system
      •   Provides capabilities you can’t get any other way


                                 Module 5: Supermon
                                    Presenter: Matt Sottile

      •   Objectives
           •   Present an overview of supermon
           •   Demonstrate how to install and use supermon to monitor a cluster
      •   Contents
           •   Overview of Supermon
           •   Starting Supermon
           •   Monitoring the Cluster
      •   More Information


                            Overview of Supermon

      •   Provides monitoring solution for clusters
      •   Capable of high sampling rates
                       Nodes    All data (Hz)   One variable (Hz)
                           1             5200                9400
                          16              120                 200
                         128               13                  25
                        1024                1                   2
      •   Very small memory and computational footprint
           •   Sampling rates are controlled by clients at run-time
      •   Completely extensible without modification
           •   User applications
           •   Kernel modules

                                        Node View

      •   Data sources
           •   Kernel module(s)
           •   User application
      •   Mon daemon
      •   IANA-registered port number
           •   2709


                                    Cluster View

      •   Data sources
           •   Node mon daemons
           •   Other supermons
      •   Supermon daemon
      •   Same port number
           •   2709
      •   Same protocol at every level
           •   Composable, extensible


                                         Data Format

      •   S-expressions
          •   Used in LISP, Scheme, etc.
          •   Very mature
      •   Extensible, composable, ASCII
          •   Very portable
          •   Easily changed to support richer data and structures
          •   Composable
               •   (expr 1) o (expr 2) = ((expr 1) (expr 2))
      •   Fast to parse, low memory and time overhead
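The composability rule above can be shown concretely. A minimal sketch (not Supermon's actual code; the `cpuinfo`/`usertime` names echo the protocol description on the next slide):

```python
# Minimal s-expression compose/parse sketch (not Supermon's code).
# Illustrates the rule: (expr 1) o (expr 2) = ((expr 1) (expr 2)).
def compose(a: str, b: str) -> str:
    """Composing two expressions just wraps them in a new list."""
    return f"({a} {b})"

def parse(text: str):
    """Parse one s-expression into nested Python lists of tokens."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    def walk(it):
        out = []
        for tok in it:
            if tok == "(":
                out.append(walk(it))
            elif tok == ")":
                return out
            else:
                out.append(tok)
        return out
    return walk(iter(tokens))[0]

combined = compose("(cpuinfo (usertime 42))", "(meminfo (free 1024))")
print(combined)  # ((cpuinfo (usertime 42)) (meminfo (free 1024)))
print(parse(combined))
```

Because composition is just wrapping, a supermon can merge the streams of many mon daemons (or other supermons) without understanding their contents, which is what makes the hierarchy in the "Cluster View" slide work.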


                                        Data Protocol
      •   # command
          •   Provides description of what data is provided and how it is structured
          •   Shows how the data is organized in terms of rough categories containing
              specific data variables (e.g. cpuinfo category, usertime variable)
      •   S command
          •   Request actual data
          •   Structure matches that described in # command
      •   R command
          •   Revive clients that disappeared and were restarted
      •   N command
          •   Add new clients


                               User Defined Data

      •   Each node allows user-space programs to push data into mon to
          be sent out on the next sample
      •   Only requirement
          •   Data is arbitrary text
          •   Recommended to be an s-expression
      •   Very simple interface
          •   Uses UNIX domain socket for security


                              Starting Supermon

      •   Start supermon daemon
          # supermon n0 n1 2> /dev/null &
      •   Check output from kernel
          # bpsh 1 cat /proc/sys/supermon/#
          # bpsh 1 cat /proc/sys/supermon/S
      •   Check sensor output from kernel
          # bpsh 1 cat /proc/sys/supermon_sensors_t/#
          # bpsh 1 cat /proc/sys/supermon_sensors_t/S


                            Supermon In Action

      •   Check mon output from a node
           # telnet n1 2709
      •   Check output from supermon daemon
           # telnet localhost 2709


                            Supermon In Action

      •   Read supermon data and display to console
           # supermon_stats [options…]
      •   Create trace file for off-line analysis
           # supermon_tracer [options…]
           • supermon_stats can be used to process trace data off-line


                                   Module 6: BJS
                                   Presenter: Matt Sottile

      •   Objectives
           •   Introduce the BJS scheduler
           •   Configure and submit jobs using BJS
      •   Contents
           •   Overview of BJS
           •   BJS Configuration
           •   Using BJS


                                   Overview of BJS

      •   Designed to cover the needs of most users
      •   Simple, easy to use
      •   Extensible interface for adding policies
      •   Used in production environments
      •   Optimized for use with BProc
      •   Traditional schedulers require O(N) processes, BJS requires O(1)
      •   Schedules and unschedules 1000 processes in 0.1 seconds


                                    BJS Configuration

      •   Nodes are divided into pools, each with a policy
      •   Standard policies
           •   Filler
                 •   Attempts to backfill unused nodes
           •   Shared
                 •   Allows multiple jobs to run on a single node
           •   Simple
                 •   Very simple FIFO scheduling algorithm


                                        Extending BJS

      •   BJS was designed to be extensible
      •   Policies are “plug-ins”
           •   They require coding to the BJS C API
           •   Not hard, but nontrivial
           •   Particularly useful for installation-specific policies
           •   Based on shared-object libraries
      •   A “fair-share” policy is currently in testing at LANL for BJS
           •   Enforce fairness between groups
           •   Enforce fairness between users within a group
           •   Optimal scheduling between user’s own jobs


                                 BJS Configuration
      •   BJS configuration file
           •   /etc/clustermatic/bjs.config
      •   Global configuration options (usually don’t need to be changed)
           •   Location of spool files
                •   spooldir
           •   Location of dynamically loaded policy modules
                •   policypath
           •   Location of UNIX domain socket
                •   socketpath
            •   Location of user accounting log file
                •   acctlog


                                 BJS Configuration

      •   Per-pool configuration options
           •   Defines the default pool
                •   pool default
           •   Name of policy module for this pool (must exist in policydir)
                •   policy filler
           •   Nodes that are in this pool
                •   nodes 0-10000
           •   Maximum duration of a job (wall clock time)
                •   maxsecs 86400
           •   Optional: Users permitted to submit to this pool
                •   users
           •   Optional: Groups permitted to submit to this pool
                •   groups


                             BJS Configuration

      •   Restart BJS daemon to accept changes
          # service bjs restart
      •   Check nodes are available
          # bjsstat

      Pool: default          Nodes (total/up/free): 5/2/2
      ID      User            Command               Requirements


                                    Using BJS
      •   bjssub
          • Submit a request to allocate nodes
          • ONLY runs the command on the front end

          • The command is responsible for executing on nodes
          -p specify node pool
          -n number of nodes to allocate
          -s run time of job (in seconds)
          -i run in interactive mode
          -b run in batch mode (default)
          -D set working directory
          -O redirect command output to file


                                         Using BJS
      •   bjsstat
          •   Show status of node pools
               •   Name of pool
               •   Total number of nodes in pool
               •   Number of operational nodes in pool
               •   Number of free nodes in pool
          •   Lists status of jobs in each pool


                                         Using BJS
      •   bjsctl
          • Terminate a running job
          -r specify ID number of job to terminate


                               Interactive vs Batch
      •   Interactive jobs
           •   Schedule a node or set of nodes for use interactively
           •   bjssub will wait until nodes are available, then run the command
           •   Good during development
           •   Good for single run, short runtime jobs
           •   “Hands-on” interaction with nodes

           # bjssub -p default -n 2 -s 1000 -i bash
           Waiting for interactive job nodes.
           (nodes 0 1)
           Starting interactive job.
           > bpsh $NODES date
           > exit

                               Interactive vs Batch
      •   Batch jobs
           •   Schedule a job to run as soon as requested nodes are available
           •   bjssub will queue the command until nodes are available
           •   Good for long running jobs that require little or no interaction

           # bjssub -p default -n 2 -s 1000 -O output -b date
           # cat date
           Wed Sep 15 15:48:51 MDT 2004


                                How It Works
      •   BJS allocates a set of nodes and changes their permissions
      •   Sets the environment variable NODES to contain the list of
          allocated nodes
      •   This is used by the command to access allocated nodes
      •   For example, say I allocate 2 nodes on the cluster:

           # bjssub -n 2 -s 50 'echo $NODES > stuff'
           # cat stuff
           # bjssub -n 2 -s 50 'bpsh $NODES date'
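
     Since NODES is just a comma-separated list of node numbers (e.g. 0,1),
     a submitted command can split it with standard tools. This minimal
     sketch sets NODES by hand to mimic what bjssub would provide:

     ```shell
     #!/bin/sh
     # NODES is set here by hand; under BJS, bjssub sets it for the command
     NODES="0,1"
     # Split the comma-separated list and visit each allocated node number
     for n in $(echo "$NODES" | tr ',' ' '); do
         echo "would run on node $n"
     done
     ```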


                              BJS and Scripts
      •   Given that BJS provides NODES, scripts are easy to schedule!
      •   Good for scheduling tasks that are composed of subtasks, such as
          copying input data to nodes, running a job, and retrieving outputs

           # cat >
           bpsh $NODES -O /tmp/stuff cat < stuff
           bpsh $NODES -O /tmp/count wc /tmp/stuff
           bpsh -p $NODES cat /tmp/count > the_output
           bpsh $NODES rm /tmp/stuff /tmp/count
           # bjssub -n 2 -s 10
           # cat the_output
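
     The sequence above is normally saved as a small script whose name is
     then handed to bjssub. This sketch only writes such a script to a local
     file (job.sh is an invented name; bpsh and $NODES are meaningful only
     on a Clustermatic front end, so the script is created but not run here):

     ```shell
     #!/bin/sh
     # Write a hypothetical BJS job script; bjssub would run it on the front end
     cat > job.sh <<'EOF'
     #!/bin/sh
     # $NODES is set by bjssub; bpsh fans these commands out to those nodes
     bpsh $NODES -O /tmp/stuff cat < stuff
     bpsh $NODES -O /tmp/count wc /tmp/stuff
     bpsh -p $NODES cat /tmp/count > the_output
     bpsh $NODES rm /tmp/stuff /tmp/count
     EOF
     chmod +x job.sh
     # The script would then be submitted as: bjssub -n 2 -s 10 ./job.sh
     ```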


                                   Module 7: MPI
                                  Presenter: Matt Sottile

      •   Objective
      •   Contents
           •   Overview of MPI
           •   How MPI Works with Clustermatic
           •   Setting Up
           •   Compiling an MPI Program
           •   Running an MPI Program
           •   Debugging an MPI Program
           •   Using MPI with BJS
           •   Staging Data


                                 Overview of MPI

      •   MPI (Message Passing Interface) is a standard API for parallel
          programming
      •   Provides a portable approach to parallel programming on
          shared-memory systems and clusters
      •   Current widely available open-source implementations
           •   MPICH - Argonne
           •   LAM-MPI - Indiana University
           •   OpenMPI - Collaboration between Los Alamos, LAM, and others
      •   All three are available for Clustermatic
           •   Demonstrating MPICH today


                   How MPI Works with Clustermatic

      •   Uses standard interface to start MPI program
            •   mpirun [options…] mpi-program
      •   mpirun loads mpi-prog on the master
      •   The mpi-prog process is migrated to each node
      •   Each mpi-prog copy exchanges connection information via mpirun on
          the master, then starts executing
          (diagram: copies of mpi-prog running on Node 0 and Node 1)

                                       MPI and BJS
      •   As we showed in the BJS section, bjssub allocates nodes and
          sets the NODES environment variable
      •   mpirun looks at this to determine where to run the job
      •   The --np argument is ignored if NODES is set
            •   It uses the scheduled nodes if it detects it has been run by the scheduler
      •   Transparent to the user


                                   Setting Up

      •   Set path to include MPICH
           # export PATH=$PATH:/usr/mpich-p4/bin
      •   Copy sample programs
           # mount /mnt/cdrom
           • If not already mounted
           # cd
           # cp /mnt/cdrom/LCI/mpi/* .


                     Compiling an MPI Program
      •   Use mpicc to supply correct include files and libraries
           # mpicc -g -o hello hello.c
           • Normal gcc options

      •   Compile the other sample programs
           # mpicc -g -o mpi-ping mpi-ping.c
      •   Jacobi requires the math library to compile
           # mpicc -g -o jacobi jacobi.c -lm


                          Running an MPI Program

      •   Run program on two nodes using P4 device
           # mpirun --np 2 --p4 ./hello
           • Must use ./hello since dot is not in path

      •   See what happens on one node
           # mpirun --np 1 --p4 ./hello


                           Seeing What’s Running

      •   Start the mpi-ping program running on both nodes
           # mpirun --np 2 --p4 ./mpi-ping 0 100000 10000
      •   Find process IDs in another window
           # ps -mfx
           3273    ?            S         0:00 \_kdeinit: konsole
           3274    pts/1        S         0:00    \_ /bin/bash
           3339    pts/1        S         0:00        \_ mpirun --np 2 …
           3341    ?            RW        0:43            \_ [mpi-ping]
           3343    ?            SW        0:00            |   \_ [mpi-ping]
           3342    ?            SW        0:38            \_ [mpi-ping]
           3344    ?            SW        0:00                \_ [mpi-ping]
      •   MPICH starts 2 processes on each node
           •   Parent is actual running process

                     Debugging an MPI Program

      •   Start gdb and attach to running process
           # gdb mpi-ping
           (gdb) attach 3341
      •   You are now attached to a remote process from the master!
      •   Print call stack
           (gdb) bt
      •   Select frame (choose frame in main)
           (gdb) frame 14
      •   Print contents of variable
           (gdb) print status


                 Using MPI with BJS (interactive)

      •   Use BJS to allocate 2 nodes for 10 minutes
           # bjssub -s 600 -n 2 -i bash
      •   BJS will start a new (bash) shell
      •   Check nodes have been allocated
           # bjsstat
           Pool: default         Nodes(total/up/free): 5/2/0
           ID     User            Command           Requirements
              0 R root            (interactive)     nodes=2,secs=600
      •   Now run the job on the allocated nodes
           # mpirun --np 2 --p4 ./mpi-ping 0 500 100
      •   When job is complete, exit shell
           # exit

                        Using MPI with BJS (batch)

      •    Usually, long production runs should use batch mode
      •    Use BJS to allocate 2 nodes and invoke mpirun
          # bjssub -s 600 -n 2 mpirun --np 2 --p4 ./mpi-ping 0 1000 5
      •    Check nodes have been allocated
            # bjsstat
            Pool: default           Nodes(total/up/free): 5/2/0
            ID     User              Command           Requirements
               1 R root              mpirun --np 2     nodes=2,secs=600
      •    Remove job from queue
            # bjsctl -r 1


                     Programs That Work With Data

      •    Programs running on nodes can read and write data via NFS
            •   This can be very slow if large amounts of data are involved
            •   Can also be a problem if many processes are accessing NFS simultaneously
      •    Copy Jacobi input data to /home
            # cp jacobi.input /home
      •    Run the Jacobi program using data obtained via NFS
            # bjssub -n 2 -s 10 mpirun --np 2 --p4 ./jacobi /home
      •    Once run is complete, output files should be in /home
            # ls -l /home/output*


                         Staging Data To Nodes

      •   To overcome problems using NFS, data can be copied out to the
          nodes first
           # bpcp jacobi.input 0:/tmp/jacobi.input
           # bpcp jacobi.input 1:/tmp/jacobi.input
      •   Now run the Jacobi program
           # bjssub -n 2 -s 10 mpirun --np 2 --p4 ./jacobi /tmp
      •   Copy staged output back to master
           # bpcp 0:/tmp/output_0.dat .
           # bpcp 1:/tmp/output_1.dat .


                  Staging Data To Nodes (cont…)

      •   In reality, this would be scripted as follows

           bpsh $NODES -O /tmp/jacobi.input cat < jacobi.input
           mpirun --np 2 --p4 ./jacobi /tmp
           for i in `echo $NODES | tr ',' ' '`; do
               bpsh $i cat /tmp/output.dat > output_$i.dat
           done
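
     The shape of the gather loop can be checked outside a cluster by
     stubbing bpsh. The stub below simply drops the node argument and runs
     the command locally (an assumption made for illustration only; it is
     not how the real bpsh behaves):

     ```shell
     #!/bin/sh
     # Stub bpsh so the loop can run without a cluster: ignore the node number
     bpsh() { shift; "$@"; }
     NODES="0,1"                      # as bjssub would set it
     echo "sample result" > /tmp/output.dat
     # Same loop shape as the staging script: one output file per node
     for i in $(echo "$NODES" | tr ',' ' '); do
         bpsh "$i" cat /tmp/output.dat > "output_$i.dat"
     done
     ls output_*.dat
     ```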


      •   Please complete a feedback form now and collect a free
          Clustermatic CD
      •   Either leave forms on table, or return by post (address on form)
      •   Your feedback allows us to improve this tutorial

                  Thanks for attending the Clustermatic Tutorial!
                          We hope you found it valuable


