
                                     CLUSTERMATIC
                            An Innovative Approach
                                      to
                              Cluster Computing
                                     2004 LACSI Symposium
                            The Los Alamos Computer Science Institute




                                                                                          LA-UR-03-8015




                             Tutorial Outline (Morning)
    Time            Module                  Outcomes                                 Presenter
     9:00 - 9:05    Tutorial Introduction   Introduction to tutorial process         Greg Watson
     9:05 - 9:30    Clustermatic            An understanding of the overall          Greg Watson
                                            Clustermatic architecture
     9:30 - 10:30   BProc & Beoboot         An understanding of BProc & Beoboot      Erik Hendriks
                                            software; how to configure and install
                                            on a cluster
    10:30 - 11:00   Break
    11:00 - 11:30   BProc & Beoboot         Continued…                               Erik Hendriks
    11:30 - 12:30   LinuxBIOS               Ability to build and install on a node   Ron Minnich
    12:30 - 2:00    Lunch Break




2




                           Tutorial Outline (Afternoon)
    Time           Module                Outcomes                                 Presenter
    2:00 - 3:00    Filesystems           Familiar with network filesystem         Ron Minnich
                                         options and how to configure for a
                                         cluster
    3:00 - 3:30    Supermon              An introduction to monitoring and how   Matt Sottile
                                         to use Supermon to effectively
                                         monitor a cluster
     3:30 - 4:00   Break
    4:00 - 4:30    BJS                   An understanding of the purpose of       Matt Sottile
                                         BJS and how to configure for a cluster
    4:30 - 5:25    MPI                   How to compile, run and debug an         Matt Sottile
                                         MPI program
    5:25 - 5:30    Feedback              Obtain feedback from participants



3




                                 Tutorial Introduction

       •   Tutorial is divided into modules
       •   Each module has clear objectives
       •   Modules comprise a short theory component, followed by hands-on exercises
       •   Icons on the slides indicate theory and hands-on segments




                              Please ask questions at any time!




4




              Module 1: Overview of Clustermatic
                                   Presenter: Greg Watson

    •   Objective
         •   To provide a brief overview of the Clustermatic architecture
    •   Contents
         •   What is Clustermatic?
         •   Why Use Clustermatic?
         •   Clustermatic Components
         •   Installing Clustermatic
    •   More Information
         •   http://www.clustermatic.org




5




                            What is Clustermatic?

    •   Clustermatic is a suite of software that completely controls a cluster
        from the BIOS to the high level programming environment
    •   Clustermatic is modular
          •   Each component is responsible for a specific set of activities in the cluster
          •   Each component can be used independently of other components

              [Architecture diagram: Users at the top; Compilers/Debuggers, BJS, Supermon
              and MPI above BProc, Beoboot and v9fs; all running on Linux, which sits on
              LinuxBIOS]




6




                          Why Use Clustermatic?
    •   Clustermatic clusters are easy to build, manage and program
         •   A cluster can be installed and operational in a few minutes
    •   The architecture is designed for simplicity, performance and
        reliability
          •   Utilization is maximized by ensuring the machine is always available
    •   Supports machines from 2 to 1024 nodes (and counting)
    •   System administration is no more onerous than for a single
        machine, regardless of the size of the cluster
         •   Upgrade O/S on entire machine with a single command
         •   No need to synchronize node software versions
    •   The entire software suite is GPL open-source


7




                       Clustermatic Components

    •   LinuxBIOS
         •   Replaces normal BIOS
         •   Improves boot performance and node startup times
          •   Eliminates reliance on proprietary BIOS
          •   No interaction required, important for 100s of nodes






8




                       Clustermatic Components
     •   Linux
         •   Mature O/S
         •   Demonstrated performance in HPC applications
         •   No proprietary O/S issues
         •   Extensive hardware and network device support






9




                       Clustermatic Components
     •   V9FS
         •   Avoids problems associated with global mounts
         •   Processes are provided with a private shared filesystem
         •   Namespace exists only for duration of process
         •   Nodes are returned to “pristine” state once process is complete






10




                       Clustermatic Components
     •   Beoboot
         •   Manages booting cluster nodes
         •   Employs a tree-based boot scheme for fast/scalable booting
         •   Responsible for configuring nodes once they have booted








11




                       Clustermatic Components
     •   BProc
         •   Manages a single process-space across machine
         •   Responsible for process startup and management
         •   Provides commands for starting processes, copying files to nodes, etc.







12




                       Clustermatic Components
     •   BJS
         •   BProc Job Scheduler
         •   Enforces policies for allocating jobs to nodes
         •   Nodes are allocated to “pools” which can have different policies








13




                       Clustermatic Components

     •   Supermon
         •   Provides a system monitoring infrastructure
         •   Provides kernel and hardware status information
         •   Low overhead on compute nodes and interconnect
         •   Extensible protocol based on s-expressions






14




                         Clustermatic Components

     •   MPI
         •   Uses standard MPICH 1.2 (ANL) or LA-MPI (LANL)
         •   Supports Myrinet (GM) and Ethernet (P4) devices
         •   Supports debugging with TotalView







15




                         Clustermatic Components
     •   Compilers & Debuggers
         •   Commercial and non-commercial compilers available
               •   GNU, Intel, Absoft
         •   Commercial and non-commercial debuggers available
               •   Gdb, TotalView, DDT







16




                                      Linux Support

      •   Linux Variants
           •   For RedHat Linux
                •   Installed as a series of RPMs (see the example below)
                •   Supports RH 9 + 2.4.22 kernel
           •   For other Linux distributions
                •   Must be compiled and installed from source
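      •   On RedHat, the install script on the tutorial CD (used later in this module)
          essentially installs these RPMs for you; installing them by hand would look
          roughly like the following (the package glob is illustrative, not the exact
          package names)
                # rpm -ivh clustermatic-*.rpm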




17




                              Tutorial CD Contents

     •   RPMs for all Clustermatic components
          •   Architectures included for x86, x86_64, athlon, ppc and alpha
          •   Full distribution available on Clustermatic web site (www.clustermatic.org)
     •   SRPMs for all Clustermatic components
     •   Miscellaneous RPMs
     •   Full source tree for LinuxBIOS (gzipped tar format)
     •   Source for MPI example programs
     •   Presentation handouts




18




                          Cluster Hardware Setup

     •   Laptop installed with RH9
          •   Will act as the master node
     •   Two slave nodes
          •   Preloaded with LinuxBIOS and a phase 1 kernel in flash
          •   iTuner M-100 VIA EPIA 533MHz 128MB
     •   8 port 100baseT switch
     •   Total cost (excluding laptop) ~$800




19




                          Clustermatic Installation

     •    Installation process for RedHat
          •    Log into laptop
               •   Username: root
               •   Password: lci2004
          •    Insert and mount CD-ROM
               # mount /mnt/cdrom
          •    Locate install script
               # cd /mnt/cdrom/LCI
          •    Install Clustermatic
               # ./install_clustermatic
          •    Reboot to load new kernel
               # reboot


20




                      Module 2: BProc & Beoboot
                                  Presenter: Erik Hendriks

     •   Objective
          •   To introduce BProc and gain a basic understanding of how it works
          •   To introduce Beoboot and understand how it fits together with BProc
          •   To configure and manage a BProc cluster
     •   Contents
          •   Overview of BProc
          •   Overview of Beoboot
          •   Configuring BProc For Your Cluster
          •   Bringing Up BProc
          •   Bringing Up The Nodes
          •   Using the Cluster
          •   Managing a Cluster
          •   Troubleshooting Techniques

21




                                  BProc Overview

     •   BProc = Beowulf Distributed Process Space
     •   BProc is a Linux kernel modification which provides
          •   A single system image for process control in a cluster
          •   Process migration for creating processes in a cluster
     •   BProc is the foundation for the rest of the Clustermatic software




22




                                         Process Space

     •     A process space is:
            •   A pool of process IDs
            •   A process tree
                  •   A set of parent/child relationships
            •   Every instance of the Linux kernel has a process space
     •     A distributed process space allows parts of one node’s process
           space to exist on other nodes




23




                            Distributed Process Space
      •   With a distributed process space, some processes will exist on other nodes
      •   Every remote process has a placeholder in the process tree
           •   All remote processes remain visible
      •   Process related system calls (fork, wait, kill, etc.) work identically on
          local and remote processes
      •   Kill works on remote processes
           •   No runaway processes
      •   Ptrace works on remote processes
           •   strace, gdb, TotalView transparently work on remote processes!

          [Diagram: a node with two remote processes]




24




              Distributed Process Space Example
          [Diagram: the master's process tree contains placeholders for remote processes,
          such as process B, which actually runs on the slave]
     •   The master starts processes on slave nodes
     •   These remote processes remain visible on the master node
     •   Not all processes on the slave are part of the master’s process space


25




                       Process Creation Example
          [Diagram: process A on master and slave; A calls fork() on the slave, creating
          child B, with a placeholder for B on the master]

     •   Process A migrates to the slave node
     •   Process A calls fork() to create a child – process B
     •   A new place holder for B is created
     •   Once the place holder exists B is allowed to run

26




                                  BProc in a Cluster

     •   In a BProc cluster, there is a single master and many slaves
     •   Users (including root) only log into the master
     •   The master’s process space is the process space for the cluster
     •   All processes in the cluster are
          •   Created from the master
          •   Visible on the master
          •   Controlled from the master




27




                                  Process Migration

     •   BProc provides a process migration system to place processes on
         other nodes in the cluster
     •   Process migration on BProc is not
          •   Transparent
          •   Preemptive
               •   A process must call the migration system call in order to move
     •   Process migration on BProc is
          •   Very fast (1.9s to place a 16MB process on 1024 nodes)
          •   Scalable
                •   It can create many copies of the same process (e.g. MPI startup) very efficiently (see the example below)
               •   O(log #copies)
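           •   At the command line this replicated placement is what the bpsh command
               (covered later in this module) uses; a minimal illustration, assuming a
               hypothetical ./hello binary on a two-node cluster
                    # bpsh 0-1 ./hello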



28




                               Process Migration

     •   Process migration does preserve
         •   The contents of memory and memory related metadata
         •   CPU State (registers)
         •   Signal handler state
     •   Process migration does not preserve
         •   Shared memory regions
         •   Open files
         •   SysV IPC resources
         •   Just about anything else that isn’t “memory”




29




                        Running on a Slave Node

     •   BProc is a process management system
         •   All other system calls are handled locally on the slave node
     •   BProc does not impose any extra overhead on non-process related
         system calls
     •   File and Network I/O are always handled locally
         •   Calling open() will not cause contact with the master node
         •   This means network and file I/O are as fast as they can be




30




                                      Implications
     •   All processes are started from the master with process migration
     •   All processes remain visible on the master
          •   No runaways
     •   Normal UNIX process control works for ALL processes in the
         cluster
          •   No need for direct interaction
     •   There is no need to log into a node to control what is running there
     •   No software is required on the nodes except the BProc slave
         daemon
     •   ZERO software maintenance on the nodes!
     •   Diskless nodes without NFS root
          •   Reliable nodes

31




                                          Beoboot

     •   BProc does not provide any mechanism to get a node booted
     •   Beoboot fills this role
          •   Hardware detection and driver loading
          •   Configuration of network hardware
          •   Generic network boot using Linux
          •   Starts the BProc slave daemon
     •   Beoboot also provides the corresponding boot servers and utility
         programs on the front end




32




                             Booting a Slave Node
          Boot message sequence between master and slave:

          Phase 1 (small kernel, minimal functionality):
            1. Slave -> Master: request (who am I?)
            2. Master -> Slave: response (IPs, servers, etc)
            3. Slave -> Master: request phase 2 image
            4. Master -> Slave: phase 2 image
            5. Slave loads the phase 2 image (using magic)

          Phase 2 (operational kernel, full featured):
            6. Slave -> Master: request (who am I again?)
            7. Master -> Slave: response
            8. BProc slave connects to the master



33




                        Loading the Phase 2 Image

     •    Two Kernel Monte is a piece of software which will load a new
          Linux kernel replacing one that is already running
     •    This allows you to use Linux as your boot loader!
     •    Using Linux means you can use any network that Linux supports.
            •   There is no PXE BIOS or Etherboot support for Myrinet, Quadrics or InfiniBand
            •   “Pink” network boots on Myrinet, which allowed us to avoid buying a 1024
                port Ethernet network
     •    Currently supports x86 (including AMD64) and Alpha




34




                              BProc Configuration

     •   Main configuration file
          •   /etc/clustermatic/config
     •   Edit with favorite text editor
          •   Lines consist of comments (starting with #)
          •   Rest are keyword followed by arguments
     •   Specify interface:
          •   interface eth0 10.0.4.1 255.255.255.0
          •   eth0 is interface connected to nodes
          •   IP of master node is 10.0.4.1
          •   Netmask of master node is 255.255.255.0
          •   Interface will be configured when BProc is started


35




                              BProc Configuration

     •   Specify range of IP addresses for nodes:
          •   iprange 0 10.0.4.10 10.0.4.14
          •   Start assigning IP addresses at node 0
          •   First address is 10.0.4.10, last is 10.0.4.14
          •   The size of this range determines the number of nodes in the cluster
     •   Next entries are default libraries to be installed on nodes
          •   Can explicitly specify libraries or extract library information from an
              executable
          •   Need to add entry to install extra libraries
               •   librariesfrombinary /bin/ls /usr/bin/gdb
          •   The bplib command can be used to see libraries that will be loaded



36




                            BProc Configuration

     •   Next line specifies the name of the phase 2 image
          •   bootfile /var/clustermatic/boot.img
          •   Should be no need to change this
     •   Need to add a line to specify kernel command line
          •   kernelcommandline apm=off console=ttyS0,19200
          •   Turn APM support off (since these nodes don’t have any)
          •   Set console to use ttyS0 and speed to 19200
          •   This is used by beoboot command when building phase 2 image




37




                            BProc Configuration

     •   Final lines specify ethernet addresses of nodes, examples given
          #node 0 00:50:56:00:00:00
          #node   00:50:56:00:00:01
          •   Needed so node can learn its IP address from master
           •   The leading 0 is optional; it assigns this address to node 0
     •   Can automatically determine and add ethernet addresses using the
         nodeadd command
     •   We will use this command later, so no need to change now
     •   Save file and exit from editor
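      •   Putting these configuration slides together, a minimal
          /etc/clustermatic/config for this tutorial cluster might look like the
          following sketch (values are the ones used above; the node lines are normally
          added by nodeadd)
                interface eth0 10.0.4.1 255.255.255.0
                iprange 0 10.0.4.10 10.0.4.14
                librariesfrombinary /bin/ls /usr/bin/gdb
                bootfile /var/clustermatic/boot.img
                kernelcommandline apm=off console=ttyS0,19200
                #node 0 00:50:56:00:00:00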




38




                               BProc Configuration

     •   Other configuration files
          • Should not need to be changed for this configuration
     •   /etc/clustermatic/config.boot
          •   Specifies PCI devices that are going to be used by the nodes at boot time
          •   Modules are included in phase 1 and phase 2 boot images
          •   By default the node will try all network interfaces it can find
     •   /etc/clustermatic/node_up.conf
          •   Specifies actions to be taken in order to bring a node up
          •   Load modules
          •   Configure network interfaces
          •   Probe for PCI devices
          •   Copy files and special devices out to node



39




                                 Bringing Up BProc

     •   Check BProc will be started at boot time
          # chkconfig --list clustermatic
     •   Restart master daemon and boot server
          # service bjs stop
          # service clustermatic restart
          # service bjs start
          •   Load the new configuration
          •   BJS uses BProc, so needs to be stopped first
     •   Check interface has been configured correctly
          # ifconfig eth0
          •   Should have IP address we specified in config file


40




                       Build a Phase 2 Image
     •   Run the beoboot command on the master
         # beoboot -2 -n --plugin mon
         • -2        this is a phase 2 image
         • -n        image will boot over network
         • --plugin  add plugin to the boot image
     •   The following warning messages can be safely ignored
         •   WARNING: Didn’t find a kernel module called gmac.o
         •   WARNING: Didn’t find a kernel module called bmac.o

     •   Check phase 2 image is available
         # ls -l /var/clustermatic/boot.img




41




                   Bringing Up The First Node
     •   Ensure both nodes are powered off
     •   Run the nodeadd command on the master
         # /usr/lib/beoboot/bin/nodeadd -a -e -n 0 eth0
         • -a     automatically reload daemon
         • -e     write a node number for every node
         • -n 0   start node numbering at 0
         • eth0   interface to listen on for RARP requests
     •   Power on the first node
     •   Once the node boots, nodeadd will display a message
         New MAC: 00:30:48:23:ac:9c
         Sending SIGHUP to beoserv.



42




                    Bringing Up The Second Node
      •   Power on the second node
     •   In a few seconds you should see another message:
          New MAC: 00:30:48:23:ad:e1
          Sending SIGHUP to beoserv.
     •   Exit nodeadd when second node detected (^C)
     •   At this point, cluster is up and fully operational
     •   Check cluster status
          # bpstat -U

     Node(s)          Status             Mode       User                  Group
     0-1              up                 ---x------ root                  root




43




                                Using the Cluster
     •   bpsh
          •   Migrates a process to one or more nodes
          •   Process is started on front-end, but is immediately migrated onto nodes
          •   Effect similar to rsh command, but no login is performed and no shell is
              started
          •   I/O forwarding can be controlled
          •   Output can be prefixed with node number
          •   Run date command on all nodes which are up
                # bpsh -a -p date
          •   See other arguments that are available
               # bpsh -h



44




                                 Using the Cluster
     •   bpcp
         •   Copies files to a node
         •   Files can come from master node, or other nodes
         •   Note that a node only has a ram disk by default
         •   Copy /etc/hosts from master to /tmp/hosts on node 0
               # bpcp /etc/hosts 0:/tmp/hosts
               # bpsh 0 cat /tmp/hosts




45




                             Managing the Cluster
     •   bpstat
         •   Shows status of nodes
              •   up       node is up and available
              •   down     node is down or can’t be contacted by master
              •   boot     node is coming up (running node_up)
              •   error    an error occurred while the node was booting
         •   Shows owner and group of node
              •   Combined with permissions, determines who can start jobs on the node
         •   Shows permissions of the node
              • ---x------ execute permission for node owner

              • ------x--- execute permission for users in node group

              • ---------x execute permission for other users



46




                             Managing the Cluster
     •   bpctl
          •   Control a node's status
         •   Reboot node 1 (takes about a minute)
               # bpctl -S 1 -R
         •   Set state of node 0
               # bpctl -S 0 -s groovy
               • Only up, down, boot and error have special meaning, everything else
                 means not down
         •   Set owner of node 0
               # bpctl -S 0 -u nobody
         •   Set permissions of node 0 so anyone can execute a job
               # bpctl -S 0 -m 111



47




                             Managing the Cluster
     •   bplib
         •   Manage libraries that are loaded on a node
         •   List libraries to be loaded
               # bplib -l
         •   Add a library to the list
               # bplib -a /lib/libcrypt.so.1
         •   Remove a library from the list
               # bplib -d /lib/libcrypt.so.1




48




                     Troubleshooting Techniques
      •   The tcpdump command can be used to check for node activity
          during and after a node has booted (see the example below)
     •   Connect a cable to serial port on node to check console output for
         errors in boot process
     •   Once node reaches node_up processing, messages will be logged
         in /var/log/clustermatic/node.N (where N is node
         number)
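      •   For example, to watch for traffic from booting nodes on the cluster interface
          configured earlier (the RARP filter is illustrative; nodes use RARP to learn
          their addresses from the master)
           # tcpdump -i eth0 rarp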




49




                             Module 3: LinuxBIOS
                                  Presenter: Ron Minnich

     •   Objective
          •   To introduce LinuxBIOS
          •   Build and install LinuxBIOS on a cluster node
     •   Contents
          •   Overview
          •   Obtaining LinuxBIOS
          •   Source Tree
          •   Building LinuxBIOS
          •   Installing LinuxBIOS
          •   Booting a Cluster without LinuxBIOS
     •   More Information
          •   http://www.linuxbios.org


50




                                   LinuxBIOS Overview

     •   Replacement for proprietary BIOS
     •   Based entirely on open source code
     •   Can boot from a variety of devices
     •   Supports a wide range of architectures
          •   Intel P3 & P4
          •   AMD K7 & K8 (Opteron)
          •   PPC
          •   Alpha
     •   Ports available for many systems
          compaq        ibm        lippert      rlx       tyan           advantech    dell         intel
          matsonic      sis        via          asus      digitallogic   irobot       motorola     stpc
          winfast6300   bcm        elitegroup   lanner    nano           supermicro   bitworks     leadtek
          pcchips       supertek   cocom        gigabit   lex            rcn          technoland



51




                                   Why Use LinuxBIOS?

      •   Proprietary BIOSes are inherently interactive
           •   Major problem when building clusters with 100s or 1000s of nodes
      •   Proprietary BIOSes misconfigure hardware
           •   Impossible to fix
           •   Examples that really happen
                 •   Put in faster memory, but it doesn’t run faster
                 •   Can misconfigure PCI address space - huge problem
      •   Proprietary BIOSes can’t boot over HPC networks
           •   No Myrinet or Quadrics drivers for Phoenix BIOS
     •   LinuxBIOS is FAST
          •   This is the least important thing about LinuxBIOS


52




                                        Definitions

     •   Bus
         •   Two or more wires used to connect two or more chips
     •   Bridge
         •   A chip that connects two or more busses of the same or different type
     •   Mainboard
         •   Aka motherboard/platform
         •   Carrier for chips that are interconnected via buses and bridges
     •   Target
         •   A particular instance of a mainboard, chips and LinuxBIOS configuration
     •   Payload
         •   Software loaded by LinuxBIOS from non volatile storage into RAM

53




                                Typical Mainboard
                  [Diagram: typical mainboard. Two CPUs share the front-side bus to the
                  northbridge, which connects AGP video and DDR RAM; I/O buses (PCI) link
                  the northbridge to the southbridge, which drives the BIOS chip and the
                  legacy devices (keyboard, floppy, etc.)]
54




                             What Is LinuxBIOS?
     •     That question has changed over time
     •     In 1999, at the start of the project, LinuxBIOS was literal
           •   Linux is the BIOS
           •   Hence the name
     •     The key questions are:
           •   Can you learn all about the hardware on the system by asking the hardware
               on the system?
           •   Does the OS know how to do that?
     •     The answer, in 1995 or so on PCs, was “NO” in both cases
     •     OS needed the BIOS to do significant work to get the machine
           ready to use


55




                What Does The BIOS Do Anyway?
     1.    Make the processor(s) sane
     2.    Make the chipsets sane
     3.    Make the memory work (HARD on newer systems)
     4.    Set up devices so you can talk to them
      5.    Set up interrupts so they go to the right place
     6.    Initialize memory even though you don’t want it to
     7.    Totally useless memory test
           •   I’ve never seen a useful BIOS memory test
     8.    Spin up the disks
     9.    Load primary bootstrap from the right place
     10.   Start up the bootstrap

56




         Is It Possible With Open-Source Software?

     •   1995: very hard - tightly coded assembly that barely fits into 32KB
      •   1999: pretty easy - the Flash is HUGE (256KB at least)
     •   So the key in 1999 was knowing how to do the startup
     •   Lots of secret knowledge which took a while to work out
     •   Vendors continue to make this hard, some help
          •   AMD is good example of a very helpful vendor
     •   LinuxBIOS community wrote the first-ever open-source code that
         could:
           •   Start up Intel and AMD SMPs
          •   Enable L2 cache on the PII
          •   Initialize SDRAM and DDRAM

57




         Only Really Became Possible In 1999

     •   Huge 512K byte Flash parts could hold the huge kernel
          •   Almost 400KB
     •   PCI bus had self-identifying hardware
          •   Old ISA, EISA, etc. were DEAD thank goodness!
     •   SGI Visual Workstation showed you could build x86 systems
         without standard BIOS
     •   Linux learned how to do a lot of configuration, ignoring the BIOS
     •   In summary
          •   The hardware could do it (we thought)
          •   Linux could do it (we thought)



58




              LinuxBIOS Image In The 512KB Flash
      PC memory map (not to scale), flash memory at the top, PCI devices in the middle,
      main memory at the bottom:

          0xffffffff / 0xfffffff0    top 16 bytes: jump to BIOS
          0xffff0000                 top 64K bytes: startup code
          0xfff80000                 rest of flash: Linux kernel
          0xc0000000 - 0xb0000000    PCI devices
          0x40000000                 top of memory
          0x00000000                 main memory
59




               The Basic Load Sequence ca. 1999

     •   Top 16 bytes: jump to top 64K
     •   Top 64K:
          •   Set up hardware for Linux
          •   Copy Linux from FLASH to bottom of memory
          •   Jump to 0x100020 (start of Linux)
     •   Linux: do all the stuff you normally do
          •   2.2: not much, was a problem
          •   2.4: did almost everything
     •   In 1999, Linux did not do all we needed (2.2)
     •   In 2000, 2.4 could do almost as much as we want
     •   The 64K bootstrap ended up doing more than we planned

60




               What We Thought Linux Would Do

     •   Do ALL the PCI setup
     •   Do ALL the odd processor setup
     •   In fact, do everything: all the “64K” code had to do was copy Linux
         to RAM




61




              What We Changed (Due To Hardware)
      •    DRAM does not start life operational, unlike in the old days
     •    Turn-on for DRAM is very complex
          •   The single hardest part of LinuxBIOS is DRAM support
     •    To turn on DRAM, you need to turn on chipsets
     •    To turn on chipsets, you need to set up PCI
     •    And, on AMD Athlon SMPs, we need to grab hold of all the CPUs
          (save one) and idle them
     •    So the “64K” chunk ended up doing more




62




                                 Getting To DRAM
                  [Typical mainboard diagram repeated: the path from the CPUs to the DDR
                  RAM runs through the northbridge, so the chipset and PCI setup described
                  on the previous slide must happen first]
63




                                 Another Problem

     •   IRQ wiring can not be determined from hardware!
     •   Botch in PCI results in having to put tables in the BIOS
     •   This is true for all motherboards
     •   So, although PCI hardware is self-identifying, hardware interrupts
         are not
     •   So Linux can’t figure out what interrupt is for what card
     •   LinuxBIOS has to pick up this additional function




64




                           The PCI Interrupt Botch

                  [Diagram: two different wirings of interrupt lines 1-4 to PCI interrupt
                  pins A-D]



65




                 What We Changed (Due To Linux)

     •   Linux could not set up a totally empty PCI bus
          •   Needed some minimal configuration
     •   Linux couldn’t find the IRQs
          •   Not really its fault, but…
     •   Linux needed SMP hardware set up “as per BIOS”
     •   Linux needed per-CPU hardware set up “as per BIOS”
     •   Linux needed tables (IRQ, ACPI, etc.) set up “as per BIOS”
     •   Over time, this is changing
          •   Someone has a patent on the “SRAT” ACPI table
          •   SRAT describes hardware
          •   So Linux ignores SRAT, talks to hardware directly

66




                               As Of 2000/2001

     •   We could boot Linux from flash (quickly)
     •   Linux would find the hardware and the tables ready for it
     •   Linux would be up and running in 3-12 seconds
     •   Problem solved?




67




                                   Problems…

     •   Looking at trends, in 1999 we counted on motherboard flash sizes
         doubling every 2 years or so
     •   From 1999 to 2000 the average flash size either shrank or stayed
         the same
     •   Linux continued to grow in size though…
     •   Linux outgrew the existing flash parts, even as they were getting
         smaller
      •   Vendors went to a socket that couldn’t hold a larger replacement
     •   Why did vendors do this?
          •   Everyone wants cheap mainboards!



68




                           LinuxBIOS Was Too Big

     •   Enter the alternate bootstraps
          •    Etherboot
          •    FILO
          •    Built-in etherboot
          •    Built-in USB loader




69




                                     The New Picture
               Flash (256KB): top 16 bytes, top 64K (LinuxBIOS), next 64K (Etherboot),
               rest empty

               Compact Flash (32MB): Linux kernel (loaded over the IDE channel by the
               bootloader), rest empty
70




                                 LinuxBIOS Now

     •   The aggregate of the “64K loader”, Etherboot (or FILO), and Linux
         from Compact Flash?
          •   Too confusing
     •   LinuxBIOS now means only the 64K piece, even though it’s not
         Linux any more
     •   On older systems, LinuxBIOS loads Etherboot which loads Linux
         from Compact Flash
          •   Compact Flash read as raw set of blocks
     •   On newer systems, LinuxBIOS loads FILO which loads Linux from
         Compact Flash
          •   Compact Flash treated as ext2 filesystem


71




                                  Final Question

     •   You’re reflashing 1024 nodes on a cluster and the power fails
     •   You’re now the proud owner of 1024 bricks, right?
     •   Wrong…
          •   Linux NetworX developed fallback BIOS technology




72




                                 Fallback BIOS
          Flash (256KB) layout: jump to BIOS, fallback BIOS, normal BIOS, fallback FILO,
          normal FILO

      •   “Jump to BIOS” jumps to the fallback BIOS
      •   Fallback BIOS checks conditions
           •   Was the last boot successful?
           •   Do we want to just use fallback anyway?
           •   Does “normal” BIOS look ok?
      •   If things are good, use normal
      •   If things are bad, use fallback
      •   Note there is also a fallback and a normal FILO
           •   These load different files from CF
      •   So the normal kernel, FILO, and BIOS can all be hosed and you’re ok
73




                     Rules For Upgrading Flash

     •    NEVER replace the fallback BIOS
     •    NEVER replace the fallback FILO
     •    NEVER replace the fallback kernel
     •    Mess up other images at will, because you can always fall back




74




                           A Last Word On Flash Size

     •   Flash size decreased to 256KB from 1999-2003
          •   Driven by packaging constraints
     •   Newer technology uses address-address multiplexing to pack lots
         of address bits onto 3 address lines - up to 128 MB!
          •   Driven by cell phone and MP3 player demand
     •   So the same small package can support 1,2,4,8… MB
          •   Will need them: kernel + initrd can be 4MB!
     •   This will allow us to realize our original vision
          •   Linux in flash
     •   Etherboot, FILO, etc., are really a hiccup


75




                                              Source Tree
      •   COPYING
      •   NEWS
      •   ChangeLog
      •   documentation
           •   Not enough!
      •   src
           •   /arch
                •   Architecture specific files, including initial startup code
           •   /boot
                •   Main LinuxBIOS entry code: hardwaremain()
           •   /config
                •   Configuration for a given platform
           •   /console
                •   Device independent console support
           •   /cpu
                •   Implementation specific files
           •   /devices
                •   Dynamic device allocation routines
           •   /include
                •   Header files
           •   /lib
                •   Generic library functions (atoi)



76




                                                  Source Tree
      •   /mainboard
           •   Mainboard specific code
      •   /northbridge
           •   Memory and bus interface routines
      •   /pc80
           •   Legacy crap
      •   /pmc
           •   Processor mezzanine cards
      •   /ram
           •   Generic RAM support
      •   /sdram
           •   Synchronous RAM support
      •   /southbridge
           •   Bridge to interface to legacy crap
      •   /stream
           •   Source of payload data
      •   /superio
           •   Chip to emulate legacy crap
      •   targets
           •   Instances of specific platforms
      •   utils
           •   Utility programs




77




                                        Building LinuxBIOS

     •       For this demonstration, untar source from CDROM
              #       mount /mnt/cdrom
              #       cd /tmp
              #       tar zxvf /mnt/cdrom/LCI/linuxbios/freebios2.tgz
              #       cd freebios2
     •       Find target that matches your hardware
              # cd targets/via/epia
     •       Edit configuration file Config.lb and change any settings
             specific to your board
              •       Should not need to make any changes in this case



78




                            Building LinuxBIOS

     •   Build the target configuration files
          # cd ../..
          # ./buildtarget via/epia
     •   Now build the ROM image
          # cd via/epia/epia
          # make
     •   Should result in a single file
          •   linuxbios.rom
     •   Copy ROM image onto a node
          # bpcp linuxbios.rom 0:/tmp



79




                           Installing LinuxBIOS

     •   This will overwrite old BIOS with LinuxBIOS
          •   Prudent to keep a copy of the old BIOS chip
          •   Bad BIOS = useless junk
     •   Build flash utility
          # cd /tmp/freebios2/util/flash_and_burn
          # make
     •   Now flash the ROM image - please do not do this step
          # bpsh 0 ./flash_rom /tmp/linuxbios.rom
     •   Reboot node and make sure it comes up
          # bpctl -S 0 -R
          •   Use BProc troubleshooting techniques if not!

80




              Booting a Cluster Without LinuxBIOS
     •   Although an important part of Clustermatic, it’s not always possible
         to deploy LinuxBIOS
           •   Requires a detailed understanding of low-level technical details
          •   May not be available for a particular mainboard
     •   In this situation it is still possible to set up and boot a cluster using
         a combination of DHCP, TFTP and PXE
     •   Dynamic Host Configuration Protocol (DHCP)
          •   Used by node to obtain IP address and bootloader image name
      •   Trivial File Transfer Protocol (TFTP)
           •   Simple protocol to transfer files across an IP network
      •   Preboot Execution Environment (PXE)
           •   BIOS support for network booting


81




                               Configuring DHCP
     •   Copy configuration file
           •   cp /mnt/cdrom/LCI/pxe/dhcpd.conf /etc
     •   Contains the following entry (one “host” entry for each node)
          ddns-update-style ad-hoc;
          subnet 10.0.4.0 netmask 255.255.255.0 {
            host node1 {
              hardware ethernet xx:xx:xx:xx:xx:xx;
              fixed-address 10.0.4.14;
               filename "pxelinux.0";
            }
          }
     •   Replace xx:xx:xx:xx:xx:xx with MAC address of node
     •   Restart server to load new configuration
          # service dhcpd restart
82




                            Configuring TFTP
     •   Create directory to hold bootloader
          # mkdir -p /tftpboot
     •   Edit TFTP config file
          •   /etc/xinetd.d/tftp
     •   Enable TFTP
          •   Change
               • disable =         yes
          •   To
               • disable =         no
     •   Restart server
          # service xinetd restart
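      •   After the edit, the relevant parts of /etc/xinetd.d/tftp typically look like
          the sketch below (exact defaults vary between distributions; the important
          settings are disable and the directory passed via -s, which must match
          /tftpboot)
           service tftp
           {
                   disable      = no
                   socket_type  = dgram
                   protocol     = udp
                   wait         = yes
                   user         = root
                   server       = /usr/sbin/in.tftpd
                   server_args  = -s /tftpboot
           }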



83




                             Configuring PXE
     •   Depends on BIOS, enabled through menu
     •   Create correct directories
          # mkdir -p /tftpboot/pxelinux.cfg
     •   Copy bootloader and config file
          # cd /mnt/cdrom/LCI/pxe
          # cp pxelinux.0 /tftpboot/
          # cp default /tftpboot/pxelinux.cfg/
     •   Generate a bootable phase 2 image
          # beoboot -2 -i -o /tftpboot/node --plugin mon
     •   Creates a kernel and initrd image
          •   /tftpboot/node
          •   /tftpboot/node.initrd
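      •   The default file copied above is a standard pxelinux configuration; its
          contents are typically along these lines (shown for illustration only, the
          file on the CD may differ), pointing at the images generated by beoboot
           default node
           label node
                   kernel node
                   append initrd=node.initrd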


84




                               Booting The Cluster
     •   Run nodeadd to add node to config file
          # /usr/lib/beoboot/bin/nodeadd -a -e eth0
     •   Node can now be powered on
     •   BIOS uses DHCP to obtain IP address and filename
     •   pxelinux.0 will be loaded
     •   pxelinux.0 will in turn load phase 2 image and initrd
     •   Node should boot
          •   Check status using bpstat command
     •   Requires monitor to observe behavior of node




85




                             Module 4: Filesystems
                                    Presenter: Ron Minnich

     •   Objective
          •   To show the different kinds of filesystems that can be used with a BProc
              cluster and demonstrate the advantages and disadvantages of each
     •   Contents
          •   Overview
          •   No Local Disk, No Network Filesystem
          •   Local Disks
          •   Global Network Filesystems
               •   NFS
               •   Third Party Filesystems
          •   Private Network Filesystems
               •   V9FS



86




                           Filesystems Overview

     •   Nodes in a Clustermatic cluster do not require any type of local or
         network filesystem to operate
     •   Jobs that operate with only local data need no other filesystems
     •   Clustermatic can provide a range of different filesystem options




87




              No Local Disk, No Network Filesystem

     •   Root filesystem is a tmpfs located in system RAM, so size is limited
         to RAM size of nodes
     •   Applications that need an input deck must copy the necessary files to
         nodes before execution and copy results back afterwards (see the sketch
         below)
          •   30K input deck can be copied to 1023 nodes in under 2.5 seconds
     •   This can be a very fast option for suitable applications
     •   Removes node dependency on potentially unreliable fileserver
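     •   As a sketch (file names are hypothetical), an input deck can be pushed to
         nodes 0-1 before a run and results pulled back afterwards
          # bpsh 0-1 -O /tmp/input.deck cat < input.deck
          # bpcp 0:/tmp/result.dat result_0.dat
          # bpcp 1:/tmp/result.dat result_1.dat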




88




                                      Local Disks

     •   Nodes can be provided with one or more local disks
     •   Disks are automatically mounted by creating an entry in
         /etc/clustermatic/fstab (see the example below)
     •   Solves local space problem, but filesystems are still not shared
     •   Also reduces reliability of nodes since they are now dependent on
         spinning hardware
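     •   For example, an entry such as the following (the device, mount point and
         filesystem type are assumptions; use values that match the node hardware)
          •   /dev/hda1 /scratch ext3 defaults 0 0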




89




                                             NFS

     •   Simplest solution to providing a shared filesystem on nodes
     •   Will work in most environments
     •   Nodes are now dependent on availability of NFS server
     •   Master can act as NFS server
          •   Adds extra load
          •   Master may already be loaded if there are a large number of nodes
     •   Better option is to provide a dedicated server
          •   Configuration can be more complex if server is on a different network
          •   May require multiple network adapters in the master
     •   Performance is never going to be high


90




                Configuring Master as NFS Server

     •   Standard Linux NFS configuration on server
     •   Check NFS is enabled at boot time
          # chkconfig --list nfs
          # chkconfig nfs on
     •   Start NFS daemons
          # service nfs start
     •   Add exported filesystem to /etc/exports
          •   /home 10.0.4.0/24(rw,sync,no_root_squash)
     •   Export filesystem
          # exportfs -a
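     •   Optionally, verify the export is visible (showmount is a standard NFS
         utility, not part of Clustermatic)
          # showmount -e localhost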



91




                  Configuring Nodes To Use NFS
     •   Edit /etc/clustermatic/fstab to mount filesystem
         when node boots
          •   MASTER:/home /home nfs nolock 0 0
          •   MASTER will be replaced with IP address of front end
          •   nolock must be used unless portmap is run on each node
          •   /home will be automatically created on node at boot time
     •   Reboot nodes
          # bpctl -S allup -R
     •   When nodes have rebooted, check NFS mount is available
          # bpsh 0-1 df



92




                           Third Party Filesystems

     •   GPFS (http://www.ibm.com)
     •   Panasas (http://www.panasas.com)
     •   Lustre (http://www.lustre.org)




93




                                              GPFS
     •   Supports up to 2.4.21 kernel (latest is 2.4.26 or 2.6.5)
     •   Data striping across multiple disks and multiple nodes
     •   Client-side data caching
     •   Large blocksize option for higher efficiencies
     •   Read-ahead and write-behind support
     •   Block level locking supports concurrent access to files
     •   Network Shared Disk Model
          •   Subset of nodes are allocated as storage nodes
          •   Software layer ships I/O requests from application node to storage nodes across
              cluster interconnect
     •   Direct Attached Model
          •   Each node must have direct connection to all disks
          •   Requires Fibre Channel Switch and Storage Area Network disk configuration



94




                                             Panasas
     •   Latest version supports 2.4.26 kernel
     •   Object Storage Device (OSD)
          •   Intelligent disk drive
          •   Can be directly accessed in parallel
     •   PanFS Client
          •   Object-based installable filesystem
          •   Handles all mounting, namespace operations, file I/O operations
          •   Parallel access to multiple object storage devices
     •   Metadata Director
          •   Separate control path for managing OSDs
          •   “mapping” of directories and files to data objects
          •   Authentication and secure access
     •   Metadata Director and OSD require dedicated proprietary hardware
     •   PanFS Client is open source

95




                                               Lustre
     •   Lustre Lite supports 2.4.24 kernel
     •   Full Lustre will support 2.6 kernel
      •   Lustre Lite = Lustre minus clustered metadata scalability
     •   All open source
     •   Meta Data Servers (MDSs)
          •   Supports all filesystem namespace operations
          •   Lock manager and concurrency support
          •   Transaction log of metadata operations
          •   Handles failover of metadata servers
     •   Object Storage Targets (OSTs)
          •   Handles actual file I/O operations
          •   Manages storage on Object-Based Disks (OBDs)
          •   Object-Based Disk drivers support normal Linux filesystems
     •   Arbitrary network support through Network Abstraction Layer
     •   MDSs and OSTs can be standard Linux hosts
96




                                             V9FS

     •   Provides a shared private network filesystem
     •   Shared
          •   All nodes running a parallel process can access the filesystem
     •   Private
          •   Only processes in a single process group can see or access files in the
              filesystem
     •   Mounts exist only for duration of process
          •   Node cleanup is automatic
          •   No “hanging mount” problems
     •   Protocol is lightweight
     •   Pluggable authentication services

97




                                             V9FS

     •   Experimental
     •   Can be mounted across a secure channel (e.g. ssh) for additional
         security
     •   1000+ concurrent mounts in 20 seconds
          •   Multiple servers will improve this
     •   Servers can run on cluster nodes or dedicated systems
     •   Filesystem can use cluster interconnect or dedicated network
     •   More information
          •   http://v9fs.sourceforge.net




98




                Configuring Master as V9FS Server

      •   Start server
           # v9fs_server
           •   Can be started at boot if desired
      •   Create mount point on nodes
           # bpsh 0-1 mkdir /private
           • Can add mkdir command to end of node_up script if desired




99




                           V9FS Server Commands

      •   Define filesystems to be mounted on the nodes
           # v9fs_addmount 10.0.4.1:/home /private
      •   List filesystems to be mounted
           # v9fs_lsmount




100




                               V9FS On The Cluster

      •   Once filesystem mounts have been defined on the server,
          filesystems will be automatically mounted when a process is
          migrated to the node
           # cp /etc/hosts /home
           # bpsh 0-1 ls -l /private
           # bpsh 0 cat /private/hosts
      •   Remove filesystems to be mounted
           # v9fs_rmmount /private
           # bpsh 0-1 ls -l /private




101




                                           One Note

      •   Note that we ran the file server as root
      •   You can actually run the file server as yourself (an ordinary user)
      •   If run this way, there is added security
           •   The server can’t run amok
      •   And subtracted security
           •   We need a better authentication system
           •   Can use ssh, but something tailored to the cluster would be better
      •   Note that the server can chroot for even more safety
           •   Or be told to serve from a file, not a file system
      •   There is tremendous flexibility and capability in this approach


102




                                             Also…

      •   Recall that on 2.4.19 and later there is a /proc entry for each
          process
           •   /proc/<pid>/mounts
      •   It really is quite private
           •   There is a lot of potential capability here we have not started to use
      •   Still trying to determine need/demand




103




                                   Why Use V9FS?

      •   You’ve got some wacko library you need to use for one application
      •   You’ve got a giant file which you want to serve as a file system
      •   You’ve got data that you want visible to you only
           •   Original motivation: compartmentation in grids (1996)
      •   You want a mount point but it’s not possible for some reason
      •   You want an encrypted data file system




104




                                      Wacko Library

      •   Clustermatic systems (intentionally) limit the number of libraries on
          nodes
           •   Current systems have about 2GB worth of libraries
           •   Putting all these on nodes would take 2GB of memory!
      •   Keeping node configuration consistent is a big task on 1000+ nodes
           •   Need to do rsync, or whatever
           •   Lots of work, lots of time for libraries you don’t need
      •   What if you want some special library available all the time
           •   Painful to ship it out, set up paths, etc., every time
      •   V9FS allows custom mounts to be served from your home directory


105




                           Giant File As File System

      •   V9FS is a user-level server
           •   i.e. an ordinary program
      •   On Plan 9, there are all sorts of nifty uses of this
           •   Servers for making a tar file look like a read-only file system
           •   Or cpio archive, or whatever…
      •   So, instead of trying to locate something in the middle of a huge tar
          file
           •   Run the server to serve the tar file
           •   Save disk blocks and time




106




                           Data Visible To You Only

      •   This usage is still very important
      •   Run your own personal server (assuming authentication is fixed) or
          use the global server
      •   Files that you see are not visible to anyone else at all
           •   Even root
      •   On Unix, if you can’t get to the mount point, you can’t see the files
      •   On Linux with private mounts, other people don’t even know the
          mount point exists




107




          You Want A Mount Point But Can’t Get One

      •   “Please Mr. Sysadmin, sir, can I have another mount point?”
      •   “NO!”
       •   System administrators have enough to do without having to
           •   Modify fstab on all nodes
           •   Modify permissions on a server
           •   And so on…
      •   Just to make your library available on the nodes?
           •   Doubtful
      •   V9FS gives a level of flexibility that you can’t get otherwise



108




                  Want Encrypted Data File System

      •   This one is really interesting
      •   Crypto file systems are out there in abundance
           •   But they always require lots of “root” involvement to set up
      •   Since V9FS is user-level, you can run one yourself
      •   Set up your own keys, crypto, all your own stuff
      •   Serve a file system out of one big encrypted file
      •   Copy the file elsewhere, leaving it encrypted
           •   Not easily done with existing file systems
      •   So you have a personal, portable, encrypted file system



109




                                So Why Use V9FS?

      •   Opens up a wealth of new ways to store, access and protect your
          data
      •   Don’t have to bother System Administrators all the time
      •   Can extend the file system name space of a node to your
          specification
      •   Can create a whole file system in one file, and easily move that file
          system around (cp, scp, etc.)
      •   Can do special per-user policy on the file system
           •   Tar or compressed file format
           •   Per-user crypto file system
      •   Provides capabilities you can’t get any other way

110




                                 Module 5: Supermon
                                    Presenter: Matt Sottile

      •   Objectives
           •   Present an overview of supermon
           •   Demonstrate how to install and use supermon to monitor a cluster
      •   Contents
           •   Overview of Supermon
           •   Starting Supermon
           •   Monitoring the Cluster
      •   More Information
           •   http://supermon.sourceforge.net




111




                            Overview of Supermon

      •   Provides monitoring solution for clusters
      •   Capable of high sampling rates (Hz)
                         Nodes        All data (Hz)    One variable (Hz)
                            1              5200             9400
                           16               120              200
                           128              13               25
                          1024               1                2
      •   Very small memory and computational footprint
           •   Sampling rates are controlled by clients at run-time
      •   Completely extensible without modification
           •   User applications
           •   Kernel modules
112




                                        Node View

      •   Data sources
           •   Kernel module(s)
           •   User application
      •   Mon daemon
      •   IANA-registered port number
           •   2709




113




                                    Cluster View

      •   Data sources
           •   Node mon daemons
           •   Other supermons
      •   Supermon daemon
      •   Same port number
           •   2709
      •   Same protocol at every level
           •   Composable, extensible




114




                                         Data Format

      •   S-expressions
          •   Used in LISP, Scheme, etc.
          •   Very mature
      •   Extensible, composable, ASCII
          •   Very portable
          •   Easily changed to support richer data and structures
          •   Composable
               •   (expr 1) o (expr 2) = ((expr 1) (expr 2))
      •   Fast to parse, low memory and time overhead




115




                                        Data Protocol
      •   # command
          •   Provides description of what data is provided and how it is structured
          •   Shows how the data is organized in terms of rough categories containing
              specific data variables (e.g. cpuinfo category, usertime variable)
      •   S command
          •   Request actual data
          •   Structure matches that described in # command
      •   R command
          •   Revive clients that disappeared and were restarted
      •   N command
          •   Add new clients
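      •   A raw telnet session illustrates the protocol (a sketch; the s-expression
          replies are omitted; see "Supermon In Action" below)
           # telnet localhost 2709
           #          (request the data schema)
           S          (request one sample)
           ^] close   (telnet escape, then close the connection)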



116




                               User Defined Data

      •   Each node allows user-space programs to push data into mon to
          be sent out on the next sample
      •   Only requirement
          •   Data is arbitrary text
          •   Recommended to be an s-expression
      •   Very simple interface
          •   Uses UNIX domain socket for security
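      •   As a purely hypothetical sketch (the socket path is invented here for
          illustration; see the Supermon documentation for the actual interface),
          an application could push an s-expression to mon through its UNIX
          domain socket with a tool such as socat
           # echo '(myapp (step 42))' | socat - UNIX-CONNECT:/path/to/mon-socket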




117




                              Starting Supermon

      •   Start supermon daemon
          # supermon n0 n1 2> /dev/null &
      •   Check output from kernel
          # bpsh 1 cat /proc/sys/supermon/#
          # bpsh 1 cat /proc/sys/supermon/S
      •   Check sensor output from kernel
          # bpsh 1 cat /proc/sys/supermon_sensors_t/#
          # bpsh 1 cat /proc/sys/supermon_sensors_t/S




118




                            Supermon In Action

      •   Check mon output from a node
           # telnet n1 2709
           S
            ^] close
      •   Check output from supermon daemon
           # telnet localhost 2709
           S
            ^] close




119




                            Supermon In Action

      •   Read supermon data and display to console
           # supermon_stats [options…]
      •   Create trace file for off-line analysis
           # supermon_tracer [options…]
           • supermon_stats can be used to process trace data off-line




120




                                   Module 6: BJS
                                   Presenter: Matt Sottile

      •   Objectives
           •   Introduce the BJS scheduler
           •   Configure and submit jobs using BJS
      •   Contents
           •   Overview of BJS
           •   BJS Configuration
           •   Using BJS




121




                                   Overview of BJS

      •   Designed to cover the needs of most users
      •   Simple, easy to use
      •   Extensible interface for adding policies
      •   Used in production environments
      •   Optimized for use with BProc
      •   Traditional schedulers require O(N) processes, BJS requires O(1)
      •   Schedules and unschedules 1000 processes in 0.1 seconds




122




                                    BJS Configuration

      •   Nodes are divided into pools, each with a policy
      •   Standard policies
           •   Filler
                 •   Attempts to backfill unused nodes
           •   Shared
                 •   Allows multiple jobs to run on a single node
           •   Simple
                 •   Very simple FIFO scheduling algorithm




123




                                        Extending BJS

      •   BJS was designed to be extensible
      •   Policies are “plug-ins”
           •   They require coding to the BJS C API
           •   Not hard, but nontrivial
           •   Particularly useful for installation-specific policies
           •   Based on shared-object libraries
      •   A “fair-share” policy is currently in testing at LANL for BJS
           •   Enforce fairness between groups
           •   Enforce fairness between users within a group
           •   Optimal scheduling between user’s own jobs



124




                                 BJS Configuration
      •   BJS configuration file
           •   /etc/clustermatic/bjs.config
      •   Global configuration options (usually don’t need to be changed)
           •   Location of spool files
                •   spooldir
           •   Location of dynamically loaded policy modules
                •   policypath
           •   Location of UNIX domain socket
                •   socketpath
            •   Location of user accounting log file
                •   acctlog




125




                                 BJS Configuration

      •   Per-pool configuration options
           •   Defines the default pool
                •   pool default
            •   Name of policy module for this pool (must exist in policypath)
                •   policy filler
           •   Nodes that are in this pool
                •   nodes 0-10000
           •   Maximum duration of a job (wall clock time)
                •   maxsecs 86400
           •   Optional: Users permitted to submit to this pool
                •   users
           •   Optional: Groups permitted to submit to this pool
                •   groups
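      •   Putting these directives together, a pool definition in bjs.config might
          look like this (the layout is illustrative; users and groups are optional
          and the names shown are placeholders)
           pool default
             policy filler
             nodes 0-10000
             maxsecs 86400
             users alice bob
             groups cluster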

126




                             BJS Configuration

      •   Restart BJS daemon to accept changes
          # service bjs restart
      •   Check nodes are available
          # bjsstat

      Pool: default          Nodes (total/up/free): 5/2/2
      ID      User            Command               Requirements




127




                                    Using BJS
      •   bjssub
          • Submit a request to allocate nodes
          • ONLY runs the command on the front end

          • The command is responsible for executing on nodes
          -p specify node pool
          -n number of nodes to allocate
          -s run time of job (in seconds)
          -i run in interactive mode
          -b run in batch mode (default)
          -D set working directory
          -O redirect command output to file



128




                                         Using BJS
      •   bjsstat
          •   Show status of node pools
               •   Name of pool
               •   Total number of nodes in pool
               •   Number of operational nodes in pool
               •   Number of free nodes in pool
          •   Lists status of jobs in each pool




129




                                         Using BJS
      •   bjsctl
          • Terminate a running job
          -r specify ID number of job to terminate




130




                               Interactive vs Batch
      •   Interactive jobs
           •   Schedule a node or set of nodes for use interactively
           •   bjssub will wait until nodes are available, then run the command
           •   Good during development
           •   Good for single run, short runtime jobs
           •   “Hands-on” interaction with nodes

           # bjssub -p default -n 2 -s 1000 -i bash
           Waiting for interactive job nodes.
           (nodes 0 1)
           Starting interactive job.
           NODES=0,1
           JOBID=59
           > bpsh $NODES date
           > exit
131




                               Interactive vs Batch
      •   Batch jobs
           •   Schedule a job to run as soon as requested nodes are available
           •   bjssub will queue the command until nodes are available
           •   Good for long running jobs that require little or no interaction

           # bjssub -p default -n 2 -s 1000 -O output -b date
           JOBID=59
            # cat output
           Wed Sep 15 15:48:51 MDT 2004




132




                                How It Works
      •   BJS allocates a set of nodes and changes their permissions
          appropriately
      •   Sets the environment variable NODES to contain the list of
          allocated nodes
      •   This is used by the command to access allocated nodes
      •   For example, say I allocate 2 nodes on the cluster:

            # bjssub -n 2 -s 50 'echo $NODES > stuff'
            # cat stuff
            0,1
            # bjssub -n 2 -s 50 'bpsh $NODES date'



133




                              BJS and Scripts
      •   Given that BJS provides NODES, scripts are easy to schedule!
      •   Good for scheduling tasks that are composed of subtasks, such as
          copying input data to nodes, running a job, and retrieving outputs

           # cat > simple.sh
           #!/bin/sh
           bpsh $NODES -O /tmp/stuff cat < stuff
           bpsh $NODES -O /tmp/count wc /tmp/stuff
           bpsh -p $NODES cat /tmp/count > the_output
           bpsh $NODES rm /tmp/stuff /tmp/count
           ^D
           # bjssub -n 2 -s 10 simple.sh
           # cat the_output

134




                                   Module 7: MPI
                                  Presenter: Matt Sottile

      •   Objective
           •   To demonstrate how to compile, run and debug an MPI program on a
               Clustermatic cluster
      •   Contents
           •   Overview of MPI
           •   How MPI Works with Clustermatic
           •   Setting Up
           •   Compiling an MPI Program
           •   Running an MPI Program
           •   Debugging an MPI Program
           •   Using MPI with BJS
           •   Staging Data




135




                                 Overview of MPI

      •   MPI (Message Passing Interface) is a standard API for parallel
          applications
      •   Provides portable approach for parallel programming on shared-
          memory systems and clusters
      •   Current widely available open-source implementations
           •   MPICH - Argonne
           •   LAM-MPI - Indiana University
           •   OpenMPI - Collaboration between Los Alamos, LAM, and others
      •   All three are available for Clustermatic
           •   Demonstrating MPICH today



136




                   How MPI Works with Clustermatic

      •   Uses standard interface to start MPI program
            •   mpirun [options…] mpi-program
      •   mpirun loads mpi-prog on the master
      •   The mpi-prog process is migrated to each node
      •   Each mpi-prog exchanges connection information via mpirun on the
          master and starts executing

          [Diagram: mpirun on the master starts mpi-prog, which migrates to Node 0 and Node 1]


137




                                       MPI and BJS
      •   As we showed in the BJS section, bjssub allocates nodes and
          sets the NODES environment variable
      •   mpirun looks at this to determine where to run the job
      •   The --np argument is ignored if NODES is set
            •   It uses the scheduled nodes if it detects it has been run by the scheduler
      •   Transparent to the user




138




                                   Setting Up

      •   Set path to include MPICH
           # export PATH=$PATH:/usr/mpich-p4/bin
      •   Copy sample programs
           # mount /mnt/cdrom
           • If not already mounted
           # cd
           # cp /mnt/cdrom/LCI/mpi/* .




139




                     Compiling an MPI Program
      •   Use mpicc to supply correct include files and libraries
           # mpicc -g -o hello hello.c
           • Normal gcc options

      •   Compile the other sample programs
           # mpicc -g -o mpi-ping mpi-ping.c
      •   Jacobi requires the math library to compile
           # mpicc -g -o jacobi jacobi.c -lm




140




                          Running an MPI Program

      •   Run program on two nodes using P4 device
           # mpirun --np 2 --p4 ./hello
           • Must use ./hello since dot is not in path

      •   See what happens on one node
           # mpirun --np 1 --p4 ./hello




141




                           Seeing What’s Running

      •   Start the mpi-ping program running on both nodes
           # mpirun --np 2 --p4 ./mpi-ping 0 100000 10000
      •   Find process IDs in another window
           # ps -mfx
           3273    ?            S         0:00 \_kdeinit: konsole
           3274    pts/1        S         0:00    \_ /bin/bash
           3339    pts/1        S         0:00        \_ mpirun --np 2 …
           3341    ?            RW        0:43            \_ [mpi-ping]
           3343    ?            SW        0:00            |   \_ [mpi-ping]
           3342    ?            SW        0:38            \_ [mpi-ping]
           3344    ?            SW        0:00                \_ [mpi-ping]
      •   MPICH starts 2 processes on each node
           •   Parent is actual running process
142




                     Debugging an MPI Program

      •   Start gdb and attach to running process
           # gdb mpi-ping
           (gdb) attach 3341
      •   You are now attached to a remote process from the master!
      •   Print call stack
           (gdb) bt
      •   Select frame (choose frame in main)
           (gdb) frame 14
      •   Print contents of variable
           (gdb) print status



143




                 Using MPI with BJS (interactive)

      •   Use BJS to allocate 2 nodes for 10 minutes
           # bjssub -s 600 -n 2 -i bash
      •   BJS will start a new (bash) shell
      •   Check nodes have been allocated
           # bjsstat
           Pool: default         Nodes(total/up/free): 5/2/0
           ID     User            Command           Requirements
              0 R root            (interactive)     nodes=2,secs=600
      •   Now run the job on the allocated nodes
           # mpirun --np 2 --p4 ./mpi-ping 0 500 100
      •   When job is complete, exit shell
           # exit
144




                        Using MPI with BJS (batch)

      •    Usually, long production runs should use batch mode
      •    Use BJS to allocate 2 nodes and invoke mpirun
          # bjssub -s 600 -n 2 mpirun --np 2 --p4 ./mpi-ping 0 1000 5
      •    Check nodes have been allocated
            # bjsstat
            Pool: default           Nodes(total/up/free): 5/2/0
            ID     User              Command           Requirements
               1 R root              mpirun --np 2     nodes=2,secs=600
      •    Remove job from queue
            # bjsctl -r 1




145




                     Programs That Work With Data

      •    Programs running on nodes can read and write data via NFS
            •   This can be very slow if large amounts of data are involved
            •   Can also be a problem if many processes are accessing NFS simultaneously
      •    Copy Jacobi input data to /home
            # cp jacobi.input /home
      •    Run the Jacobi program using data obtained via NFS
            # bjssub -n 2 -s 10 mpirun --np 2 --p4 ./jacobi /home
      •    Once run is complete, output files should be in /home
            # ls -l /home/output*




146




                         Staging Data To Nodes

      •   To overcome problems using NFS, data can be copied out to the
          nodes
           # bpcp jacobi.input 0:/tmp/jacobi.input
           # bpcp jacobi.input 1:/tmp/jacobi.input
      •   Now run the Jacobi program
           # bjssub -n 2 -s 10 mpirun --np 2 --p4 ./jacobi /tmp
      •   Copy staged output back to master
           # bpcp 0:/tmp/output_0.dat .
           # bpcp 1:/tmp/output_1.dat .




147




                  Staging Data To Nodes (cont…)

      •   In reality, this would be scripted as follows

           #!/bin/sh
           bpsh $NODES -O /tmp/jacobi.input cat < /tmp/jacobi.input
           mpirun --np 2 --p4 ./jacobi /tmp
            for i in `echo $NODES | tr ',' ' '`
           do
             bpsh $i cat /tmp/output.dat > output_$i.dat
           done




148




                                   Feedback
      •   Please complete a feedback form now and collect a free
          Clustermatic CD
       •   Either leave forms on the table, or return by post (address on form)
      •   Your feedback allows us to improve this tutorial



                  Thanks for attending the Clustermatic Tutorial!
                          We hope you found it valuable




149



