Introduction to Linux Clusters
Clarence K. Din
SAS Computing
University of Pennsylvania
March 15, 2004
Cluster Components
  Hardware
    Nodes
    Disk array
    Networking gear
    Backup device
    Admin front end
    UPS
    Rack units
  Software
    Operating system
    MPI
    Compilers
    Scheduler
Cluster Components
  Hardware
    Nodes
      Compute nodes
      Admin node
      I/O node
      Login node
Cluster Components
  Hardware
    Disk array
      RAID5
      SCSI 320
      10k+ RPM, TB+ capacity
      NFS-mounted from I/O node
Cluster Components
  Hardware
    Networking gear
      Myrinet, gigE, 10/100
      Switches
      Cables
      Networking cards
Cluster Components
  Hardware
    Backup device
      AIT3, DLT, LTO
      N-slot cartridge drive
      SAN
Cluster Components
  Hardware
    Admin front end
      Console (keyboard, monitor, mouse)
      KVM switches
      KVM cables
Cluster Components
  Hardware
    UPS
      APC SmartUPS 3000
      3 per 42U rack
Cluster Components
  Hardware
    Rack units
      42U, standard or deep
Cluster Components
  Software
    Operating system
      Red Hat 9+ Linux
      Debian Linux
      SUSE Linux
      Mandrake Linux
      FreeBSD
      and others
Cluster Components
  Software
    MPI
      MPICH
      LAM/MPI
      MPICH-GM
      MPI Pro
Cluster Components
  Software
    Compilers
      GNU
      Portland Group
      Intel
Cluster Components
  Software
    Scheduler
      OpenPBS
      PBS Pro
      Maui
Filesystem Requirements
  Journalled filesystem
    Reboots happen more quickly after a crash
    Slight performance hit for this feature
    ext3 is a popular choice (the older ext2 was not journalled)
Space and Power Requirements
  Space
    A standard 42U rack is about 24" W x 80" H x 40" D
    Blade units give you more than 1 node per 1U of space in a deeper rack
    Plan cable management inside the rack
    Consider overhead or raised-floor cabling for the external cables
  Power
    A 67-node Xeon cluster consumes 19,872 W, which takes 5.65 tons of A/C to keep cool
    Ideally, each UPS plug should connect to its own circuit
    Clusters (especially blades) run very hot; make sure there is adequate A/C and ventilation
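The cooling figure quoted for the 67-node cluster can be reproduced with a standard unit conversion: one ton of refrigeration is 12,000 BTU/h, and one watt is about 3.412 BTU/h. The sketch below is illustrative only; the function name is made up, and it assumes that essentially all electrical power drawn by the cluster ends up as heat.

```python
BTU_PER_HOUR_PER_WATT = 3.412142   # 1 W = 3.412142 BTU/h
BTU_PER_HOUR_PER_TON = 12_000.0    # 1 ton of refrigeration = 12,000 BTU/h


def cooling_tons(watts: float) -> float:
    """Tons of A/C needed to remove `watts` of heat.

    Assumption: all power drawn by the equipment is dissipated as heat.
    """
    return watts * BTU_PER_HOUR_PER_WATT / BTU_PER_HOUR_PER_TON


if __name__ == "__main__":
    # Power draw quoted on the slide for the 67-node Xeon cluster.
    print(f"{cooling_tons(19_872):.2f} tons of A/C")
```

The same function can be reused when sizing A/C for an upgrade: plug in the new total wattage and compare it with the installed cooling capacity.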
Network Requirements
  External Network
    One 10 Mb/s network line is adequate (all computation and message passing is within the cluster)
  Internal Network
    gigE
    Myrinet
    Some combination of the two
  Base your network gear selection on whether most of your jobs are CPU-bound or I/O-bound
Network Choices Compared
  Fast Ethernet (100BT)
    0.1 Gb/s (or 100 Mb/s) bandwidth
    Essentially free
  gigE
    0.4 Gb/s to 0.64 Gb/s bandwidth
    ~$400 per node
  Myrinet
    1.2 Gb/s to 2.0 Gb/s bandwidth
    ~$1000 per node
    Scales to thousands of nodes
    Buy fiber instead of copper cables
  [Chart: Networking Gear Speeds for Fast Ethernet, gigE, and Myrinet; vertical axis from 0 to 2500]
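To get a feel for what these bandwidth ranges mean for an I/O-bound job, the sketch below estimates the ideal time to move 1 GB over each network, using the bandwidth and per-node cost figures quoted above. It ignores latency and protocol overhead, treats Fast Ethernet as costing $0 because the slide calls it essentially free, and all names in it are illustrative.

```python
# Bandwidth ranges (Gb/s) and rough per-node cost (USD) as quoted on the slide.
# Assumption: Fast Ethernet cost is taken as 0 ("essentially free").
NETWORKS = {
    "Fast Ethernet": {"gbps": (0.1, 0.1), "cost_per_node": 0},
    "gigE": {"gbps": (0.4, 0.64), "cost_per_node": 400},
    "Myrinet": {"gbps": (1.2, 2.0), "cost_per_node": 1000},
}


def transfer_seconds(gigabytes: float, gbps: float) -> float:
    """Ideal time to move `gigabytes` (decimal GB) over a `gbps` gigabit/s link."""
    return gigabytes * 8.0 / gbps


def summarize(gigabytes: float = 1.0) -> dict:
    """Map each network name to its (best, worst) transfer time in seconds."""
    result = {}
    for name, spec in NETWORKS.items():
        low, high = spec["gbps"]
        result[name] = (transfer_seconds(gigabytes, high),
                        transfer_seconds(gigabytes, low))
    return result


if __name__ == "__main__":
    for name, (best, worst) in summarize().items():
        cost = NETWORKS[name]["cost_per_node"]
        print(f"{name:13s} {best:6.1f} s to {worst:6.1f} s per GB, ~${cost} per node")
```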
I/O Node
  Globally accessible filesystem (RAID5 disk array)
    Share it over NFS
    Put user home directories, apps, and scratch space directories on it so all compute nodes can access them
    Enforce quotas on home directories
  Backup device
    Make sure your device and software are compatible with your operating system
    Plan a good backup strategy
    Test how long it takes to restore a single file or an entire filesystem from backups
Admin Node
  Only sysadmins log into this node
    Accessible only from within the cluster
  Runs cluster management software
    User and quota management
    Node management
    Rebuild dead nodes
    Monitor CPU utilization and network traffic
Compute Nodes
  Buy the fastest CPUs and bus speed you can afford.
    Don't forget that some software companies license their software per node, so factor in software costs
    Stick with a proven technology over future promise
  Memory size of each node depends on the application mix.
    2 GB+ for large calculations
    < 2 GB for financial databases
  Lots of hard disk space is not much of a priority, since the nodes will primarily use shared space on the I/O node.
    Disks are cheap nowadays... 40 GB EIDE is standard per node
Compute Nodes
  Choose a CPU architecture you're comfortable with
    Intel: P4, Xeon, Itanium
    AMD: Opteron, Athlon
    Other: G4/G5
  Consider that some algorithms require 2^n nodes (a power of two)
  32-bit Linux is free or close to free; 64-bit Red Hat Linux costs $1600 per node
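When an algorithm needs a power-of-two node count, it helps to check a planned node count up front and to know how many of the available nodes such a job could actually use. A small sketch, with illustrative function names:

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; n & (n - 1) clears the lowest set bit."""
    return n > 0 and (n & (n - 1)) == 0


def largest_power_of_two_at_most(n: int) -> int:
    """Largest 2**k that does not exceed n (n must be at least 1)."""
    if n < 1:
        raise ValueError("need at least one node")
    return 1 << (n.bit_length() - 1)


if __name__ == "__main__":
    for nodes in (48, 64, 100):
        usable = largest_power_of_two_at_most(nodes)
        print(f"{nodes} nodes: power of two? {is_power_of_two(nodes)}; "
              f"a 2^n job can use {usable}")
```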
Login Node
  Users log in here
    ssh or ssh -X
    Cluster designers recommend 1 login node per 64 compute nodes
    Update /etc/profile.d so all users get the same environment when they log in
  Only way to get into the cluster
    Static IP address (vs. DHCP addresses on all other cluster nodes)
    Turn on built-in firewall software
  Compile code
    Licenses should be purchased for this node only
    Don't pay for more than you need
    2 licenses might be sufficient for code compilation for a department
  Job control (using a scheduler)
    Choice of queues to access subset of resources
    Submit, delete, terminate jobs
    Check on job status
Spare Nodes
  Offline nodes that are put into service when an existing node dies
  Use for spare parts
  Use for testing environment
Cluster Install Software
  Designed to make cluster installation easier ("cluster in a box" concept)
  Shortens the install process with automated steps
  Decreases chance of user error
  Choices:
    OSCAR
    Felix
    IBM xCAT
    IBM CSM
Cluster Management Software
  Run parallel commands via GUI
    Or write Perl scripts for command-line control
  Install new nodes, rebuild corrupted nodes
  Check on status of hardware (nodes, network connections)
    Ganglia
    xpbsmon
    Myrinet tests (gm_board_info)
Cluster Management Software
xpbsmon shows jobs running that were submitted via the scheduler
Cluster Consistency
  rsync or rdist
    /etc/passwd, shadow, gshadow, and group files from the login node to the compute nodes
    Also consider rsync'ing (automatically or manually) /etc/profile.d files, PBS config files, /etc/fstab, etc.
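As a rough illustration of pushing the account files from the login node, the sketch below only builds and prints one rsync command per compute node; it does not execute anything. The host naming scheme (`node01`, `node02`, ...) is hypothetical, and a real setup would also need root ssh access from the login node to each compute node.

```python
import shlex

# Account files the slide suggests keeping in sync across the cluster.
ACCOUNT_FILES = ("/etc/passwd", "/etc/shadow", "/etc/gshadow", "/etc/group")


def compute_nodes(count: int, prefix: str = "node") -> list[str]:
    """Hypothetical host names: node01, node02, ..."""
    return [f"{prefix}{i:02d}" for i in range(1, count + 1)]


def rsync_commands(hosts: list[str], files=ACCOUNT_FILES) -> list[list[str]]:
    """One rsync invocation per host; -a preserves ownership, modes, and times."""
    return [["rsync", "-a", *files, f"{host}:/etc/"] for host in hosts]


if __name__ == "__main__":
    for cmd in rsync_commands(compute_nodes(4)):
        print(shlex.join(cmd))  # dry run: print the command instead of running it
```

Printing the commands first makes it easy to review them, or to paste them into a cron job, before anything touches the nodes.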
Local and Remote Management
  Local management
    GUI desktop from console monitor
    KVM switches to access each node
  Remote management
    Console switch
      ssh in and see what's on the console monitor screen from your remote desktop
    Web-based tools
      Ganglia: ganglia.sourceforge.net
      Netsaint: www.netsaint.org
      Big Brother: www.bb4.com
Ganglia
  Tool for monitoring clusters of up to 2000 nodes
  Used on over 500 clusters worldwide
  For multiple OSes and CPU architectures
    # ssh -X coffee.chem.upenn.edu
    # ssh coffeeadmin
    # mozilla &
    Open http://coffeeadmin/ganglia
  Periodically auto-refreshes web page
Scheduling Software (PBS)
  Set up queues for different groups of users based on resource needs (e.g., not everyone needs Myrinet; some users only need 1 node)
  The world does not end if one node goes down; the scheduler will run the job on another node
  Make sure pbs_server and pbs_sched are running on the login node
  Make sure pbs_mom is running on all compute nodes, but not on the login, admin, or I/O nodes
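The daemon placement rules above can be written down as a small table and checked against what is actually running on a node. This is only a sketch: the role labels are illustrative, and how the list of running processes is collected (for example from `ps` output) is left out.

```python
# Which PBS daemons should run on each kind of node, per the rules above.
# The role names used as keys are illustrative labels, not PBS terms.
EXPECTED_DAEMONS = {
    "login": {"pbs_server", "pbs_sched"},
    "compute": {"pbs_mom"},
    "admin": set(),
    "io": set(),
}


def daemon_problems(role: str, running: set[str]) -> list[str]:
    """Compare the PBS daemons running on a node with what its role calls for."""
    expected = EXPECTED_DAEMONS[role]
    pbs_running = {name for name in running if name.startswith("pbs_")}
    problems = [f"missing {d}" for d in sorted(expected - pbs_running)]
    problems += [f"unexpected {d}" for d in sorted(pbs_running - expected)]
    return problems


if __name__ == "__main__":
    # Example: a login node where the scheduler daemon did not come back up.
    print(daemon_problems("login", {"sshd", "pbs_server"}))
```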
Scheduling Software
  OpenPBS
    Limit users by number of jobs
    Good support via message boards
    Free
  PBS Pro
    The "pro" version of OpenPBS
    Limit by nodes, not just jobs per user
    Must pay for support ($25 per CPU, or $3200 for a 128-CPU cluster)
  Others
    Load Sharing Facility (LSF)
    Codine
    Maui
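The PBS Pro support figure is a simple per-CPU multiplication. A minimal sketch using the $25 per CPU price quoted above, where 64 dual-CPU nodes is just one way to arrive at 128 CPUs:

```python
PBS_PRO_SUPPORT_PER_CPU = 25  # USD per CPU, as quoted on the slide


def pbs_pro_support_cost(nodes: int, cpus_per_node: int) -> int:
    """Support cost in USD for a cluster licensed per CPU."""
    return nodes * cpus_per_node * PBS_PRO_SUPPORT_PER_CPU


if __name__ == "__main__":
    print(pbs_pro_support_cost(nodes=64, cpus_per_node=2))
```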
MPI Software
  MPICH (Argonne National Laboratory)
  LAM/MPI (OSC/Univ. of Notre Dame)
  MPICH-GM (Myricom)
  MPI Pro (MSTi Software)
    Programmed by one of the original developers of MPICH
    Claims to be 20% faster than MPICH
    Costs $1200 plus support per year
Compilers and Libraries
  Compilers
    gcc/g77: www.gnu.org/software
    Portland Group: www.pgroup.com
    Intel: www.developer.intel.com
  Libraries
    BLAS
    ATLAS - portable BLAS (www.math-atlas.sourceforge.net)
    LAPACK
    ScaLAPACK - MPI-based LAPACK
    FFTW - Fast Fourier Transform (www.fftw.org)
    many, many more
Cluster Security
  Securing/patching your Linux cluster is much like securing/patching your Linux desktop
  Keep an eye out for the latest patches
  Install a patch only if necessary and do it on a test machine first
  Make sure there's a way to back out of a patch before installing it
Cluster Security
  Get rid of unneeded software
  Limit who installs and what gets installed
  Close unused ports and services
  Limit login service to ssh between login node and outside world
  Use ssh to tunnel X connections safely
  Limit access using hosts.allow/deny
  Use scp and sftp for secure file transfer
Cluster Security
  Carefully configure NFS
  Upgrade to the latest, safest Samba version, if used
  Disable Apache if not needed
  Turn on built-in Linux firewall software
Troubleshooting
  Make sure the core cluster services are running
    Scheduler, MPI, NFS, cluster managers
  Make sure software licenses are up to date
  Scan logs for break-in attempts
  Keep a written journal of all patches, installs, and upgrades
Troubleshooting
  Sometimes a reboot will fix the problem
    If you reboot the login node where the scheduler is running, be sure the scheduler is started after the reboot
    Any jobs in the queues will be flushed
    Hard-rebooting hardware, such as tape drives, usually fixes the problem
Troubleshooting
  Reboot order: I/O node, login node, admin node, compute nodes (i.e., master nodes first, then slave nodes)
  Rebuilding a node takes 30 minutes with the cluster manager; reconfiguring it may take an hour more
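The reboot order can be captured as a priority per node role, so that a list of hosts is always brought up in the same sequence. A sketch with hypothetical host names and role labels:

```python
# Reboot priority by role, following the order given above (lower goes first).
REBOOT_PRIORITY = {"io": 0, "login": 1, "admin": 2, "compute": 3}


def reboot_order(nodes: dict[str, str]) -> list[str]:
    """Sort host names by role priority, then by name for a stable order.

    `nodes` maps a host name to its role; both are illustrative here.
    """
    return sorted(nodes, key=lambda host: (REBOOT_PRIORITY[nodes[host]], host))


if __name__ == "__main__":
    cluster = {
        "c02": "compute",
        "adm": "admin",
        "c01": "compute",
        "login": "login",
        "io": "io",
    }
    print(reboot_order(cluster))
```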
Vendor Choices
  Dell, IBM, Western Scientific, Aspen Systems, Racksaver, eRacks, Penguin Computing, and many, many others
  Go with a proven vendor
  Get every vendor to spec out the same hardware and software before you compare prices
  Compare service agreements
  How fast can they deliver a working cluster?
Buying Commercial Software
  Is it worth the money?
  Is it proven software?
  Are all the bells and whistles really necessary?
  Paid software does not necessarily have the best support
Cluster Tips
  Keep all sysadmin scripts in an easily accessible place
    /4sysadmin
    /usr/local/4sysadmin
Cluster Tips
  Force everyone to use the scheduler to run their jobs (even uniprocessor jobs)
    Police it
    Don't let users get away with things
    Wrapping some applications into a scheduler script can be tricky
Cluster Upgrades
  Nodes become obsolete in 2 to 3 years
  Upgrade banks of nodes at a time
  If upgrading to a new CPU, check for compatibility problems and new A/C requirements
  Upgrading memory and disk space is easy but tedious
Cluster Upgrades
  Upgrading the OS can be a major task
  Even installing patches can be a major task
Common Sense Cluster Administration
  Plan a little before you do anything
  Keep a journal of everything you do
  Create procedures that are easy to follow in times of stress
  Document everything!
Common Sense Cluster Administration
  Test software before announcing it
  Educate and "radiate" your cluster knowledge to your support team
coffee.chem
  6 PIs in Chemistry funded it
  Located in FBA121 next to A/C3
  69-node cluster of dual-CPU machines
    64 compute nodes
    1 login node
    1 admin node
    1 I/O node
    1 backup node
    1 firewall node
coffee.chem
  Myrinet on 32 compute nodes, gigE on the other 32
  2 TB RAID5 array (1.7 TB formatted)
  12-slot, 4.8 TB capacity LTO tape drive
  2U fold-out console with LCD monitor, keyboard, trackpad
coffee.chem
  5 daisy-chained KVM switches
  9 APC 3000 UPS units, each connected to its own circuit
  Three 42U racks
coffee.chem
  Red Hat 9
  Felix cluster install and management software
  PBS Pro
  MPICH, LAM/MPI, MPICH-GM
  GNU and Portland Group compilers
  BLAS, ScaLAPACK, ATLAS libraries
  Gaussian98 (Gaussian03 + Linda soon)
coffee.chem
  /data on I/O node (coffeecompute00) holds common apps and user home directories
  Admin node (coffeeadmin) runs Felix cluster manager
  Compute nodes (coffeecompute01..64)
  Every node in the cluster can access /data via NFS
coffee.chem
  Can ssh into compute nodes, admin, and I/O node only via login node
  Backup node (javabean) temporarily has our backup device attached (we use tar right now)
Logging Into coffee.chem
  Everyone in this room will have user accounts on coffee.chem and home directories in /data/staff
  Our existence on the system is for Chemistry's benefit
  Support scripts are found in /4sysadmin
  If a reboot is necessary, make sure that PBS is started (/etc/init.d/pbs start)
Compiling and Running Code
  pgCC -Mmpi -o test hello.cpp
  mpirun -np 8 test
Compiling Code
  pgCC -Mmpi -o test hello.cpp
  MPICH includes mpicc and mpif77 to compile and link MPI programs
    These are scripts that pass the MPI library arguments to cc and f77
Running Code
  mpirun -np XXX -machinefile YYY -nolocal test
    -np = number of processes to start (typically one per processor)
    -machinefile = name of a file listing the machines you want to run the job on
    -nolocal = don't run the job on the local machine
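A machinefile in its plain form is a text file with one host name per line. The sketch below writes such a list and assembles the matching mpirun command line from the options described above; the host names are hypothetical, and the command is only built, not executed.

```python
def machinefile_text(hosts: list[str]) -> str:
    """One host name per line, the plain format read by mpirun -machinefile."""
    return "".join(f"{host}\n" for host in hosts)


def mpirun_command(np: int, machinefile: str, program: str,
                   nolocal: bool = True) -> list[str]:
    """Build the mpirun argument list described above (not executed here)."""
    cmd = ["mpirun", "-np", str(np), "-machinefile", machinefile]
    if nolocal:
        cmd.append("-nolocal")
    return cmd + [program]


if __name__ == "__main__":
    hosts = [f"node{i:02d}" for i in range(1, 9)]  # hypothetical host names
    print(machinefile_text(hosts), end="")
    print(" ".join(mpirun_command(len(hosts), "machines.txt", "./test")))
```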
Submitting a Job
  3 queues to choose from
    coffeeq
      General-purpose queue
      12 hours max run time
      16 processors max
    espressoq
      Higher priority than coffeeq
      3 weeks max run time
    Some may still use piq, but this will go away soon
Submitting a Job
  Prepare a scheduler script
    #!/bin/tcsh
    # define architecture
    #PBS -l arch=linux
    # define CPU time needed
    #PBS -l cput=1:00:00
    # define memory space needed
    #PBS -l mem=400mb
    # define number of nodes needed
    #PBS -l nodes=64:ppn=1
    # send mail when the job ends
    #PBS -m e
    # minimal checkpointing
    #PBS -c c
    # keep the output and errors
    #PBS -k oe
    # run the job on coffeeq
    #PBS -q coffeeq
    mpirun -np 8 -machinefile machines_gige_32.LINUX /data/staff/din/newhello
  qsub the scheduler script
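When many similar jobs are submitted, the scheduler script can be generated rather than edited by hand. The sketch below renders a script with a subset of the directives shown above; the function and its parameters are illustrative, not part of PBS. It derives the mpirun -np value from nodes times ppn so that the resource request and the process count stay in step.

```python
def pbs_script(queue: str, nodes: int, ppn: int, cput: str, mem: str,
               machinefile: str, program: str) -> str:
    """Render a tcsh PBS job script; -np is derived from nodes * ppn."""
    lines = [
        "#!/bin/tcsh",
        "#PBS -l arch=linux",
        f"#PBS -l cput={cput}",
        f"#PBS -l mem={mem}",
        f"#PBS -l nodes={nodes}:ppn={ppn}",
        f"#PBS -q {queue}",
        f"mpirun -np {nodes * ppn} -machinefile {machinefile} {program}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(pbs_script("coffeeq", nodes=8, ppn=1, cput="1:00:00", mem="400mb",
                     machinefile="machines_gige_32.LINUX",
                     program="/data/staff/din/newhello"), end="")
```

The generated text can be saved to a file and handed to qsub in the usual way.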
More PBS Commands
  Check on the status of all submitted jobs with: qstat
  Submit a job with: qsub
  Delete a job (queued or running) with: qdel
  Shut down the PBS server (admins only) with: qterm
  See all your available compute node resources with: pbsnodes -a
Node Terms
  Login node = Service node = Head node = the node users log into
  Master scheduler node = node where scheduler runs, usually login node
  Admin node = the node the sysadmin logs into to gain access to cluster management apps
  Compute node = one or more nodes that perform pieces of a larger computation
  Storage node = the node that has the RAID array or SAN attached to it
  Backup node = the node that has the backup solution attached to it
  I/O node = can combine features of storage and backup nodes
  Visualization node = the node that contains a graphics card and graphics console; multiple visualization nodes can be combined in a matrix to form a video wall
  Spare node = nodes that are not in service, but can be rebuilt to take the place of a compute node or, in some cases, an admin or login node
coffee.chem Contact List
  Dell hardware problems: 800-234-1490
  Myrinet problems: help@myri.com
  "Very limited" software support: dellsup@mpi-softtech.com
  PGI compiler issues: help@pgi.com