Introduction to Linux Clusters

Clarence K. Din
SAS Computing, University of Pennsylvania
March 15, 2004

Cluster Components

Hardware
- Nodes
  - Compute nodes
  - Admin node
  - I/O node
  - Login node
- Disk array
  - RAID5
  - SCSI 320
  - 10k+ RPM, TB+ capacity
  - NFS-mounted from I/O node
- Networking gear
  - Myrinet, gigE, 10/100
  - Switches
  - Cables
  - Networking cards
- Backup device
  - AIT3, DLT, LTO
  - N-slot cartridge drive
  - SAN
- Admin front end
  - Console (keyboard, monitor, mouse)
  - KVM switches
  - KVM cables
- UPS
  - APC SmartUPS 3000
  - 3 per 42U rack
- Rack units
  - 42U, standard or deep

Software
- Operating system
  - Red Hat 9+ Linux
  - Debian Linux
  - SUSE Linux
  - Mandrake Linux
  - FreeBSD and others
- MPI
  - MPICH
  - LAM/MPI
  - MPICH-GM
  - MPI Pro
- Compilers
  - gnu
  - Portland Group
  - Intel
- Scheduler
  - OpenPBS
  - PBS Pro
  - Maui

Filesystem Requirements
- Journalled filesystem
  - Reboots happen more quickly after a crash
  - Slight performance hit for this feature
  - ext3 is a popular choice (old ext2 was not journalled)

Space and Power Requirements
- Space
  - Standard 42U rack is about 24”W x 80”H x 40”D
  - Blade units give you more than 1 node per 1U space in a deeper rack
  - Cable management inside the rack
  - Consider overhead or raised floor cabling for the external cables
- Power
  - A 67-node Xeon cluster consumes 19,872 W, which takes 5.65 tons of A/C to keep cool
  - Ideally, each UPS plug should connect to its own circuit
  - Clusters (especially blades) run real hot; make sure there is adequate A/C and ventilation

Network Requirements
- External network
  - One 10 Mb/s network line is adequate (all computation and message passing is within the cluster)
- Internal network
  - gigE
  - Myrinet
  - Some combo
  - Base your net gear selection on whether most of your jobs are CPU-bound or I/O-bound

Network Choices Compared
- Fast Ethernet (100BT)
  - 0.1 Gb/s (or 100 Mb/s) bandwidth
  - Essentially free
- gigE
  - 0.4 Gb/s to 0.64 Gb/s bandwidth
  - ~$400 per node
- Myrinet
  - 1.2 Gb/s to 2.0 Gb/s bandwidth
  - ~$1000 per node
  - Scales to thousands of nodes
  - Buy fiber instead of copper cables

[Chart: Networking Gear Speeds for Fast Ethernet, gigE, and Myrinet]

I/O Node
- Globally accessible filesystem (RAID5 disk array)
  - NFS share it
  - Put user home directories, apps, and scratch space directories on it so all compute nodes can access them
  - Enforce quotas on home directories
- Backup device
  - Make sure your device and software are compatible with your operating system
  - Plan a good backup strategy
  - Measure how long it takes to restore a single file or a whole filesystem from backups

Admin Node
- Only sysadmins log into this node
  - Accessible only from within the cluster
- Runs cluster management software
  - User and quota management
  - Node management
  - Rebuild dead nodes
  - Monitor CPU utilization and network traffic

Compute Nodes
- Buy the fastest CPUs and bus speed you can afford.
  - Don’t forget that some software companies license their software per node, so factor in software costs
  - Stick with a proven technology over future promise
- Memory size of each node depends on the application mix.
  - 2 GB+ for large calculations
  - < 2 GB for financial databases
- Lots of hard disk space is not so much a priority since the nodes will primarily use shared space on the I/O node.
  - Disks are cheap nowadays... 40GB EIDE is standard per node
- Choose a CPU architecture you’re comfortable with
  - Intel: P4, Xeon, Itanium
  - AMD: Opteron, Athlon
  - Other: G4/G5
- Consider that some algorithms require 2^n nodes
- 32-bit Linux is free or close-to-free; 64-bit Red Hat Linux costs $1600 per node

Login Node
- Users log in here
  - ssh or ssh -X
  - Cluster designers recommend 1 login node per 64 compute nodes
  - Update /etc/profile.d so all users get the same environment when they log in
- Only way to get into the cluster
  - Static IP address (vs. DHCP addresses on all other cluster nodes)
  - Turn on built-in firewall software
- Compile code
  - Licenses should be purchased for this node only
  - Don’t pay for more than you need
  - 2 licenses might be sufficient for code compilation for a department
- Job control (using a scheduler)
  - Choice of queues to access subset of resources
  - Submit, delete, terminate jobs
  - Check on job status

Spare Nodes
- Offline nodes that are put into service when an existing node dies
- Use for spare parts
- Use for testing environment

Cluster Install Software
- Designed to make cluster installation easier (“cluster in a box” concept)
- Shortens the install process using automated steps
- Decreases chance of user error
- Choices:
  - OSCAR
  - Felix
  - IBM XCAT
  - IBM CSM

Cluster Management Software
- Run parallel commands via GUI
  - Or write Perl scripts for command-line control
- Install new nodes, rebuild corrupted nodes
- Check on status of hardware (nodes, network connections)
  - Ganglia
  - xpbsmon
  - Myrinet tests (gm_board_info)
- xpbsmon shows running jobs that were submitted via the scheduler

Cluster Consistency
- Rsync or rdist the /etc/passwd, shadow, gshadow, and group files from login node to compute nodes
- Also consider (auto or manually) rsync’ing /etc/profile.d files, pbs config files, /etc/fstab, etc.

Local and Remote Management
- Local management
  - GUI desktop from console monitor
  - KVM switches to access each node
- Remote management
  - Console switch
    - ssh in and see what’s on the console monitor screen from your remote desktop
  - Web-based tools
    - Ganglia ganglia.sourceforge.net
    - Netsaint www.netsaint.org
    - Big Brother www.bb4.com

Ganglia
- Tool for monitoring clusters of up to 2000 nodes
- Used on over 500 clusters worldwide
- For multiple OS’s and CPU architectures

    # ssh -X coffee.chem.upenn.edu
    # ssh coffeeadmin
    # mozilla &

- Open http://coffeeadmin/ganglia
- Periodically auto-refreshes web page

[Screenshots: Ganglia web interface]

Scheduling Software (PBS)
- Set up queues for different groups of users based on resource needs (e.g. not everyone needs Myrinet; some users only need 1 node)
- The world does not end if one node goes down; the scheduler will run the job on another node
- Make sure pbs_server and pbs_sched are running on login node
- Make sure pbs_mom is running on all compute nodes, but not on login, admin, or I/O nodes

Scheduling Software
- OpenPBS
  - Limit users by number of jobs
  - Good support via messageboards
  - *** FREE ***
- PBS Pro
  - The “pro” version of OpenPBS
  - Limit by nodes, not just jobs per user
  - Must pay for support ($25 per CPU, or $3200 for a 128 CPU cluster)
- Others
  - Load Sharing Facility (LSF)
  - Codine
  - Maui

MPI Software
- MPICH (Argonne National Labs)
- LAM/MPI (OSC/Univ. of Notre Dame)
- MPICH-GM (Myricom)
- MPI Pro (MSTi Software)
  - Programmed by one of the original developers of MPICH
  - Claims to be 20% faster than MPICH
  - Costs $1200 plus support per year

Compilers and Libraries
- Compilers
  - gcc/g77 www.gnu.org/software
  - Portland Group www.pgroup.com
  - Intel www.developer.intel.com
- Libraries
  - BLAS
  - ATLAS - portable BLAS www.math-atlas.sourceforge.net
  - LAPACK
  - SCALAPACK - MPI-based LAPACK
  - FFTW - Fast Fourier Transform www.fftw.org
  - many, many more

Cluster Security
- Securing/patching your Linux cluster is much like securing/patching your Linux desktop
- Keep an eye out for the latest patches
- Install a patch only if necessary and do it on a test machine first
- Make sure there’s a way to back out of a patch before installing it
- Get rid of unneeded software
- Limit who installs and what gets installed
- Close unused ports and services
- Limit login service to ssh between login node and outside world
- Use ssh to tunnel X connections safely
- Limit access using hosts.allow/deny
- Use scp and sftp for secure file transfer
- Carefully configure NFS
- Upgrade to the latest, safest Samba version, if used
- Disable Apache if not needed
- Turn on built-in Linux firewall software

Troubleshooting
- Make sure the core cluster services are running
  - Scheduler, MPI, NFS, cluster managers
- Make sure software licenses are up-to-date
- Scan logs for break-in attempts
- Keep a written journal of all patches, installs, and upgrades
- Sometimes a reboot will fix the problem
  - If you reboot the login node where the scheduler is running, be sure the scheduler is started after the reboot
  - Any jobs in the queues will be flushed
- Hard-rebooting hardware, such as tape drives, usually fixes the problem
- Reboot order: I/O node, login node, admin node, compute nodes (i.e. master nodes first, then slave nodes)
- Rebuilding a node takes 30 minutes with the cluster manager; reconfiguring it may take an hour more

Vendor Choices
- Vendors
  - Dell
  - IBM
  - Western Scientific
  - Aspen Systems
  - Racksaver
  - eRacks
  - Penguin Computing
  - Many, many others
- Go with a proven vendor
- Get every vendor to spec out the same hardware and software before you compare prices
- Compare service agreements
- How fast can they deliver a working cluster?

Buying Commercial Software
- Is it worth the money?
- Is it proven software?
- Are all the bells and whistles really necessary?
- Paid software does not necessarily have the best support

Cluster Tips
- Keep all sysadmin scripts in an easily accessible place
  - /4sysadmin
  - /usr/local/4sysadmin
- Force everyone to use the scheduler to run their jobs (even uniprocessor jobs)
  - Police it
  - Don’t let users get away with things
- Wrapping some applications into a scheduler script can be tricky

Cluster Upgrades
- Nodes become obsolete in 2 to 3 years
- Upgrade banks of nodes at a time
- If upgrading to a new CPU, check for compatibility problems and new A/C requirements
- Upgrading memory and disk space is easy but tedious
- Upgrading the OS can be a major task
- Even installing patches can be a major task

Common Sense Cluster Administration
- Plan a little before you do anything
- Keep a journal of everything you do
- Create procedures that are easy to follow in times of stress
- Document everything!
- Test software before announcing it
- Educate and “radiate” your cluster knowledge to your support team

coffee.chem
- 6 P.I.’s in Chemistry funded it
- Located in FBA121 next to A/C3
- 69 dual-CPU node cluster
  - 64 compute nodes
  - 1 login node
  - 1 admin node
  - 1 I/O node
  - 1 backup node
  - 1 firewall node
- Hardware
  - Myrinet on 32 compute nodes, gigE on other 32
  - 2 TB RAID5 array (1.7 TB formatted)
  - 12-slot, 4.8 TB capacity LTO tape drive
  - 2U fold-out console with LCD monitor, keyboard, trackpad
  - 5 daisy-chained KVM switches
  - 9 APC 3000 UPS units, each connected to its own circuit
  - 3 42U racks
- Software
  - Red Hat 9
  - Felix cluster install and management software
  - PBS Pro
  - MPICH, LAM/MPI, MPICH-GM
  - gnu and Portland Group compilers
  - BLAS, SCALAPACK, ATLAS libraries
  - Gaussian98 (Gaussian03 + Linda soon)
- Layout
  - /data on I/O node (coffeecompute00) holds common apps and user home directories
  - Admin node (coffeeadmin) runs Felix cluster manager
  - Compute nodes (coffeecompute01..64)
  - Every node in the cluster can access /data via NFS
  - Can ssh into compute nodes, admin, and I/O node only via login node
  - Backup node (javabean) temporarily has our backup device attached (we use tar right now)

Logging Into coffee.chem
- Everyone in this room will have user accounts on coffee.chem and home directories in /data/staff
- Our existence on the system is for Chemistry’s benefit
- Support scripts are found in /4sysadmin
- If a reboot is necessary, make sure that PBS is started (/etc/init.d/pbs start)

Compiling and Running Code

    pgCC -Mmpi -o test hello.cpp
    mpirun -np 8 test

- MPICH includes mpicc and mpif77 to compile and link MPI programs
  - These are scripts that pass the MPI library arguments to cc and f77

Running Code

    mpirun -np XXX -machinefile YYY -nolocal test

- -np = number of processors
- -machinefile = filename with list of processors you want to run job on
- -nolocal = don’t run the job locally

Submitting a Job
- 3 queues to choose from
  - Coffeeq
    - General purpose queue
    - 12 hours max run time
    - 16 processors max
  - Espressoq
    - Higher priority than coffeeq
    - 3 weeks max run time
  - Some may still use piq, but this will go away soon
- Prepare a scheduler script (the text in braces is slide annotation, not part of the script)

    #!/bin/tcsh
    #PBS -l arch=linux       {define architecture}
    #PBS -l cput=1:00:00     {define CPU time needed}
    #PBS -l mem=400mb        {define memory space needed}
    #PBS -l nodes=64:ppn=1   {define number of nodes needed}
    #PBS -m e                {mail me the results}
    #PBS -c c                {minimal checkpointing}
    #PBS -k oe               {keep the output and errors}
    #PBS -q coffeeq          {run the job on coffeeq}
    mpirun -np 8 -machinefile machines_gige_32.LINUX /data/staff/din/newhello

- qsub the scheduler script

More PBS Commands
- Check on the status of all submitted jobs with: qstat
- Submit a job with: qsub
- Delete a job with: qdel
- Shut down the PBS server with: qterm (administrators only; to stop a running job, use qdel)
- See all your available compute node resources with: pbsnodes -a

Node Terms
- Login node = Service node = Head node = the node users log into
- Master scheduler node = node where scheduler runs, usually login node
- Admin node = the node the sysadmin logs into to gain access to cluster management apps
- Compute node = one or more nodes that perform pieces of a larger computation
- Storage node = the node that has the RAID array or SAN attached to it
- Backup node = the node that has the backup solution attached to it
- I/O node = can combine features of storage and backup nodes
- Visualization node = the node that contains a graphics card and graphics console; multiple visualization nodes can be combined in a matrix to form a video wall
- Spare node = nodes that are not in service, but can be rebuilt to take the place of a compute node or, in some cases, an admin or login node

References
- Bookman, Charles. Linux Clustering: Building and Maintaining Linux Clusters. New Riders, Indianapolis, Indiana, 2003.
- Howse, Martin. "Dropping the Bomb: AMD Opteron" in Linux User & Developer, Issue 33. pp 33-36.
- Robertson, Alan. "Highly-Affordable High Availability" in Linux Magazine, November 2003. pp 16-21.
- The Seventh LCI Workshop Systems Track Notes. Linux Clusters Institute, March 24-28, 2003.
- Sterling, Thomas et al. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. The MIT Press, Cambridge, Massachusetts, 1999.
- Vrenios, Alex. Linux Cluster Architecture. Sams Publishing, Indianapolis, Indiana, 2002.

coffee.chem Contact List
- Dell hardware problems: 800-234-1490
- Myrinet problems: help@myri.com
- “Very limited” software support: dellsup@mpi-softtech.com
- PGI Compiler issues: help@pgi.com
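
Worked example: sizing A/C from power draw

The Space and Power Requirements slide states that a cluster drawing 19,872 W needs 5.65 tons of A/C. The slides do not show the conversion, so the following is a minimal sketch of how such a figure can be checked, assuming the usual conversions of roughly 3.412 BTU/h per watt and 12,000 BTU/h per ton of refrigeration. The function name cooling_tons is hypothetical and used only for illustration.

```python
# Assumed conversion factors (not taken from the slides).
BTU_PER_HOUR_PER_WATT = 3.412    # 1 W of heat is about 3.412 BTU/h
BTU_PER_HOUR_PER_TON = 12_000    # 1 ton of refrigeration = 12,000 BTU/h


def cooling_tons(watts: float) -> float:
    """Return the tons of A/C needed to remove `watts` of heat."""
    return watts * BTU_PER_HOUR_PER_WATT / BTU_PER_HOUR_PER_TON


if __name__ == "__main__":
    cluster_watts = 19_872  # figure quoted on the slide for a 67-node Xeon cluster
    tons = cooling_tons(cluster_watts)
    print(f"{cluster_watts} W needs about {tons:.2f} tons of A/C")
```

Under these assumptions the result agrees with the slide's 5.65 tons. The same function can be reused when an upgrade changes the power draw, which is the "new A/C requirements" check mentioned under Cluster Upgrades.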

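
Worked example: building a machinefile

The Running Code slide describes -machinefile as a file listing the processors a job should run on, and the coffee.chem slides name the compute nodes coffeecompute01..64. The sketch below generates such a list in Python. It assumes the MPICH convention of one host per line, with an optional ":n" suffix giving the number of CPUs on that host. The helper name machinefile_lines and the choice of node numbers are hypothetical; the slides do not say which 32 nodes are on gigE or Myrinet.

```python
def machinefile_lines(prefix: str, first: int, last: int, cpus_per_node: int = 1) -> list[str]:
    """Build machinefile entries for nodes prefix<first>..prefix<last>.

    Node numbers are zero-padded to two digits, matching names such as
    coffeecompute01. With cpus_per_node > 1, each entry gets a ":n" suffix.
    """
    lines = []
    for number in range(first, last + 1):
        host = f"{prefix}{number:02d}"
        if cpus_per_node > 1:
            host = f"{host}:{cpus_per_node}"
        lines.append(host)
    return lines


if __name__ == "__main__":
    # Hypothetical selection: the first eight compute nodes, both CPUs each.
    for line in machinefile_lines("coffeecompute", 1, 8, cpus_per_node=2):
        print(line)
```

The printed lines can be redirected to a file and passed to mpirun with -machinefile. Check the entries against the MPI implementation actually installed, since machinefile syntax differs between MPICH and LAM/MPI.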