"Using SAS on the RSPH Cluster"
Vernard "V" Martin
Information Services Dept., Rollins School of Public Health

Outline
0. Who am I?
1. Intro to Clusters
2. Intro to Linux
3. Connecting to the Cluster
4. SAS on the cluster
   Working with Very Large Datasets
   Optimizing your code for speed
   Multiple programs simultaneously
   Debugging SAS

Who is Vernard Martin?
Lead Application Developer/Analyst, a.k.a. Subject Matter Expert
Specialize in supporting HPC environments:
1 part system administrator
1 part application developer
1 part HPC researcher

When should you talk to me?
When a new grant is being written!
When your data sets are larger than 2 GB.
When computations run longer than 2 hours.
When you want to get access to the cluster.
When you want software installed on the cluster.
When you have cluster-related problems.
When you want to learn more about Linux.
When you want to learn more about HPC.

What the heck is a cluster?
A cluster is a group of loosely coupled computers that work together so closely that in many respects they can be viewed as a single computer.

Why clusters?
#1 Answer: CHEAP SPEED.
1) Makes our researchers more efficient. We aren't waiting for our answers, so we get more done.
2) Changes the scope of the questions that we ask. Answers in weeks versus years.
3) Can do more with fewer resources. Every researcher effectively has his own supercomputer.

So what is the catch?
TANSTAAFL! or "There ain't no such thing as a free lunch!"
1. Doesn't run Windows, so there is a learning curve.
2. To get much faster results, you have to modify your code.
3. To get the fastest results, you might have to write new code or consult a subject matter expert.

Cluster Architecture
The cluster is comprised of two types of computers:
Head nodes
Provide the desktop environment for program development
Provide access to interactive applications
Provide a test bed for application development
Provide access for short-running jobs.
Compute nodes
Provide the batch environment
Provide access for both short- and long-running jobs
Currently one head node and six compute nodes.

Cluster Hardware
Head node: 8x Intel Xeon X5365 (two quad-core, 3.0 GHz), 32 GB RAM
Compute nodes:
node002 - 8x AMD Opteron 870 (two quad-core, 2.0 GHz), 16 GB RAM
node003 - 8x AMD Opteron 870 (two quad-core, 2.0 GHz), 16 GB RAM
node004 - 4x AMD Opteron 850 (four single-core, 2.4 GHz), 32 GB RAM
node005 - 4x Intel Xeon E7330 (four quad-core, 2.4 GHz), 96 GB RAM
node006 - 2x AMD Opteron 2356 (two quad-core, 2.2 GHz), 16 GB RAM
node007 - 2x AMD Opteron 2356 (two quad-core, 2.2 GHz), 16 GB RAM
That is a total of 18 CPUs with 56 cores.
node005 is currently reserved by the researcher who purchased it. There are plans to share it via checkpointing and priority queues at a later date.

Intro to Linux
Why Linux instead of Windows?
Historically, the Windows environment is designed for interactive processing, not batch processing.
Research in HPC is typically done on Linux, so technology transfer to Linux happens faster.
Cheaper by a factor of 10 at the low end and 1,000,000 at the high end.

So what are the differences?
Filesystem
No concept of "lettered drives". Uses a single hierarchical filesystem instead.
Any part of the file system can be a separate drive, but this is controlled by the system administrators.
File names
Avoid the following: spaces, apostrophes, symbols.
Everything is case sensitive.
The text console is often more useful than the GUI
Some tools can only be accessed via the text console.
Much more flexible and more powerful than the GUI.

So what are the differences? (cont.)
Primarily a batch environment versus an interactive one.
Some GUI tools are lacking. Do initial development on Windows and then migrate.
You can only use the full power of the cluster in batch mode.
File permissions are different
Antiquated compared to Windows.
Sharing space requires you to pay more attention.
No second chances
When you remove a file, it cannot be undeleted. Seriously.
If it wasn't caught by backups, it's GONE! I'm not kidding. Be careful before you delete!

So what are the differences? (cont.)
YUCK! That is a LOT of things to remember!
YES! But it's well worth it. Almost all of the changes are good habits for a Windows environment as well. And the gains are incredible.

So what are the differences? (cont.)
Gains:
Schedule batch jobs to run later (no more having to watch computations)
Conditional computation: based on previous results, you can perform others.
Parallel computation (leverage multiple CPUs)
Distributed computation (leverage multiple machines)
Parallel + distributed = massively large computations

So how do I get to the cluster?
All Things Cluster: http://www.sph.emory.edu/computing/cluster.php
To request an account, the project P.I. sends email to help@sph with the required info:
User's real name, user's SPH/Emory login ID
Department, project description
If >500 GB of space is needed, then the project name and the authorizing account for purchase of disk space.

Accessing the cluster
The cluster is accessed in one of two ways:
A textual interface via an SSH client. SPH IT will install one at your request. Also, PuTTY is a freely available one for home use.
A GUI desktop via the Xming program. There should be a large black X icon on your desktop named CLUSTER. Double-click it to open a remote desktop session to the cluster.

Remote Desktop
Remote Desktop (cont.)
Cluster Desktop (cont.)
INSERT ACTUAL DESKTOP USAGE HERE

Using the cluster: The Terminal
The primary command-line interface is called "the shell".
Everything is a text command. Every character is important. Especially spaces, dashes, and capitalization.
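That case sensitivity trips up newcomers constantly. A tiny illustrative shell session (the file name is hypothetical):

```shell
# Linux file names are case sensitive:
# Report.txt and report.txt are two different names.
touch Report.txt
ls Report.txt                     # succeeds
ls report.txt 2>/dev/null \
  || echo "report.txt not found: case matters"
```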
Open it as a GUI application called "Terminal" or sometimes "Console", or variations on that name.

Fourteen Essential Commands
ls: list directory contents
cd: change directory
pwd: print current directory
cp: copy a file
mv: move a file or directory
rm: delete a file
mkdir: create a directory
rmdir: remove a directory
cat: print file contents
more: scroll through a file
exit: close the shell
nano: edit text files
less: browse file contents
grep: search through files

Six Other Useful Commands
chmod: change permissions
chgrp: change group permissions
history: see old commands
tail: monitor files
clear: clear the screen
man: online manual

Web Tutorials
Beginner's Guide to Command Line Linux: http://floppix.ccai.com/labs.html
Intro to the Linux command line: http://www.tuxfiles.org/linuxhelp/cli.html
The Linux Terminal - a Beginner's Bash: http://linux.org.mt/article/terminal

Let's get started!
Move data files
Use WinSCP (similar to WinFTP).
Rename files BEFORE you move them.
Editing files
Geany, Eclipse (eventually)
Neither is as good as the Windows editor environment.
You can also use vi, joe, emacs, or pico.
Remember! You have to ditch the references to drive letters.

What to do?
Always test the program interactively first:
$ sas
(run your program here)
Then test it via a simple batch method:
$ sas mysasprogram.sas
Check the output/.log file.
Once you are satisfied that it's working, create a wrapper job script by adding lines:
#!/bin/bash
#$ -cwd
<insert your line here>
Then submit it:
$ qsub mysasjob.sh

Success! THAT'S IT!
But you need a bit more to keep track of things.

Working with very large datasets
What is large?
> 100,000 observations
Non-simulation jobs running about an hour
File sizes < 500 MB
What is Very Large?
> 1,000,000 observations (aggregate)
Non-simulation jobs running > 4 hours
File sizes > 1 GB
Examples, original Ns:
Records in original raw data file: 4,398,590 persons
Number of id values occurring 2+ times: 81,532
Total # of records with a duplicate-id problem: 170,029
# of observations in the SAS data set of persons after removing all those with duplicate ids: 4,228,561
Files built for analysis, including intermediate files:
Ns: 1,408,233 to 252,172,992 observations
File sizes: 75 MB to 6.8 GB
Job times: real time from 26 1/2 minutes to 3 1/2 hours, CPU time from 21 1/2 minutes to 2 3/4 hours

Working with VLD (cont.)
Sorting Very Large Datasets with SAS
Q. I'm trying to sort a very large SAS dataset (4.49 GB) and I'm getting the message that SAS is "out of resources". What can I do?
A. Here are some helpful tips passed along to us from other users as well as from SAS:
You can better diagnose where you're running into resource problems by setting a couple of system options: MSGLEVEL=I and FULLSTIMER.
You need to determine whether you're running out of RAM or disk space. It's often the latter.
With regard to memory: if, for example, you have 750 MB of RAM, try setting the MEMSIZE option to 700M and the SORTSIZE option to 650M. You need to leave enough room for the operating system and for SAS overhead. For the cluster, this means setting MEMSIZE to 2000M and SORTSIZE based on your project space size (usually 4000M).
About disk space: a common cause of problems!
If you are sorting a temporary (WORK) data set, you need room for 5 copies of the dataset in your WORK library!
If you're sorting a permanent dataset (two-level name), you need room for 1 copy in the source library, 1 in the destination, and 2 in WORK.
Try sorting subsets of the data and recombining them:
If the file is very 'wide', split it into multiple files which all contain the sort variables but only contain some of the other variables; you can then do a MERGE/BY to recombine them.
If the files are very 'long', try subsetting the file (half the obs in one file, half in another, say), sort the subsets separately, and then interleave them in a data step.
(The obvious) Try to avoid the sort altogether: index the file or rethink the job sequence.

Working with VLD (cont.)
I can't stress enough how much reducing the 'width' of your file by dropping unnecessary variables and setting LENGTHs properly can help. Some combination of the other methods should get you 'sorted out'.

Working with very large datasets
Make your first run with OPTIONS OBS=0; and repeat until there are no syntax errors. This guarantees that you haven't made any simple mistakes.

Working with VLD (cont.)
Use the KEEP= data set option with SET, MERGE, and PROC SORT.
E.g., KEEP= with SET:
data women;
  set in.indiv(keep=personid status emigdth mures98);
run;

Working with VLD (cont.)
Minimize sorting in programs by storing permanent data sets in the sorted order that will be most useful in the future.
Beware that sorting a permanent data set without the OUT= option on PROC SORT will overwrite the data set, and will destroy it completely in a run with system option OBS=0.
Use the system option NOREPLACE to avoid accidental replacement of a permanent data set.

Working with VLD (cont.)
Use LENGTH to set lengths in bytes for each variable:
length personid mothrid fathrid 6;
length child1-child17 $ 4;

Working with VLD (cont.)
Use the WHERE= data set option or the WHERE statement to select observations when possible:
data tmp;
  set parntliv(where=(mother='1' & malive=' '));
run;
Instead of:
data tmp;
  set parntliv;
  if mother='1' & malive=' ';
run;

Working with VLD (cont.)
Use PROC DATASETS to delete, during job execution, temporary data sets that are no longer needed:
proc datasets library=work;
  delete tmp tmp2;
run;
Reuse temporary data set names:
data tmp2;
  set tmp;
run;
......
proc print data=tmp2;
run;

Working with VLD (cont.)
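Several of the preceding tips (KEEP=, WHERE=, LENGTH, and an OBS=0 syntax-only pass) can be combined in one lean step. A sketch, written as a shell heredoc so it can be submitted in batch on the cluster; the data set, library, and variable names are hypothetical:

```shell
# Write a lean SAS step combining the tips above (hypothetical names),
# then do a syntax-only pass with OBS=0 before the real run.
cat > lean_step.sas <<'EOF'
options obs=0 noreplace;  /* syntax check only; drop OBS=0 for the real run */
data women;
  length personid 6;
  set in.indiv(keep=personid status emigdth mures98
               where=(status='1'));
run;
EOF
# On the cluster: sas lean_step.sas, then check lean_step.log
```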
Don't write unnecessary data steps that create unnecessary temporary data sets.
This is BAD:
/* first make a copy of in.indiv and then merge with inw.womfixed */
data tmp;
  set in.indiv(keep=personid dob muborn wherborn county);
run;
data women;
  merge inw.womfixed(in=a) tmp;
  by personid;
  if a=1;
run;

Working with VLD (cont.)
Much better:
/* merge the permanent data sets directly */
data women;
  merge inw.womfixed(in=a) in.indiv(keep=personid dob muborn wherborn county);
  by personid;
  if a=1;
run;

Working with VLD (cont.)
Create random samples of your data to test/develop your programs:
data out.smple5pct;
  set in.indiv;
  if ranuni(2106) < .05 then output; /* 5% random sample of in.indiv */
run;

What is Sun Grid Engine?
Provides:
Job priority policies and configurations
Advance reservation of jobs
Quality of Service support
Node allocation policies
Resource utilization tracking and statistics
Backfill policies
Workload analysis
And it does all this with a few commands.

Sun Grid Engine (cont.)
At a fundamental level, Sun Grid Engine (SGE) is very easy to use. The following sections describe the commands you need to submit simple jobs to run on the cluster. The commands that will be most useful to you are as follows:
qsub - submit a job to the batch scheduler
qstat - examine the status of jobs and queues
qhost - examine the status of hosts in the cluster
qdel - delete a job from the queue

SGE (cont.)
qstat usually only shows you info on your own jobs. If you have nothing running, it shows nothing.
Use 'qstat -f' to get detailed info on your jobs.
Use 'qstat -u "*"' to show *all* jobs, including yours.

Questions and Answers
You can contact me directly:
Email: firstname.lastname@example.org (PREFERRED)
I read email VERY regularly Mon-Fri; I rarely read it on the weekend.
Office: Grace Crum Rollins, Suite 120. You need to ask for me at the front desk.
Phone:
Desk phone: 404-727-2076. Warning: I'm not always at my desk.
Cell phone: 404-313-0282. Warning: I may not answer if I'm busy.
If it's critical, then call back immediately. Leave a message if it's important! No guarantees that I'll respond to your satisfaction.
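As a recap of the submission workflow described earlier, here is a sketch of creating and submitting a wrapper script. The file names are the hypothetical ones used in the slides, and the SAS invocation line is one reasonable choice for the "insert your line here" step, not the only one:

```shell
# Hypothetical SGE wrapper for a batch SAS job.
cat > mysasjob.sh <<'EOF'
#!/bin/bash
#$ -cwd
sas mysasprogram.sas
EOF
chmod +x mysasjob.sh
# On the cluster, submit and monitor it with:
#   qsub mysasjob.sh
#   qstat -f
```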