
“Using SAS on the RSPH Cluster”

Vernard “V” Martin
Information Services Dept.
Rollins School of Public Health
Outline

0. Who am I?
1. Intro to Clusters
2. Intro to Linux
3. Connecting to the Cluster
4. SAS on the cluster
    Working with Very Large Datasets
    Optimizing your code for speed
    Multiple programs simultaneously
    Debugging SAS
Who is Vernard Martin?

   Lead Application Developer/Analyst
       a.k.a. Subject Matter Expert
       Specialize in supporting HPC environments
            1 part system administrator
            1 part application developer
            1 part HPC researcher
When should you talk to me?

   When a new grant is being written!
   When your data sets are larger than 2GB.
   When computations run longer than 2 hours.
   When you want to get access to the cluster.
   When you want software installed on the cluster.
   When you are having cluster-related problems.
   When you want to learn more about Linux.
   When you want to learn more about HPC.
What the heck is a cluster?



A cluster is a group of loosely coupled computers
that work together closely so that in many respects
they can be viewed as though they are a single
computer.
Why clusters?

  #1 Answer: CHEAP SPEED.

  1)   Makes our researchers more efficient. We aren't
       waiting for our answers, so we get more done.

  2)   Changes scope of questions that we ask. Answers in
       weeks versus years.

  3)   Can do more with fewer resources. Every researcher
       effectively has their own supercomputer.
So what is the catch?
                    TANSTAAFL!
                          or
      "There ain't no such thing as a free lunch!"

1.   Doesn't run Windows, so there is a learning curve.

2.   To get much faster results you have to modify
     your code.

3.   To get the fastest results you might have to write
     new code or consult a subject matter expert.
Cluster Architecture

   Cluster is comprised of two types of computers
       Head nodes
            Provide the desktop environment for program development
            Provide access to interactive applications
            Provide a test bed for application development
            Provide access for short running jobs.
       Compute nodes
            Provide the batch environment
             Provide access for both short and long running jobs



Currently one head node and six compute nodes.
Cluster Hardware

   Head Node
      8x Intel Xeon X5365 (two quad-core 3.0 GHz), 32 GB RAM

   Compute nodes
      node002 - 8x AMD Opteron 870 (two quad-core 2.0 GHz), 16 GB RAM
      node003 - 8x AMD Opteron 870 (two quad-core 2.0 GHz), 16 GB RAM
      node004 - 4x AMD Opteron 850 (four single-core 2.4 GHz), 32 GB RAM
      node005 - 4x Intel Xeon E7330 (four quad-core 2.4 GHz), 96 GB RAM
      node006 - 2x AMD Opteron 2356 (two quad-core 2.2 GHz), 16 GB RAM
      node007 - 2x AMD Opteron 2356 (two quad-core 2.2 GHz), 16 GB RAM


That is a total of 18 CPUs with 56 cores.

Node005 is currently reserved by the researcher who purchased it.
There are plans to share it via checkpointing and priority queues at a later date.
Intro to Linux

   Why Linux instead of Windows?
       Historically, the Windows environment is designed for
        interactive processing and not batch processing.

       Research in HPC is typically done on Linux, so the
        technology transfer to Linux happens faster.

       Cheaper by a factor of 10 at the low end and
        1,000,000 at the high end.
So what are the differences?

   Filesystem
       No concept of lettered drives (no "C:" or "D:").
       Uses a single hierarchical filesystem instead.
       Any part of the file system can be a separate drive but
        this is controlled by the system administrators.
   File names
       Avoid the following: spaces, apostrophes, symbols
       Everything is case sensitive.
   Text console is often more useful than the GUI
       Some tools can only be accessed via the text console.
       Much more flexible and powerful than the GUI.
So what are the differences? (cont)

   Primarily a batch environment versus interactive.
       Some GUI tools are lacking.
       Do initial development on Windows and then migrate
       Can only use full power of cluster in batch mode.
   File permissions are different
       Antiquated compared to Windows
       Sharing space requires you to pay more attention.
   No second chances
       When you remove a file, it cannot be undeleted.
       Seriously. If it wasn't caught by backups, it's GONE!
       I'm not kidding. Be careful before you delete!!!!!!
So what are the differences? (cont)

   YUCK! That is a LOT of things to remember!


            YES! But it's well worth it.
 And almost all of the changes are good habits for a
          Windows environment as well.


            And the gains are incredible.
So what are the differences? (cont)

   Gains:
       Schedule batch jobs for running later (no more having
        to watch computations)

       Conditional computation
            Based on previous results, you can perform others.

       Parallel computation (leverage multiple CPUs)

       Distributed computation (leverage multiple machines)

       Parallel + distributed = Massively Large Computations
So how do I get to the cluster?
All Things Cluster
 http://www.sph.emory.edu/computing/cluster.php

To request an account, the project P.I. sends email to
 help@sph with required info:

  User's real name, user's SPH/Emory Login ID
  Department, Project Description

  If >500GB space is needed then project name and
  authorizing account for purchase of disk space.
Accessing the cluster

   The cluster is accessed via one of two ways:

       A textual interface via an SSH client. SPH IT will
        install one at your request. Also, PuTTY is a freely
        available one for home use.

       GUI desktop via the Xming program. There should be
        a large black X-Icon on your desktop named
        CLUSTER. Double click it to open up a remote
        desktop session to the cluster.
Remote Desktop
Remote Desktop (cont.)
Cluster Desktop (cont.)




           INSERT ACTUAL
           DESKTOP USAGE
                HERE
Using the cluster: The Terminal

   The primary interface is the command line,
    called "the shell".

   Everything is a text command.

   Every character is important. Especially spaces,
    dashes, and capitalization.

   Open it as a GUI application called "Terminal",
    "Console", or some variation on that name.
Eleven Essential Commands

ls: list directory contents       cd: change directory
pwd: print current directory      cp: copy a file
mv: move a file or directory      rm: delete a file
mkdir: create a directory         rmdir: remove a directory
cat: print file contents          more: scroll through a file
exit: close the shell             nano: edit text files
less: browse file contents        grep: search through files

                   Six other useful commands

chmod: change permissions         chgrp: change a file's group
history: see old commands         tail: monitor files
clear: clear the screen           man: online manual
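
As a quick taste (the directory and file names here are hypothetical), a
short session using several of these commands might look like:

   $ pwd                       # where am I?
   /home/jdoe
   $ mkdir project1            # create a directory
   $ cd project1               # move into it
   $ cp ~/data/indiv.csv .     # copy a file into the current directory
   $ ls -l                     # list contents with details
   $ less indiv.csv            # browse the file; press q to quit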
Web Tutorials

   Beginner's Guide to Command Line Linux
       http://floppix.ccai.com/labs.html

   Intro to the Linux command line
       http://www.tuxfiles.org/linuxhelp/cli.html

   The Linux Terminal - a Beginner's Bash
       http://linux.org.mt/article/terminal
Let's get started!

   Move Data Files
       WinSCP (similar to WinFTP)
       Rename files BEFORE you move them
   Editing Files
       Geany
       Eclipse (eventually)
       Neither is as good as the SAS editor environment on Windows
       Can use vi, joe, emacs, pico
   Remember!
       Have to ditch the references to drive letters
What to do?
   Always test the program interactively first!
       $ sas
       Run your program here
   Then test it via a simple batch method
       $ sas mysasprogram.sas
       Check the output and .log files
   Once you are satisfied that it's working
       Create a wrapper job script (mysasjob.sh) containing:
           #!/bin/bash
           #$ -cwd
           sas mysasprogram.sas
       Then submit it with: qsub mysasjob.sh
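
A slightly fuller wrapper script, as a sketch (-cwd, -N, -j, and -o are
standard SGE options; the file names are hypothetical):

   #!/bin/bash
   #$ -cwd                 # run the job from the directory you submit it in
   #$ -N mysasjob          # name the job (this is what qstat will show)
   #$ -j y                 # merge error messages into the output file
   #$ -o mysasjob.output   # write the job's output to this file

   sas mysasprogram.sas

Submit it with qsub mysasjob.sh and check on it with qstat.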
Success!




             THAT'S IT!
     But you need a bit more to keep track of things.
Working with very large datasets

   What is large?
       > 100,000 observations
       Non-simulation running about an hour
       File sizes < 500MB

   What is Very Large?
       > 1,000,000 observations (aggregate)
       Non-simulation running > 4 hours
       File sizes > 1GB
Examples:
Original N's:
 records in original raw data file: 4,398,590 persons
 number of id values occurring 2+ times: 81,532
 total # of records with duplicate id problem: 170,029
 # of observations in SAS data set of persons after removing all
  those with duplicate id: 4,228,561

Files built for analysis, including intermediate files:

   N's: 1,408,233 to 252,172,992 observations
   File sizes: 75 MB to 6.8 GB
   Job times:
        Real time from 26 1/2 minutes to 3 1/2 hours,
        CPU time from 21 1/2 minutes to 2 3/4 hours
Working with VLD (cont.)

   Sorting Very Large Datasets with SAS

       Q. I'm trying to sort a very large SAS dataset (4.49G)
        and I'm getting the message that SAS is "out of
        resources". What can I do?

       A. Here are some helpful tips passed along to us from
        other users as well as from SAS:
   You can better diagnose where you're running into
    resource problems by setting a couple of system
    options: MSGLEVEL=i and FULLSTIMER. You
    need to determine whether you're running out of
    RAM or disk space. It's often the latter.
   With regard to memory -- if, for example, you
    have 750M RAM, try setting the MEMSIZE
    option to 700M and the SORTSIZE option to
    650M. You need to leave enough room for the
    operating system & for SAS overhead.

   For the cluster, this means setting MEMSIZE to
    2000M and SORTSIZE based on your project
    space size (usually 4000M), as sketched below.
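
A minimal sketch using those values (MEMSIZE can only be set at SAS
start-up, so both options go on the command line here; MSGLEVEL and
FULLSTIMER can be set inside the program):

   $ sas -memsize 2000M -sortsize 4000M mysasprogram.sas

and, at the top of the program:

   options msglevel=i fullstimer;   /* report resource usage detail in the log */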
   About disk space -- A common cause of
    problems! If you are sorting a temporary (WORK)
    data set, you need to have room for 5 copies of the
    dataset in your WORK library! If you're sorting a
    permanent dataset (two level name), you need
    room for 1 copy in the source library, 1 in the
    destination, and 2 in WORK.
   Try sorting subsets of the data & recombining them
      -- If the file is very 'wide', split it into multiple files
       which all contain the sort variables but only contain
       some of the other variables; you can then do a
       MERGE/BY to recombine them (as sketched below).
      If the files are very 'long', try subsetting the file (1/2
       the obs in one file, 1/2 in another, say), sort them
       separately, and then interleave them in a data step.


   (the obvious) -- Try to avoid the sort altogether: index
    the file or re-think the job sequence.
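
A sketch of both splitting approaches (the data set and variable names
are hypothetical):

   /* 'wide' split: two narrow pieces that both carry the sort key */
   proc sort data=big(keep=personid var1-var50) out=part1;
     by personid;
   run;
   proc sort data=big(keep=personid var51-var100) out=part2;
     by personid;
   run;

   /* recombine with MERGE/BY */
   data recombined;
     merge part1 part2;
     by personid;
   run;

   /* 'long' split: sort each half separately, then interleave with SET/BY */
   data interleaved;
     set half1 half2;
     by personid;
   run;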
Working with VLD (cont.)

   I can't stress enough how much reducing the
    'width' of your file by dropping unnecessary
    variables and setting LENGTHs properly can
    help. Some combination of the other methods
    should get you 'sorted out'.
Working with very large Datasets

   Make your first run with OPTIONS OBS=0;
       Repeat until no syntax errors
       This guarantees that you haven't made any simple
        mistakes (see the example below).
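
For example (a sketch; the data set and variable names are borrowed from
other slides in this deck):

   options obs=0;    /* process zero observations: a pure syntax check */

   data women;
     set in.indiv(keep=personid status);
   run;

   options obs=max;  /* restore before the real run */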
Working with VLD (cont.)

   Use KEEP= data set option with SET, MERGE
    and Proc Sort

       E.g. KEEP= with SET:

       data women;
       set in.indiv(keep=personid status emigdth mures98);
       run;
Working with VLD (cont.)

   Minimize sorting in programs by storing
    permanent data sets in the sorted order that will be
    most useful in the future

       Beware that sorting a permanent data set without the
        OUT= option on Proc Sort will overwrite the data set
        and will destroy it completely in a run with system
        option OBS=0.

       Use the system option NOREPLACE to avoid
        accidental replacement of a permanent data set
        (see the sketch below).
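
For example (a sketch; indiv_bypid is a hypothetical output name):

   options noreplace;   /* refuse to overwrite permanent data sets */

   proc sort data=in.indiv out=in.indiv_bypid;   /* OUT= writes a new data set */
     by personid;
   run;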
Working with VLD (cont.)

   Use LENGTH to set lengths in bytes for each
    variable

       Length personid mothrid fathrid 6;
       Length child1-child17 $ 4;
Working with VLD (cont.)

   Use WHERE= data set option or the WHERE
    statement to select observations when possible
   data tmp;
   set parntliv(where=(mother='1' & malive=' '));

          instead of

   data tmp;
   set parntliv;
   if mother='1' & malive=' ';
Working with VLD (cont.)

   Use Proc Datasets to delete, during job execution,
    temporary data sets no longer needed
            Proc datasets library=work; delete tmp tmp2; run;

   Reuse temporary data set names
       data tmp2;
       set tmp;
       run;
       ...
       proc print data=tmp2;
       run;
Working with VLD (cont.)

Don't write unnecessary data steps that create
 unnecessary temporary data sets
This is BAD:
 /*first make a copy of in.indiv and then merge with inw.womfixed*/
 data tmp;
 set in.indiv(keep=personid dob muborn wherborn county);
 run;

 data women;
 merge inw.womfixed(in=a)
      tmp;
 by personid;
 if a=1; run;
Working with VLD (cont.)

Much better:


  /*merge the permanent data sets directly*/
  data women;
  merge inw.womfixed(in=a)
          in.indiv(keep=personid dob muborn wherborn county);
  by personid;
  if a=1;
  run;
Working with VLD (cont.)

   Create random samples of your data to
    test/develop your programs
data out.smple5pct;
set in.indiv;

if ranuni(2106) < .05 then output; /* 5% random sample of in.indiv */
run;
What is Sun Grid Engine?

   Provides:
       Job priority policies and configurations
       Advance reservation of jobs
       Quality of Service support
       Node allocation policies
       Resource utilization tracking and statistics
       Backfill policies
       Workload analyzing

   And it does this with a few commands.
Sun Grid Engine (cont.)

   At a fundamental level, Sun Grid Engine (SGE) is
    very easy to use. The following sections will
    describe the commands you need to submit simple
    jobs to run on the cluster. The commands that will
    be most useful to you are as follows:

       qsub - submit a job to the batch scheduler
       qstat - examine the status of jobs and queues
       qhost - examine the status of hosts in the cluster
       qdel - delete a job from the queue
SGE (cont.)

   qstat
       qstat usually only shows you info on your jobs. If you
        have nothing running, it shows nothing.
       Use 'qstat -f' to get detailed info on your jobs
       Use 'qstat -u "*"' to show *all* jobs, including yours
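
A typical round trip, as a sketch (the job ID 4242 is hypothetical):

   $ qsub mysasjob.sh     # submit the wrapper script
   Your job 4242 ("mysasjob") has been submitted
   $ qstat                # check on your jobs
   $ qstat -u "*"         # see everyone's jobs
   $ qdel 4242            # changed your mind? delete the job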
Questions and Answers
   You can contact me directly:
       Email: vcmarti@sph.emory.edu (PREFERRED)
            I read email VERY regularly Mon-Fri
            Rarely read it on the weekend.
       Office:
            Grace Crum Rollins Suite 120
            Need to ask for me at the front desk.
       Phone:
            Desk Phone: 404-727-2076
                 Warning: I'm not always at my desk.
            Cell Phone: 404-313-0282
                  Warning: I may not answer if I'm busy.
                  If it's critical then call back immediately.
                  Leave a message if it's important!
                  No guarantees that I'll respond to your satisfaction.

								