
Computing for Research I, Spring 2011, Lecture 1: January 5


									Computing for Research I
     Spring 2011

  Lecture 1: January 5

       Primary Instructor:
    Elizabeth Garrett-Mayer
• Description: Students learn to use the primary statistical software
  packages for data manipulation and analysis, including (but not limited
  to): R, R Bioconductor, SAS, SAS macro, and Stata. Additionally, students
  will learn: how to use the division's high speed cluster-computing
  environment, how to practice the principles of reproducible research using
  Sweave in R, and how to use LaTeX and BibTeX for manuscript and
  presentation development. This is a three credit course.

• Course Organization: This course is given by the entire
  division. Instructors will take turns giving lectures in their areas of
  expertise.

• Textbooks: No textbook. Reading material (primarily found on the web)
  will be provided as necessary.

• Prerequisites: Biometry 700
• Grading: Instructors will give short exercises to be completed and turned
  in to the primary instructor by the Wednesday of the week after they
  were assigned (e.g., assignments given on Monday Feb 28 and
  Wednesday Mar 2 are both due on Wednesday Mar 9). Together the
  assignments account for 75% of the course grade, with each assignment
  counting equally. There will be a final project which will account for
  20% of the course grade. The remaining 5% of the course grade will
  reflect class participation.

• Homeworks Policy: Homeworks are due by 5pm on the due date. All
  homeworks should be emailed to the primary instructor
  or turned in at lecture time. Asking for extensions
  on homeworks is strongly discouraged. However, it is expected that, on
  occasion, extenuating circumstances may arise. Therefore, the policy is
  that each student may request an extension on homework twice and the
  extension is to be no more than 2 days. After using two extensions, no
  more extensions will be granted except with a medical note.
Primary Instructor: Elizabeth Garrett-Mayer
Contact Info: Hollings Cancer Center, Rm. 118G
     (preferred mode of contact is email)
        Time: Mondays and Wednesdays, 2:30-4:00
    Location: Cannon 301, Room 305V
Office Hours: TBA, or by appointment

      Office Hours: The primary instructor will have office hours and be available by
      appointment. However, given the nature of the course, the primary instructor
      may not be knowledgeable regarding all of the topics covered. As a result,
      additional help from the lecturers may be needed to complete assignments. Be
      considerate and responsible in scheduling time with course instructors and
      recognize that they all have busy schedules.
               Course Objectives
Upon successful completion of the course, the student will be
able to
• Import, perform simple analyses and produce graphical
  displays in Stata, SAS and R
• Create new functions or commands in each of R, Stata and
  SAS
• Generate professional quality scientific manuscripts and
  presentations using Latex along with statistical software
• Perform standard power and sample size calculations using
  available software and simulations.
• Operate the division’s cluster computer with batch
              Schedule, briefly
•   SAS
•   R
•   Batch processing
•   Latex + Sweave
•   Data management
•   Etc: power calculations, other packages
                    Detailed Schedule
Date       Lecturer            Topic
W Jan 5    E. Garrett-Mayer    Introduction; Overview and Principles
M Jan 10   Jody Ciolino        SAS: introduction
W Jan 12   Sharon Yeatts       SAS: IML
W Jan 19   Renee Martin        SAS: macros
M Jan 24   Valerie Durkalski   SAS: proc tabulate and proc report
W Jan 26   Nate Baker          SAS: Gplot
M Jan 31   Annie Simpson       SAS: ODS
W Feb 2    Jordan Elm          SAS: array processing
M Feb 7    E. Garrett-Mayer    Stata: introduction, “immediate” commands
W Feb 9    Joan Cunningham     Stata: data organization, manipulation
M Feb 14   E. Garrett-Mayer    Stata: exploratory data analysis; graphical displays
W Feb 16   E. Garrett-Mayer    Stata: regression commands
M Feb 21   E. Garrett-Mayer    Stata: programming and do files
W Feb 23   E. Garrett-Mayer    R: introduction to object-oriented programming
M Feb 28   Caitlyn Ellerbe          R: downloading packages/libraries; data input &
W Mar 2    Anthony Parker           R: basic language structure (ifelse, where, looping)
M Mar 7    Cody Chiuzan             R: graphics
W Mar 9    E. Garrett-Mayer         R: exploratory data analysis
M Mar 21   Yanqui Weng              R: simulations; random number generation; sampling
                                    from distributions
W Mar 23   Stacia DeStantis         R: regression commands
M Mar 28   Bethany Wolf             R: bioconductor
W Mar 30   Adrian Nida              Batch processing (using R) and cluster computing
M Apr 4    Mulugeta Gebregziabher   Latex and Bibtex: manuscript production
W Apr 6    Dipankar Bandyopadhay    Latex and Bibtex: presentations
M Apr 11   Betsy Hill               Reproducible Research: Sweave and StatWeave
W Apr 13   Amy Wahlquist            Data management: RedCap
M Apr 18   Annie Simpson            Data management principles & Excel
W Apr 20                            Other packages (TBA)
M Apr 25   Paul Nietert             Sample size calculation software packages
W Apr 27
• We are meeting in a regular classroom
• Bringing laptops is encouraged
• Data, code, etc. needed for class will be on the
  website prior to class
• For optimal interface, install packages ASAP
   –   R (
   –   Stata (DBE helpdesk request)
   –   SAS (DBE helpdesk request)
   –   WinEdt (
• Create a bookmark to the course website
                 Lecture Notes
• Every lecturer will have his/her own style
• Notes may be
   – prepared ahead of time and posted
   – prepared and posted after the lecture
   – nonexistent
• Lecture notes will NOT be printed by the
  instructors prior to lecture.
• If they are available and you would like a paper
  copy, it is your responsibility to print them out.
• 2011: to be a successful
  biostatistician/epidemiologist, you MUST be
  competent on the computer.
• Historically: students learned in labs from
• Moving forward:
  – many options for analysis and generation of results
  – Efficiency in computing is essential.
  – Your computer IS your lab!
            Data analysis software
• In this course:
  – SAS
  – Stata
  – R
• Many other options:
       SPSS          S, Splus   Epi Info   GraphPad
       JMP           Matlab     JAGS       Systat
       Minitab       EGRET      BMDP       MedCalc
       Mathematica   WinBugs    GLIM       ….
                      SAS: History
• SAS was conceived by Anthony J. Barr in 1966. As a North Carolina
  State University graduate student from 1962 to 1964, Barr had
  created an analysis of variance modeling language. From 1966 to
  1968, Barr developed the fundamental structure and language of
  SAS.
• In January 1968, Barr and James Goodnight collaborated,
  integrating new multiple regression and analysis of variance
  routines developed by Goodnight into Barr's framework.
• By 1971, SAS was gaining popularity within the academic
  community. One strength of the system was analyzing experiments
  with missing data, which was useful to the pharmaceutical and
  agricultural industries, among others.
• In 1976, SAS Institute, Inc. was incorporated.
• The latest version, SAS version 9.2, was released in March 2008
             SAS: functioning
• SAS consists of a number of components,
  which organizations separately license and
  install as required.
• Licenses expire! Software cannot be used
  after expiration (unless renewed)
              Why (or why not) SAS?
• Most commonly used in pharma (although that may be changing!)
• FDA likes SAS
• Many jobs for MS statisticians and/or epidemiologists require SAS
• The most common language

• Becoming less the choice of academia
   – Updates are less frequent than freeware
   – ‘pros’ of competitors are starting to outweigh the ‘pros’ of SAS
       •   Licensing costs
       •   Slow to add new functionality
       •   Lack of consistency with syntax
       •   Learning curve is slower than other programs that now have similar capability
• Stata is a general-purpose statistical software
  package created in 1985 by StataCorp.
• Most of its users work in research, especially
  in the fields of economics, sociology, political
  science, biomedicine and epidemiology.
• Relatively simple to learn yet powerful
• Latest version is Stata 11 (released 2009).
• Lots of add-ons for epi users
         Why (or why not) Stata?
• Relatively inexpensive (especially as student or single-
• Biomedical focus so output, functions are tailored to
  medical research
• Fast and big: can handle and manipulate large datasets
• Sophisticated with wide range of tools
• Easy to learn language with consistent syntax
• Graphics are not as good as other packages (although
  that has improved)
• Programming (simulations, loops, etc) is more
                            R: History
• R is a programming language and software environment for statistical
  computing and graphics.
• The R language has become a de facto standard among statisticians for the
  development of statistical software, and is widely used for data analysis.
• R is an implementation of the S programming language. S was created by
  John Chambers while at Bell Labs. R was created by Ross Ihaka and Robert
  Gentleman, and is now developed by the R Development Core Team. R is
  named partly after the first names of the first two R authors, and partly as
  a play on the name of S.
• R source code is freely available under the GNU General Public License.
• The capabilities of R are extended through user-submitted packages,
  which allow specialized statistical techniques, graphical devices, as well as
  import/export capabilities to many external data formats. A core set of
  packages is included with the installation of R, with more than 2460 (as
  of July 2010) available at the Comprehensive R Archive Network (CRAN).
              R: functionality
• Freeware: latest version can be installed
  anywhere at anytime
• Packages (aka libraries) that are user-contributed
  allow additional functionality
• Relatively simple interface
                Why (or why not) R?
•   Great for programming and simulations
•   Handles looping well
•   Flexible language
•   FREE!
•   User-contributed packages included in real-time (i.e., no delay in
    their availability)
•   Most PhD Biostatistics programs teach their students R and
    many/most academic statisticians in top programs use R.
•   Interfaces nicely with other programs such as Latex (Sweave),
    WinBugs, C, Emacs.
•   Can be clunky for data management.
•   Memory is not as good as SAS and Stata
•   Quality-control on user-contributed packages not evident
• Not a question of which one.
• Question is “for my current problem, which
  package makes the most sense to use?”
• Each has strengths and weaknesses
                 Latex and Sweave
• LaTeX is a document markup language and document preparation
  system for the TeX typesetting program.
• The term LaTeX refers only to the language in which documents are
  written, not to the editor used to write those documents. In order
  to create a document in LaTeX, a .tex file must be created using
  some form of text editor. (e.g. WinEdt)
• LaTeX is most widely used by mathematicians, scientists, engineers,
  philosophers, lawyers, linguists, economists, researchers, and other
  scholars in academia.
• LaTeX is used because of the high quality of typesetting achievable
  by TeX. The typesetting system offers extensive facilities for
  automating most aspects of typesetting and desktop publishing,
  including numbering and cross-referencing, tables and figures, page
  layout and bibliographies.
                 Latex and Sweave
• Sweave is a function in R that enables integration of R code into
  LaTeX documents. The purpose is "to create dynamic reports, which
  can be updated automatically if data or analysis change".
• The data analysis is performed at the moment of writing the report,
  or more exactly, at the moment of compiling the Sweave code with
  Sweave (i.e., essentially with R) and subsequently with LaTeX. This
  can facilitate the creation of up-to-date reports for the author.
• Because the Sweave files, together with any external R files that
  might be sourced from them and the data files, contain all the
  information necessary to trace back all steps of the data analyses,
  Sweave also has the potential to make research more transparent
  and reproducible to others. However, this is only the case to the
  extent that the author makes the data and the R and Sweave code
  available.
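The idea of a report that updates itself when the data change can be sketched outside of R as well. The following is a rough Python analogy only, not Sweave itself: the template text, the function name render_summary and the measurement values are all made up for illustration. The summary sentence is never typed by hand; it is regenerated from the data each time the script runs.

```python
from statistics import mean, stdev
from string import Template

# Hypothetical report fragment; $n, $mean and $sd are filled in from the data.
SUMMARY = Template("The mean S1P level across $n samples was $mean (SD $sd).")


def render_summary(values):
    """Recompute the statistics and return the finished sentence."""
    return SUMMARY.substitute(
        n=len(values),
        mean=f"{mean(values):.1f}",
        sd=f"{stdev(values):.1f}",
    )


# Illustrative measurements only (an assumption for this sketch).
measurements = [197.2, 148.4, 214.5]
print(render_summary(measurements))

# When a new measurement arrives, rerunning the script updates the sentence.
print(render_summary(measurements + [163.1]))
```

Sweave applies the same principle to a whole LaTeX document, with R code chunks taking the place of the function call.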
          Sample size and power
• We don’t really use textbook formulas anymore
  to do simple power calculations (just like we don’t
  really invert matrices by hand when we analyze data).
• There are a number of packages that quickly and
  easily perform simple power calculations
• R, SAS and Stata can do some.
• But, packages like Nquery, EAST and PASS do a lot
• In some non-standard settings, simulations are
  required to determine power.
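As a minimal sketch of what "simulating to determine power" means, the Python code below estimates the power of a two-sided, two-sample z-test by repeatedly generating data and counting rejections. It assumes normal outcomes with a known common standard deviation, which is a simplification chosen for this sketch; the function names and default values are illustrative. In a setting this simple a closed-form answer exists, so it is included as a check on the simulation.

```python
import random
from statistics import NormalDist, mean


def simulated_power(n_per_arm, delta, sigma=1.0, alpha=0.05, n_sims=2000, seed=1):
    """Estimate power by simulation: the fraction of simulated trials that reject H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference in means (sigma assumed known).
    se = sigma * (2 / n_per_arm) ** 0.5
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_arm)]
        treated = [rng.gauss(delta, sigma) for _ in range(n_per_arm)]
        z = (mean(treated) - mean(control)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims


def analytic_power(n_per_arm, delta, sigma=1.0, alpha=0.05):
    """Closed-form power of the same z-test, for comparison."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta / (sigma * (2 / n_per_arm) ** 0.5)
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


print(simulated_power(64, 0.5))
print(analytic_power(64, 0.5))
```

With 64 subjects per arm and a difference of half a standard deviation, both numbers should land near 0.8. In a non-standard design only the data-generating step and the test inside the loop would change, which is why simulation remains usable when no formula is available.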
            Data management
• Analysis of clean data is easy!
• The real world: you will get messy data most of
  the time from your colleagues
• Data management tools will help you;
  – Deal with messy data
  – Set up data capture approaches for your colleagues to
    minimize messiness
• Excel, RedCap and general principles of data
  management for statistical analysis will be covered
Patient #       cycle #       total ceramide levels S1P levels C18 ceramide S1P/C18
            1             0                     743.6      197.2          9.8   20.122449
                          3                     625.6      177.9          9.9   17.969697

            2             0                  534.8      148.4             9   16.4888889
CR                        3                  461.6      182.8          10.8   16.9259259
                          5                  527.3      151.4          11.5   13.1652174

            3             0                  760.5      214.5            12        17.875

            4             0                   359       167.3           4.3   38.9069767
                          3                  375.9      125.3           4.6   27.2391304
                          5                  475.6      116.2           4.4   26.4090909

            5             0                  394.1      163.1           5.7   28.6140351

            6             0                  848.7      132.5          10.8   12.2685185
                          3                 1083.6      203.9          13.5   15.1037037

            7             0                  684.6      191.4           8.1   23.6296296

            8             0                  822.7      219.5           8.9   24.6629213

            9             0                  486.3       198            5.7   34.7368421
CR                                           581.3      186.8           9.6   19.4583333
                                             699.6       42.3          11.4   3.71052632
                                             561.7      130.4           6.7   19.4626866
                                              754       320.6          14.4   22.2638889
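One common problem in a sheet like the one above is that the patient number appears only on the first row for each patient. A minimal sketch of one way to repair that is shown here in Python (the course itself would do this in R, SAS or Stata). The rows are copied from the first two patients in the table; the column names and the function name tidy are illustrative choices, and the stray "CR" marker is ignored because the slide does not say what it means.

```python
# Rows as they appear in the sheet: (patient, cycle, total ceramide, S1P, C18 ceramide).
# A blank patient cell means "same patient as the row above".
RAW_ROWS = [
    ("1", "0", 743.6, 197.2, 9.8),
    ("", "3", 625.6, 177.9, 9.9),
    ("2", "0", 534.8, 148.4, 9.0),
    ("", "3", 461.6, 182.8, 10.8),
    ("", "5", 527.3, 151.4, 11.5),
]


def tidy(rows):
    """Carry the patient number down and recompute the S1P/C18 ratio."""
    tidy_rows = []
    current_patient = None
    for patient, cycle, total_ceramide, s1p, c18 in rows:
        if patient.strip():
            current_patient = int(patient)
        if current_patient is None:
            raise ValueError("first row has no patient number")
        tidy_rows.append(
            {
                "patient": current_patient,
                "cycle": int(cycle),
                "total_ceramide": total_ceramide,
                "s1p": s1p,
                "c18_ceramide": c18,
                # Derived column: recomputed rather than trusted from the sheet.
                "s1p_c18_ratio": s1p / c18,
            }
        )
    return tidy_rows


for row in tidy(RAW_ROWS):
    print(row)
```

Every row now carries its own patient number, so the data can be sorted, filtered or merged without losing track of which patient a measurement belongs to. Rows with no patient or cycle at all, like those at the bottom of the table, would need to go back to the investigator for clarification rather than be guessed at.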
           Before getting started…
• Types of files involved in statistical computing
   –   Data files
   –   Results files
   –   Command/batch files
   –   Function files
   –   Graphics files
   –   + more(?)
   – develop a common nomenclature for naming files and
   – Organize projects within folders
             Organization is key!
• DO NOT overwrite old files (especially data files)
• Save with a new name
   – Mousedata.xls (file sent from colleague)
   – Mousedata.clean.xls (your clean version of the data)
• Use a consistent approach, but think ahead
   – Naming files *.new.* is not a good idea. You may have
     a new ‘new’ next week
   – Numerics are good, but if you think you may need
     more than 9 versions, consider how data2 and data10
     would be alphabetized.
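The alphabetization problem is easy to demonstrate. This short Python sketch sorts file names the way a file browser typically does, as plain text, and then shows that zero-padding the version number fixes the order. The helper name versioned_name and the two-digit width are illustrative choices.

```python
def versioned_name(stem, version, ext, width=2):
    """Build a file name with a zero-padded version number, e.g. data02.csv."""
    return f"{stem}{version:0{width}d}.{ext}"


versions = (1, 2, 10)

# Plain numbering: text sorting puts data10 before data2.
unpadded = sorted(f"data{v}.csv" for v in versions)
print(unpadded)  # ['data1.csv', 'data10.csv', 'data2.csv']

# Zero-padded numbering: text order now matches numeric order.
padded = sorted(versioned_name("data", v, "csv") for v in versions)
print(padded)  # ['data01.csv', 'data02.csv', 'data10.csv']
```

Two digits cover up to 99 versions; a date-based suffix is another option when the number of versions is hard to predict.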
• For each Principal Investigator I work with, I have
  a folder
• Within the PI folder, for each project, I have a folder
• For each time I get a new dataset (or work on a
  new grant) for that project, I have a folder named
  with month and year
• Example:
      I:\MUSC Oncology\Kraft, Andrew\VelcadeTrial\May2008
      I:\MUSC Oncology\Kraft, Andrew\R01 June 2007
• Within each folder of data analysis or grant
  development calculations, I use the same naming
  conventions for files:
   – Rbatch.R: a set of R commands that implement all of the
     computation or analyses
   – Rfunctions.R: a set of R functions that are used by the
     batch file
   – I always save the original data file from the investigator
     before making any changes
   – I add ‘clean’ to the datafile name and save it as a .csv
     before use.
   – My Rbatch.R files always include a line sourcing in the
     data, including the folder where the data resides.
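The "keep the original, save a clean copy under a new name" habit can also be enforced in code. The Python sketch below is one possible way to do it, not the instructor's own workflow: the function names are hypothetical and the file contents are placeholders. The clean file name is derived from the original, and the write refuses to replace a file that already exists.

```python
import tempfile
from pathlib import Path


def clean_name(original):
    """Derive the clean file name, e.g. Mousedata.xls -> Mousedata.clean.csv."""
    original = Path(original)
    return original.with_name(f"{original.stem}.clean.csv")


def write_clean_copy(original, csv_text):
    """Write the cleaned data next to the original without overwriting anything."""
    target = clean_name(original)
    # Mode "x" raises FileExistsError if the target is already there.
    with open(target, "x", newline="") as handle:
        handle.write(csv_text)
    return target


# Demonstration in a throwaway folder; the contents are placeholders.
with tempfile.TemporaryDirectory() as folder:
    original = Path(folder) / "Mousedata.xls"
    original.write_text("raw export from colleague")

    cleaned = write_clean_copy(original, "mouse,weight\n1,20.5\n")
    cleaned_name = cleaned.name
    original_untouched = original.read_text() == "raw export from colleague"

    try:
        write_clean_copy(original, "different contents")
        second_write_blocked = False
    except FileExistsError:
        second_write_blocked = True

print(cleaned_name, original_untouched, second_write_blocked)
```

Failing loudly on an existing file forces a deliberate choice of a new name, in the spirit of the "DO NOT overwrite old files" rule.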
   Friends in Statistical Computing
1. Google is your friend
2. ‘Help’ functions and ‘see also’ links are your
   friends
3. ‘examples’ are your friends
4. Your fellow students are your friends

Friends help friends figure out statistical
computing
                  Using your noggin
• Example 1:
   – SPSS is not included in this curriculum.
   – Does that mean you cannot use it? NO!
   – Will you be able to learn it better and faster after having taken this
     course? YES!
• Example 2:
   – We will probably not cover the R package nnc (Nearest Neighbor
   – Does that mean you need to find someone to teach it to you? NO!
   – Will you be able to teach it to yourself? YES!
• Example 3:
   – None of your instructors are computer scientists (except maybe Annie
   – Does this mean that they are not qualified to teach you? NO!
   – Most of them are self-taught with regards to these techniques
       Final Thoughts for Today


• Next up: Jody Ciolino with an intro to SAS!
• Background info on R, SAS, Stata, Latex and
  Sweave was all pilfered from Wikipedia.
