R Programming for Life Sciences by yrs83496

VIEWS: 13 PAGES: 120

									R Programming for Life Scientists
         Version 2.0
       Raymond R. Balise, Ph.D.
      Health Research and Policy
              Spectrum
                   Roadmap
•   What makes R different for the rest?
•   Setting up R
•   Types of data
•   Working with collections of data
•   Importing and exporting data
•   Writing functions
•   Graphics
             When to Use R
• Shoestring budget
• Cutting edge statistics
• Developing your own or fine-tuning existing
  methods
• Local expertise
       Programming Languages
• Procedural languages
  – C, Fortran, Cobol, Basic
  – use a model where the logic flows from the top of
    the page to the bottom with calls to “goto”
    subroutines as needed
     • It is hard to encapsulate the code.
• Object oriented languages
     • C++, Visual Basic, JAVA
     • involves creating “objects” and then operating on them
       R is Object Oriented (OO)
• You create objects
     • vector of numbers, a graphic, etc.
• You call methods/functions to operate on the
  objects.
• Working with an OO language requires you to
  learn about special methods to create, access,
  modify, or destroy objects and their properties.
  – R hides these processes.
  – It helps a lot if you want to write new statistics and
    methods and is required for making new packages.
                   OO Example
• With R you write code in the editor which I will
  show you in a minute.
• You can create an object which holds a bunch of
  numbers (a vector, if you remember math)
• You can then use (aka call) a function (aka
  method) to operate on the object.
  – The summary() function
     • Create and display a numeric summary object
  – The plot() function
     • Create and display a graphic summary object
 Make the
ages object

                    Call the
                   summary
                   function.


               Call the
                plot
              function.
         But wait … there’s more!
• There is a lot of functionality built into R. It
  ships with libraries that do many different
  tasks. And you can download more.
                     Activate the map
                       datasets and
                        functions.




   Map most of the
       USA.
   But hold on…. There is MORE!
• You can add options to the function calls to
  make them do fancy things like color….
• Or you can have one function act on the
  output of another function….
• And you can save output as objects!
           Important Objects
• Vectors are lists of numbers.
• Dataframes are like database or spreadsheets.
                Everything is an Object
 Every object in R has a mode which determines how much space it uses.
       R objects can have a class which indicates what they are used for.

            Class:

    Mode:                character
character                 numeric
numeric                     logical      Holding “simple” data elements
logical                   function
function                     factor          Structures to hold data
complex                 data.frame     Columns (of same length) with data like database
raw                             list   Columns of data (of different length)
                            matrix     Grid of data (all same type)
                                 lm
                              table      Output from summary/graphic procedures
                     … lots more…
               Where to Get R
• R has two main websites. One describes the project:
               http://www.r-project.org/
• The other has most of the stuff you want to
  download:
               http://cran.r-project.org/
• Because the R project has people working all over
  the globe, the software download site is “mirrored”
  everywhere. The closest mirror is USA CA1 (aka UC
  Berkeley).
    http://cran.cnr.berkeley.edu/
• There is an R installer for all the common
  operating systems:
• cran.cnr.berkeley.edu/bin/windows/base/
• cran.cnr.berkeley.edu/bin/macosx/
• cran.cnr.berkeley.edu/bin/linux/
• Each is basically self explanatory.
         Installing on Windows
• Double click the installer and just push next
  until you get to this screen.

                               Specify that you
                               want to do
                               customized startup.

                               This will let you set
                               up R to work with
                               other programs
                               nicely.
                 Customize
• Use these options, then hit Next> a bunch.
• help.start() and push enter to start the help.
• q() and push enter to quit but don’t yet.
Use the built in
    editor.
                                GUI
                   Save or restore all
                   the objects in use.


                                 Save or reload the
                                   code from the
                                      console.


                                    Set the working
                                    directory to save
                                        objects.


                                 Keep all the text in
                                 the console for the
                                      session.
Edit existing data.
                        GUI




       Tweak the
    appearance of the
        console.
               Rprofile.site
• If you have instructions that you always want
  run when R starts up, you can include them in
  the Rprofile.site file:
                     GUI
                            Common
                           commands.




 Show the add on
packages currently
    accessible.
                Packages in R
• User-supplied packages are typically found at
  one of three places:
  – CRAN for all kinds of stuff
  – Omegahat for web-based statistics
  – Bioconductor for genomic analysis
• R packages update often.
• Your colleagues will recommend task-specific
  packages.
  – Rcmdr is my favorite.
    Use a previously
downloaded package.
 I type library(name)
        instead.
                        GUI is closest
                        USA (CA1)
                           to Stanford.




                            Choose which set of
                            packages to look at.




                        See the HUGE list of
                             packages.

     Update often!
                  GUI




This is useful.
            HTML
            help…
        This will not find
        information if you
        have not installed
        the packages.




This is useful
  but not
  Google.
     Rseek.org is Google-driven
• I highly recommend it.
Search help for the
   word "map".
                       Search for details
                      on a function if the
                                             Mac Quick Help
                       package is loaded
                      and you know the
                        functions name.
Search help for the
   word "map".

  Load the package.   Windows Quick Help
  Search help for the
   function named
       "map".
                 Mac Install
• Download and double click the dmg file.




                               Click customize and
                               make sure Tcl/Tk is
                                   checked on.
                        X11
• Some packages for R on the Mac (like Rcmdr)
  require X11 to be installed.
  – I think it is part of the standard Leopard
    installation but was an option with Tiger. If you
    need it, try to install it off of the DVD that came
    with your machine because people have reported
    using the dmg files from Apple.com.
X11 and Add-on Packages          To get add on
                               packages, use this
                                     menu.




          You can click here
          to make sure X11
               works.
    Getting or Updating Packages
• Click Get List, click the package name, be sure
  install dependencies is checked on, then click
  install.
        Instead of Point and Click
• You can also run this code to have Mac or
  Windows R download a list of packages:

usefulPackages = c("car", "foreign", "hexbin", "gdata",
   "ggplot2", "gmodels", "gplots", "Hmisc", "reshape",
   "Rcmdr")
install.packages(usefulPackages, dependencies = TRUE)

Be sure to take note of any packages that do not install.
‘marray’, ‘affy’, ‘Biobase’, ‘Rgraphviz’ were not available
             Your First Package
• I suggest you install the Rcmdr package first
  thing.
  – Use the Install packages… option on the package
    menu to download Rcmdr
• To make it available for your R session type:
                      library(Rcmdr)
  – CAPITALIZATION MATTERS!
  – The first time you run it, it will ask you if it can
    download additional packages.
              On a Mac you can
              not directly import
                  from Excel.




If you are on Windows you
 can directly import Excel.
                Hate Typing?
• Tab is your fiend. It will auto-complete if it
  can or give you a list of functions that match
  what you have typed. It woks very well on the
  Mac. In Windows sometimes you need to type
  tab twice.

• In Windows if you type tab after a ( it displays
  options for the function or they just appear in
  the Mac.
             Before Analysis
• Rcmdr makes the most commonly done
  analyses easy or if you are told the names of
  the functions to use, writing the code is
  almost tolerable.
• Data manipulation is relatively difficult in R
  compared to other analysis and data
  management tools.
• You need to know how to manipulate the data
  objects.
                  Data Set Objects
• Vectors
   – A bunch of data in a single row or column
   – All of the same type
• Matrix
   – A row and column arrangement of data
   – All of the same type
• Data frame
   – A row and column arrangement of data        Like a “good” spreadsheet
   – Columns are of different types              or relational database file

• List
   – Very free-form structure
   – A grouping of different types of data
            Types of Data Vectors
• Numeric
  – Integer, real, and complex are different types but
    you will not need to pay attention to the details
  – NA means missing
  – NAN means not a number
• String
  – Characters of the alphabet
• Logical
  – TRUE, FALSE or NA
                Making a Vector
                                             R is case sensitive.
• Make a sequence                    # means ignore the rest of the line.
                                      ; means a new command follows.
  OneToThirty = seq(1, 30)
  OneToThirty # same as print(OneToThirty)
  oneToThirty = seq(1, 30, by = 2); oneToThirty
  x1230 = 1:30                     This will be VERY useful.

  (x1to30 = 1:30)                    Surround the expression
                                         with () to display the result
                                                automatically.
         Making Vectors With c()
• c stands for concatenate
  ages = c(9, 11, 40, 41) ; ages
  stooges = c("Larry", "Moe", "Curly", "Shemp"); stooges
                 Getting Details
• You can use is functions and length to get
  details on a vector.
  is.vector(ages)
  is.numeric(ages)
  is.logical(ages)
  length(ages)
        Recycling and Vectorizing
• You can add one to all four ages.
   ages + c(1,1,1,1)
• If you provide the scalar integer, R will
  temporarily vectorize the 1 by recycling that
  value to match the length of the ages vector.
   ages + 1
• It will recycle a series also.
   ages
   ages + c(1,2)
       Naming Parts of a Vector
• You can assign names to the elements of a vector.
  This allows later access to the elements using the
  names instead of the position.
  names(ages) = stooges
  ages
• To erase them:
  names(ages) = NULL; ages
• Notice what happens when the lengths differ:
  stooges= c("Larry", "Moe", "Curly")
  names(ages) = stooges
  ages
                Attributes
• When you add names to things (objects) they
  acquire or change their “names attribute.”
  attributes(ages)
• When you strip off the names, the vector is
  left with no attributes.
  names(ages) = NULL
  attributes(ages)
                 Complex Objects
• A data frame is an object with many
  attributes.
• R ships with a lot of datasets if you want one…
  help.start()
• Click packages then datasets.
  esoph
  ?esoph
  attributes(esoph)
         Getting at Parts of a Vector
• Specify the element number.
  heyMoe = ages[2] ; heyMoe


• Specify to drop everything except the element
  number.
  ages[c(-1, -3, -4)]
• Specify a list with TRUE and FALSE
  ages[c(FALSE, TRUE, FALSE, FALSE)]
      Getting Parts with Names
  ages = c(9, 11, 40, 41) ; ages
  names(ages) = c("Larry", "Moe", "Curly", "Shemp")
  ages


• Specify the name.
  heyMoe = ages["Moe"]
                Duplicate Names
• That code only returns the first one if there
  are duplicates.
   names(ages)[4] = "Moe"
   ages
   heyMoe = ages["Moe"]
   heyMoe
• Gives all if duplicates
   names(ages) %in% "Moe"
   ages[names(ages) %in% "Moe"]
         Parts of a Data Frame
• You can select columns of a data frame just
  like you selected elements from a vector.
  booze = esoph["alcgp"]
  is.data.frame(booze)
  esoph[2]
  esoph[c(4,5)]
                   Choosing Records
• If you put a single item or series inside of the
  square brackets, R thinks you are requesting
  columns.
• If you want to get access to specific rows, you
  include a comma after the rows.
• blah[rows, columns]
  esoph[ 1 , ]
  esoph[ 1:3 , ]
     Smarter Access to a Vector
• You can use logic checks to find the record
  numbers in a vector which meet your criteria.
  ages < 21
  which(ages < 21)
• You can then subset down your data to the
  records of interest using the [ ] subset
  operator.
  ages[which(ages < 21)]
  ages[ages < 21]
  Smarter Access to a Data Frame
• You can find the records that pass a logic
  check inside a data frame.
  esoph["ncases"] > 0
  esoph$ncases > 0
  which(esoph$ncases > 0)
                        Subset a Data Frame
• Recall that you can select rows with
  frameName[rows,columns] and if you do not
  include a comma, all records are chosen.
  which(esoph$ncases > 0) gives you a list of
  records which adhere to that rule. Therefore,
  the code below gives you a subset
  esoph[ which(esoph$ncases > 0) , ]
  or
  esoph[esoph$ncases > 0 , ]
       Subset Data Frames Easily
• If that logic is rough to think about, use the
  subset function.
  subset(esoph, ncases > 0)
              Choosing Values
• If you need specific values, you can use the &
  (and) or the | (or) operators to get the
  ordered set of TRUE and FALSE values.
  ages > 21 & ages < 41
• ! means not
   !(ages > 21 & ages < 41)
• Notice that it is applying the one logic check
  to the vector of ages. How does it do that?
   Math on Data Frame Columns
• You have seen how to do scalar and vector
  algebra. Algebra on a data frame is easy.
  names(esoph)
  esoph$total=esoph$ncases + esoph$ncontrols

• To see the end of the data frame, use tail()
  tail(esoph)
      Comparing Against Vectors
• What happens when you try to compare a vector
  to a set of things?
  gender = c(NA, "Male", "Female", "Blue", "Female")
                                                This one uses
  gender == "Male" | gender == "Female"         recycling and
  gender == c("Male", "Female")                  gives wrong
                                                     answers.
• R recycles the shorter vector to be the longer
  length, then does the comparison. Use the %in%
  operator if you want to compare as if you wrote a
  series of or statements.
  gender %in% c("Male", "Female")
           Categorical Variables
• R makes a distinction between variables holding
  a bunch of characters from the alphabet and
  variables holding categorical information. If
  you have a classification/categorical variable,
  you want R to treat it as a factor or an ordered
  factor. Typical factors are treatment or gender.
  dose = c("low", "placebo", "high", "low")
  dose
  typeof(dose)
                                                         Factors
• To convert a character variable to a factor, use the
  as.factor function.         There are is. or as. predicate functions
   doseF = as.factor(dose)               to check object types or convert
                                            between types of objects.
   typeof(doseF)
   class(doseF)
• Behind the scenes, the character variable is
  converted into numbers and the numbers are
  given character strings to display.
• In modern R the levels of the factor are ordered
  alphabetically and the first one is represented
  with the digit 1, the second is 2, etc.
             Comparing Factors
• You can compare a factor vs. a constant value.
  doseF == "high"
  as.integer(doseF) == 1
• Or you can compare vs. vectors (CAREFULLY).
  doseF == c("high", "low")      Notice wrong answer thanks to recycling.
  doseF %in% c("high", "low")
• R will stop you from comparing factors that have
  different categories.
  doseF2 = as.factor(c("blah", "placebo", "high", "low"))
  doseF == doseF2
               Recoding Factors
• Often you will want to regroup factor levels.
  amount=as.factor(c("placebo", "10mg", "5mg", "10mg"))
  levels(amount)
  regroup = list(none="placebo", some=c("5mg", "10mg"))
                                         none
  levels(amount) = regroup                 placebo
  amount                                 some
                                             5mg
                                          10mg
             Numeric Factors
• If you have numeric factors, be careful
  converting from factors back to numbers.
  ID = c(1000, 1000, 1001, 2)
  IDf = factor(ID)
  as.integer(IDf)
  levels(IDf)
  numbersAgain = as.numeric(levels(IDf))[IDf]
      Recoding Values in a Vector
• R has functions like if and ifelse to process
  values.
  ifelse(ages < 21, "Young", "Old")
  ifelse(ages<21, "Young", ifelse(ages<41, "Old",
     "Fossilized") )
               Easier Recoding
• Other packages like car have functions to recode:
  library(car)
  newAge2=recode(ages, ' 1:21="Young"; else= "Old" ')
  newAge2
  detach("package:car")
        Attaching and Masking
• You can see the order in which R searches for
  things using the search() function.




                                  Remove the
                                 package from
                                the search path.
        Attaching Data Frames
• People who really hate typing attach data
  frames so they can refer to them with short
  names.
  bmi = women$weight / women$height ^2 * 703
• Instead you can attach the women data frame
  and an easier formula write:
  attach(women)
  search()
  bmi = weight / height ^2 * 703; bmi
  detach(women)
              Keeping Track
• You can see what datasets are in each of the
  work environments/packages with the list
  stuff function ls().
  rm(list=ls(all=TRUE))
  ls()
  search()
  attach(women)
  ls()
  search()
  head(women); head(height)
  .GlobalEnv
               women
                               datasets
                      height            …
                      weight        women
                                        …

Look 1st   Look 2nd        Look 3rd
  Adding a Variable & Making a DF
women$bmi = weight / height ^2 * 703; bmi
head(women)
ls()                                   The data frame
                                          with bmi
                                                           and the data
                                                          frame without
                            .GlobalEnv                         bmi
                                         women
                                                          datasets
                              women
                                                 height            …
                                                 weight        women
                                                                   …

                          Look 1st    Look 2nd        Look 3rd
           Making a Data Frame
• Frequently you will want to make data frames for
  analysis with Rcmdr. Use the data.frame()
  command:
  attach(sleep)
  pair = data.frame(extra[group=="A"], extra[group=="B"])
Using Rcmdr for a paired t-test
                    Click here.
           Loading Text Data into R
• Reading text files:
   fakeAlleles=read.table("c:\\blah\\fakeAlleles.txt",
     header=TRUE)
• See if it worked:
   fakeAlleles
   names(fakeAlleles)
   summary(fakeAlleles)
   fakeAlleles$dude = as.character(fakeAlleles$dude)
   fakeAlleles
• A better option:
   fakeAlleles = read.table("c:\\blah\\fakeAlleles.txt", header =
     TRUE, colClasses = c("character", "factor","factor"))
              Other Text Formats
• Other text reading methods:
   read.csv = coma separated values
   read.csv2 = semicolon delimited files
   read.delim = read tab delimited files
   read.fwf = read fixed width format files
• Use same options as read.table
• If the data has bad or no column headings you may
  also want to include:
   read.table( …stuff…, col.names = c("name1", "name2") )
• To prevent characters from coming in as factors:
   options(stringsAsFactors = FALSE)
                    Data Frames
• The data imported into a data frame.
   class(fakeAlleles)
• A data frame really is a list of vectors where the
  vectors are all the same length.
   as.list(fakeAlleles)
• To select a column you specify the data frame $
  variable name.
   theDudes = fakeAlleles$dude
• All the stuff you saw for logic checks on vectors
  can be used on the parts of a data frame.
   fakeAlleles$allele1 == "A"
      Subsetting Vectors (again)
• Recall that you can subset using the [ ]
  operator:
  ages = c(9, 11, 40, 41)
  heyMoe = ages[2]
  ages <21
  ages[ages < 21]
• The same voodoo works on the vectors that
  make up data frames!
  dudeTwo = fakeAlleles$dude[2]
        Subsetting Data Frames
• Parts (subsets) of data frames are referenced
  by "column numbers comma row numbers":
  – The first record: fakeAlleles[1, ]
  – The 2nd and 3rd columns: fakeAlleles[ , c(2,3)]
  – The genotype for record 6: fakeAlleles[6, c(2,3)]
• or by names:
  fakeAlleles[, c("allele1", "allele2")]
                 Named Rows
• You can name your rows also:
  fakeAlleles = read.table("c:\\blah\\fakeAlleles.txt",
    header = TRUE, colClasses = c("character",
    "factor","factor"))
  row.names(fakeAlleles) = fakeAlleles$dude
  fakeAlleles = fakeAlleles[, -1]; fakeAlleles


 fakeAlleles["006", c("allele1", "allele2")]
Getting Counts with Rcmdr
          Subsetting Using Logic
• You can use logic checks to subset:
  fakeAlleles$allele1 == "A" & fakeAlleles$allele2 =="A"
  fakeAlleles[ fakeAlleles$allele1 == "A" &
    fakeAlleles$allele2 =="A", ]
          Importing From Excel
• If you have PERL on your machine, you can
  use the read.xls() function in the gdata library
  to easily get data out of Excel and into a data
  frame.
  – Mac has PERL
  – Windows
      http://www.activestate.com/activeperl/
                 Using read.xls
• Windows:
  library(gdata)
  sleepy = read.xls("c:\\blah\\sleep.xls")
• Mac:
  library(gdata)
  read.xls("/users/balise/desktop/sleep.xls")
• It’s that easy… Behind the scenes it is converting
  the xls file into a csv so you can use the text
  importing options.
• Do summary() on the data frame and notice what
  happens to the missing value.
                     RODBC
• ODBC is a language/convention for accessing
  databases. R allows you to use ODBC
  connections to burrow directly into databases
  and other data containers like Excel.
  library(RODBC)
  channel<-odbcConnectExcel("c:\\blah\\sleep.xlsx")
  sqlQuery(channel, "select * from [Sheet1$]")
  odbcCloseAll()
                       SQL
• If you have to learn one programming
  language, learn SQL.
  – With it you can manipulate data stored in nearly
    every commercial database.
  – You can aggregate, subset and modify data.
  – It is well implemented inside of both R and SAS.
• SQL with R is nicely documented in Spector's
  (2008) Data Manipulation with R. It is a must
  own for people who want to learn R.
            Exporting Text Files
• R can write objects full of data, including data
  frames, into text files.
  – By default, it will quote the character string and fill
    in the letters NA where there were originally
    missing values.
• This code exports back to the original
  appearance.
  write.table(sleepy, file = "c:\\blah\\exported.tab",
   sep ="\t", quote = FALSE, na ="")
  Office 2007 Excel

• ODBC connection
  1.   Control Pannels,
  2.   Double click Adminstrative Tools
  3.   Double click Data Sources (ODBC)
  4.   On the USER DNS tab choose ADD
  5.   Click Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)
  6.   Give the connection a name and browse to the file.
       Jot down the name of the connection for the R code.
       Using an ODBC connection
• Once the ODBC connection is set-up use code like this:
   library(RODBC)
   connection = odbcConnect("sleepODBC")
   dataFromODBC= sqlFetch(connection, "Sheet1")
   odbcClose(connection)
              Creating Programs
• You can write line-by-line instructions in the R
  console, use the editors built into R, or use a third
  party editor (like Tinn-R for windows or JGR).
• Console
   – Type history() to see the lines you have submitted
     recently and then save to a file and re-run it later if
     needed.
• Built-in Editor
   – Mac: Click the blank page at top of the console
   – Windows: File > New Script
     Windows Tinn-R Editor
http://www.sciviews.org/Tinn-R/index.html
                 OO Programming in R
• OO programming requires
    – objects
    – classes
          • describe specific properties for groups of objects
    – inheritance
          • classes related to eachother (derived from other classes) have related
            properties
    – polymorphism
          • the same function name applied to different classes does different things
• R vs. JAVA: R typically has separate classes for actions instead of
  bundling them with the data structures
    – JAVA
          • Animal -> domesticated -> dog (walks)
    – R
          • Animal -> domesticated -> dog
          • Movement -> Walks
          Polymorphism is Fun
• plot does different things depending on the
  function arguments:
  plot(sleepy)
  plot(sleepy$extra)
  plot(sleepy$baseline, sleepy$extra, sleepy$group)
• Take a look at how it works:
  isS4(plot)
  methods(plot)
  getAnywhere(plot.factor)
                     Arguments
• Functions try to match arguments in the order they
  appear in the definitions in the help files.
   ?plot
• You can explicitly reference the arguments' full names
  and they typically allow abbreviations (but don’t
  obfuscate your code).
• The … means that other arguments are allowed and
  are passed along the class hierarchy to other methods.
  Look at plot to see this.
   – The … makes it easy to write bugged code because you if
     misspell the argument name, it is silently passed along.
               Writing Functions
• You can easily write functions, but notice that the last
  thing calculated is returned:
   MandM = function(x){mean(x); median(x)}
   MandM(sleepy$extra) # returns only the median
• Store the values you want into a list:
   MandM = function(x) {
     blah = list(theMean=0, theMedian=0)
     blah$theMean = mean(x)
     blah$theMedian = median(x)
     return (blah)
   }
   MandM(sleepy$extra)
            Other Arguments
  MandM(sleepy$baseline)
• It points out that we need to deal with missing
  values. Look up mean and median and you will
  see they allow the na.rm parameter to determine
  if missing values are dropped.
• Using MandM(sleepy$baseline, na.rm=TRUE)
  does not work because the parameter list does
  not allow it. We want to allow that parameter to
  be passed along. So rewrite the function.
              M and M Again
• Recall that an … in the argument list means
  "other stuff".
  MandM = function(x, ...) {
    blah = list(theMean=0, theMedian=0)
    blah$theMean = mean(x , ...)
    blah$theMedian = median(x, ...)
    return (blah)
  }
  MandM(sleepy$extra)
  MandM(sleepy$baseline, na.rm=TRUE)
           Appling Your Function
• R does allow you to write loops to iterate over
  records or variables but if you are not writing
  novel math functions, they can generally be
  avoided.
• R will try to vectorize and process:
  MandM(c(sleepy$baseline,sleepy$extra))
• Use sapply to apply a function to a data frame:
  sapply(sleepy, MandM, rm.na=TRUE)
           Better M and M
MandM = function(x, ...) {
  blah = list(theMean=0, theMedian=0)
  if(is.numeric(x)== TRUE) {
    blah$theMean = mean(x , ...)
    blah$theMedian = median(x, ...)
  }
  return (blah)
}
sapply(sleepy, MandM, na.rm=TRUE)
                          Yummy M and M
MandM = function(x, ...) {
  blah = list(theMean=NaN, theMedian=NaN)
  if(is.numeric(x)== TRUE) {
  blah$theMean = mean(x , ...)
  blah$theMedian = median(x, ...)
  }
  return (blah)
}
MandM(sleepy$extra)
MandM(sleepy$baseline, na.rm=TRUE)

sapply(sleepy, MandM, na.rm=TRUE)
         Writing Novel Functions
• Look hard on rseek.org before you reinvent
  the wheel.
• R syntax is very similar to C.
   – Select/Case logic is different (R short-circuits) .
• The R Book by Crawley is too big to buy for
  just this topic but it is good for syntax. Get it
  from the library and read the early chapters.
• The final chapter of Spector has a few
  wonderful pages.
# The "what the function" (wtf) function                        if(is.list(x) == TRUE) {
                                                                   thingy = cat(item,
wtf = function(x){                                                   "\n LIST \n Components:", paste(names(x), collapse=" "),
 if (missing(x)) stop("\n\tUsage: wtf(x)\n\twhere x exists")         "\n MODE:", mode(x),
                                                                     "\n CLASS:", class(x), "\n"
                                                                   )
 item = as.character(sys.call())[2]                              }
 if(is.factor(x) == TRUE) {                                      if(is.data.frame(x) == TRUE) {
   thingy = cat(item,                                              thingy = cat(item,
     "\n FACTOR \n length", length(x),                               "\n DATAFRAME \n Components:", paste(names(x),
                                                               collapse=" "),
     "\n MODE:", mode(x),                                            "\n component lenght of", length(x[[1]]), "items",
     "\n CLASS:", class(x), "\n"                                     "\n MODE:", mode(x),
   )                                                                 "\n CLASS:", class(x), "\n"
 }                                                                 )
 if(is.vector(x) == TRUE) {                                      }
                                                                 if(is.function(x) == TRUE) {
   thingy = cat(item,                                              thingy = cat(item,
     "\n VECTOR \n length", length(x),                               "\n FUNCTION \n"
     "\n MODE:", mode(x),                                          )
     "\n CLASS:", class(x), "\n"                                 }
   )                                                           }
 }
 if(is.matrix(x) == TRUE) {
   thingy = cat(item,
     "\n MAXTRIX \n dimentions", paste(dim(x),
       collapse="x"),
     "\n MODE:", mode(x),
     "\n CLASS:", class(x), "\n"

 }
   )
                                                                A Handy Function
          Destroying Efficiency
• A matrix of data is really a vector with row and
  column attributes added to it. This has profound
  speed issues if you add to the size of a matrix
  because the data has to be shifted all over the
  place.
• If you plan on writing your own functions to
  manipulate matrices, build an empty matrix of
  the maximum size (or guess bigger) rather than
  using the functions to add rows or columns.
         Writing Efficient Code
• R has decent tools for profiling code.
• The Rprof and summaryRprof functions will
  help you figure out what is bogging down your
  code.
  Rprof()
  MandM(rnorm(1000000))
  Rprof(NULL)
  summaryRprof()
                 Debugging in R
• See Chapter 9 in Gentleman's book.
• The browser() function can be put inside a function to
  pause execution and see what is going on.
• The codetools package is great for tweaking big
  functions:
   findLocals(), findGlobals(),
   – shows you if variables and functions originate inside of a
      function
   checkUsage(), and checkUsagePackage()
   – shows you what variables are modified or not touched in a
      function
              Creating Graphs
• Basic plots are easy but tweaking them for
  publications can be rough because the
  documentation on the function arguments is
  appalling.
• Data Analysis and Graphics Using R by John
  Maindonald and John Braun is extremely useful.
• There are myriad graphics built into the core of R
  plus more in the packages.
    addictedtor.free.fr/graphiques/thumbs.php
                      Test Scores
scores = read.table("c:\\blah\\walkerScores.txt", header = TRUE)
rapply(scores, class)
scores$CENTER = as.factor(scores$CENTER)
scores$PAT = as.character(scores$PAT)
rapply(scores, class)
scores$isSick = ifelse(scores$SCORE > 0, 1, 0);
library(car)
(scores$SEV = with(scores, recode(SCORE, '0 = "None" ;1:30 =
   "Mild"; 31:69 = "Moderate"; 70:100 = "Severe"; else = "BAD
   DATA"')))
(scores$SEV = factor(scores$SEV, levels = c("None", "Mild",
   "Moderate", "Severe"), ordered = TRUE));
      Common Plots are Easy
attach(scores) #to avoid typing scores$
plot(SEV, main = "MainTitle", xlab = "xlab", ylab =
  "ylab")
plot(SCORE)
hist (SCORE)
boxplot(SCORE)
boxplot(SCORE ~ SEX, ylim = c(0,100))
detach(scores)
                                               Graphics Tweaks


                                                                Density = dnorm                                                    Probability = pnorm
 mfrow is used to set
                                               0.4




                                                                                                                      1.0
 number of rows and




                                                                                                                      0.8
                                               0.3
                         Probability density




columns of graphics on



                                                                                                        Probability

                                                                                                                      0.6
                                               0.2




                                                                                                                      0.4
       a page
                                               0.1




                                                                                                                      0.2
                                                                                                                      0.0
                                               0.0




                                                     -3    -2      -1         0         1     2   3                         -3    -2        -1   0       1       2   3

                                                                              z                                                                  z



                                                                Quantiles = qnorm                                                Random numbers = rnorm
                                                                                                                      200
                                               2




                                                                                                                      150
                                               1
                         Quantile (Z)




                                                                                                        frequency

                                                                                                                      100
                                               0
                                               -1




                                                                                                                      50
                                               -2




                                                                                                                      0




                                                     0.0   0.2          0.4       0.6       0.8   1.0                       -4         -2            0       2

                                                                              p                                                                  z
   Strip Charts for Small Datasets
• par(cex = 1.5) # big font
• with(Gad, stripchart(HAMA ~ DOSEGRP, xlab =
  "HAMA", pch = 16))



                         PB
                         LO
                         HI




                              20   25      30   35

                                    HAMA
   3 Languages for the Price of 1…
• The graphics I have shown use the classic
  graphic methods.
• There are trellis plots from the lattice package
  that split the data into multiple panes
  automatically.
• ggplot2 uses a "grammar of graphics"
  approach (like SPSS).
                Don’t play with pie!
                                    The lattice package makes
                                    trellis graphics (I didn’t make
•   library(lattice)                up these names!).
•   trellis.par.set(list(fontsize=list(points=20)))
•   trellis.par.set(list(fontsize=list(text=25)))
•   dotplot(table(Gad$DOSEGRP), xlim = c(-1, 21))
             DOSEGRP


                                       PB
                        HI




                                       LO

    LO




                                       HI



                       PB                   0      5      10      15   20
                                                         Freq
                                                               8 12 16             8 12 16             8 12 16             8 12 16

                                                      EE        EE        EE        EE        EE        EE        EE        EE        EE

                                                 4




                                                 3

                            NOx (micrograms/J)
Typical lattice plot with
   banding to show
         subsets                                 2




                                                 1




                                                     8 12 16             8 12 16             8 12 16             8 12 16             8 12 16

                                                                                   Compression Ratio
trellis.par.set(list(fontsize=list(points=15)))
trellis.par.set(list(fontsize=list(text=15)))
EE <- equal.count(ethanol$E, number=9,
overlap=1/4)
## Constructing panel functions on the fly; prepanel
xyplot(NOx ~ C | EE, data = ethanol,
      prepanel = function(x, y) prepanel.loess(x, y,
span = 1),
      xlab = "Compression Ratio", ylab = "NOx
(micrograms/J)",
      panel = function(x, y) {
         panel.grid(h=-1, v= 2)
         panel.xyplot(x, y)
         panel.loess(x,y, span=1)
      },
      aspect = "xy")
 Basic plot + geometric
details + adding details
+ adding more details +
    yet more details




                           qplot(carat, price, data = diamonds, geom
                           = c("point", "smooth"))
     Use Rcmdr (R Commander)
• Rcmdr has A LOT of great graphics built into
  the point and click interface.
                    library(Rcmdr)
• Look up my short course (5 talks) covering
  basic statistics to see how to code many
  graphics.
www.stanford.edu/~balise/HowToDoBiostatistics.htm
 You are Going to Need More Help
• Data Manipulation with R by Spector.
   – A must-have book on how to read and write data with or without SQL,
     manipulate data with R, aggregate data, and reshape datasets easily.
• R Programming For Bioinformatics by Gentleman.
   – A very good intermediate level book on how R object-oriented
     programming really works.
• The R Book or Statistical Computing by Crawley.
   – These have nicely written intermediate level statistics.
   – But they are highly redundant across the two books.




                                                      Redundant
• A couple datasets used for this talk were from Glenn
  A. Walker’s Common Statistical Methods for Clinical
  Research with SAS Examples.
   – Buy Walker’s book if you do clinical research and you use
     SAS.
• Data Analysis and Graphics Using R by Maindonald and Braun
   – I constantly use this book to figure out how to do graphics.
• Using R for Introductory Statistics by Verzani.
   – This one has fantastic coverage of the R to do the common statistics.
   – It also has a nearly useless index at the end of book.
               Biostatistics
• John Fox, the guy who made Rcmdr, is an
  excellent author and he provides an R based
  supplement for his superb statitics book.
                             Spectrum
  • If you are doing biomedical research and have
    questions we are here to help.
      – Study design
      – Analysis plan
      – Power and sample size calculation
      – (Limited availability help with SAS and R code)

          med.stanford.edu/spctrm/biostatistician.html

Special thanks to 5F-X, Assemblage 23

								
To top