A Short Introduction to the "R" Programming Language for Biologists
R is free (GNU open source) software for statistical computing, created by Ross
Ihaka and Robert Gentleman. It has several important features that make it especially
useful for biologists interested in genomics. R contains many pre-built general statistical
functions as well as a large collection of modules specifically designed to analyze
genomic data (Bioconductor). Also, R can handle extremely large tables of numbers
(arrays) such as the tens of thousands of probes on a gene expression microarray (or
genome tiling array) and the millions of SNPs on the latest genotyping platforms.
R contains operators that can manipulate numbers or matrices as well as a programming
language equipped with conditional expressions, loops, and input/output abilities, which
allow users to create their own scripts and functions. Using this language, the open
source community has created many useful statistical tools for a wide variety of different
purposes, including biology. Often, the most sophisticated new data analysis methods are
first publicly released as R modules.
R is typically used (by statisticians) as a tool for direct data analysis. It is controlled by
directly typing text commands (a "command line interface"), but it is capable of
producing very elaborate publication-quality graphical output. R is available for
Windows, Macintosh, and Linux computers. It can be used on powerful servers and
clusters when substantial computing power is required, but most people will find it more convenient to
install it on their own personal computer or laptop.
So, take a moment and download and install R on your computer now (pick a local site):
That was easy.
Now start up the R application (it may have created a desktop icon during install, or
might be located in your "Programs" or "Applications" directory). You should get an "R
Console" window on the desktop with some "Welcome to R" text followed by the nice
friendly > sign, indicating that R is ready to execute your every command. Type each
command exactly as shown below (after the > prompt) followed by a carriage return (or "Enter"
key). Results returned by R are shown on the line below each command.
> 1+1 # calculate directly
[1] 2
> x=2+2 # or put the value into a variable
> y=((3 / 2)^2 + 2) * pi
> Z <- x+y
> z # case sensitive: z is not the same as Z
Error: object 'z' not found
> dog <- Spot # text needs quotes
Error: object 'Spot' not found
> dog <- "Lassie"
So, from this little exercise we learn that R can be used as a calculator for both simple and
complex math; it can store values in a variable, which are returned when you enter the
variable name; variables can be used in mathematical operations; and variable names are
case sensitive (even on a Windows computer). The assignment operator "<-" is
traditionally used instead of the equals sign "=" in R to put values into variables, but you
can do it either way. Variables can also hold text (strings), which are specified with quotation
marks.
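The points above can be tried in script form as well. The following is only a small sketch; the variable name total is made up for this example and is not part of the exercise above.

```r
# Store values with either assignment operator, then reuse them by name.
x <- 2 + 2                     # traditional arrow assignment
y = ((3 / 2)^2 + 2) * pi       # the equals sign works too
total <- x + y                 # 'total' is a hypothetical name for this sketch

print(x)                       # entering or printing a name returns its value
print(total)

dog <- "Lassie"                # text (a string) must be quoted
print(class(x))                # "numeric"
print(class(dog))              # "character"
```

The class ( ) function used at the end simply reports what kind of value a variable holds, which is handy when a command complains about the wrong type.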
In addition to variables that hold a single value, R has a number of more complex data
types. A vector holds many different values in a specific order. The elements of a vector
must all be of the same type, i.e. all numbers, all strings, or all logical T/F values.
Elements of a vector can be manipulated by an index number, indicated by square
brackets: x[i]. The index number can be used to query a value from a vector or to write a
value into the vector. The command length ( ) reports how many elements are in a vector.
Note that a series is indicated by two numbers separated by a colon, so 1:10 gives you
the integers from one to ten. Math and statistical operations can be done on vectors. Try
the following commands:
> x = c(1,2,4,66,8,4) # concatenate these elements into x
> x[3] = 11 # put 11 into the third element of x
> x[3:5] # show elements 3-5 from x
[1] 11 66  8
> length (x) # how many elements in x?
[1] 6
> sum(x) # add up all elements in x
[1] 92
> mean(x) # average of all elements in x
[1] 15.33333
> y = c(8,8,3,4,1,3)
> x+y # add corresponding elements in vectors
[1]  9 10 14 70  9  7
Got all that? The letter 'c' in the first command indicates that the elements separated by
commas should be concatenated into the vector x. You can assign (or change) an element
in a vector by using its index number. The third to fifth elements of x are now 11, 66, and
8. The length ( ) of x is 6 because it has 6 elements. The functions sum ( ) and mean ( ) operate on all
the elements of vector x. Note that functions such as length ( ), sum ( ), mean ( ), etc. are
always followed by a target expression in parentheses. When you add vectors x and y, the
individual elements in corresponding positions are added.
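To see what "all of the same type" means in practice, here is a short sketch. The vector names mixed, series, and doubled are invented for this illustration.

```r
# Mixing types in c() does not fail; R quietly converts everything to one type.
mixed <- c(1, "two", TRUE)     # hypothetical example vector
print(class(mixed))            # "character": the number and the logical became text

# A colon series, written to and read from by index number.
series <- 1:10
series[2] <- 20L               # overwrite the second element
doubled <- series * 2          # math is applied to every element at once
print(doubled[1:3])
print(length(series))
```

The silent conversion is worth remembering: one stray text value in a column of numbers turns the whole vector into strings, and then sum ( ) or mean ( ) will refuse to work on it.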
A matrix is a table that holds both rows and columns of values: y[i,j]. An array holds
values in as many dimensions as you wish: z[i,j,k,l,m]. Try the following commands:
> mat1 = matrix(1:12, nrow=2, ncol=6, byrow=TRUE)
> mat1
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12
> length(mat1) # how many elements in mat1?
[1] 12
> dim(mat1) # dimensions of mat1
[1] 2 6
> mat2 = rbind(x,y) # put x and y into mat2 by rows
> mat2
  [,1] [,2] [,3] [,4] [,5] [,6]
x    1    2   11   66    8    4
y    8    8    3    4    1    3
> mat1 * mat2 # multiply corresponding matrix elements
  [,1] [,2] [,3] [,4] [,5] [,6]
x    1    4   33  264   40   24
y   56   64   27   40   11   36
The matrix mat1 contains the numbers from 1 to 12 in two rows and six columns, entered by row.
The length of a matrix is equal to the total number of elements it contains; the dimensions
are the number of rows and columns. The matrix mat2 contains vectors x and y, entered by row. The
* operator multiplies two matrices of the same dimensions element by element, in corresponding positions
(true matrix multiplication uses the %*% operator instead).
R can do math on very large matrices very quickly.
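The difference between the two kinds of multiplication, and the [i,j] style of indexing, can be sketched as follows. The names elementwise, product, and arr are made up for this example, and the array contents are arbitrary.

```r
x <- c(1, 2, 11, 66, 8, 4)
y <- c(8, 8, 3, 4, 1, 3)
mat1 <- matrix(1:12, nrow = 2, ncol = 6, byrow = TRUE)
mat2 <- rbind(x, y)

print(mat1[2, 3])               # one value: row 2, column 3
print(mat1[, 3])                # a whole column: leave the row index empty

elementwise <- mat1 * mat2      # position-by-position product, still 2 x 6
product <- mat1 %*% t(mat2)     # matrix product: (2 x 6) times (6 x 2) gives 2 x 2
print(dim(elementwise))
print(dim(product))

# An array with three dimensions (hypothetical contents 1 to 24)
arr <- array(1:24, dim = c(2, 3, 4))
print(arr[2, 3, 4])             # one value, addressed as z[i,j,k]
```

Note that t ( ) transposes mat2 so that the inner dimensions agree; without it, %*% would stop with a "non-conformable arguments" error.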
A data frame is a table whose columns may hold values of different types (like a
spreadsheet with some columns that contain text and other columns with numbers). Each
column of the data frame is a vector, and all columns must be the same length. Data
frames are often used to hold genomic information. To create a data frame, first create
three vectors, using some genome size data from www.ornl.gov:
> organism <- c("Human","Mouse","Fruit Fly", "Roundworm","Yeast")
> genomeSizeBP <- c(3000000000,3000000000,135600000,97000000,12100000)
> GeneCount <- c(30000, 30000, 13061, 19099, 6034)
Now, join these vectors into a data frame using the function data.frame ( ). Note that the
format here is "column name" = "vector name." Then to work with a specific column of
data within the data frame, use the "$" operator with the column name. A single value
within a column is addressed with an index number in square brackets. You can add an
additional column to an existing data frame with cbind (or add another row with rbind).
> compGenomes <- data.frame(organism=organism,
+ genomeSizeBP=genomeSizeBP, GeneCount=GeneCount)
> compGenomes
   organism genomeSizeBP GeneCount
1     Human    3.000e+09     30000
2     Mouse    3.000e+09     30000
3 Fruit Fly    1.356e+08     13061
4 Roundworm    9.700e+07     19099
5     Yeast    1.210e+07      6034
> GeneDensity = compGenomes$genomeSizeBP/compGenomes$GeneCount
> compGenomes <- cbind(compGenomes, GeneDensity)
> compGenomes
   organism genomeSizeBP GeneCount GeneDensity
1     Human    3.000e+09     30000  100000.000
2     Mouse    3.000e+09     30000  100000.000
3 Fruit Fly    1.356e+08     13061   10382.053
4 Roundworm    9.700e+07     19099    5078.800
5     Yeast    1.210e+07      6034    2005.303
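Pulling single values out of a data frame, and adding a row, might look something like the sketch below. The row added here (newRow, with the organism "Example") is a placeholder with made-up numbers, not real genome data.

```r
organism <- c("Human", "Mouse", "Fruit Fly", "Roundworm", "Yeast")
genomeSizeBP <- c(3000000000, 3000000000, 135600000, 97000000, 12100000)
GeneCount <- c(30000, 30000, 13061, 19099, 6034)
compGenomes <- data.frame(organism = organism,
                          genomeSizeBP = genomeSizeBP,
                          GeneCount = GeneCount)

# A single value: the $ operator picks the column, [ ] picks the element
yeastGenes <- compGenomes$GeneCount[5]
print(yeastGenes)

# The same idea with [row, column] indexing
print(compGenomes[3, "organism"])

# Add a row with rbind; the new row needs the same column names
newRow <- data.frame(organism = "Example",   # placeholder values only
                     genomeSizeBP = 1e6,
                     GeneCount = 1000)
bigger <- rbind(compGenomes, newRow)
print(nrow(bigger))

str(compGenomes)    # summary of each column and its type
```

The str ( ) command at the end is a quick way to check that each column was stored with the type you expected.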
In order to do useful bioinformatics work with R, you will need to be able to load data
from a file. R works best with tab-delimited plain text files. Without add-on packages, it DOES NOT read standard
Excel files, but it is a simple process to open a file in Excel and "Save As..." in Text (Tab
delimited) format.
The function read.table ( ) reads data from a plain text file (space or tab delimited
columns) directly into a data frame. The function read.csv ( ) reads a comma delimited file (an
option used by many database programs). Before R can read a file, you have to guide it to
the file: use the "Change Working Directory…" command from the "Misc" menu and
navigate to the folder that holds the data file. The read.table function takes many
options, but the simplest ones are the filename and "header=TRUE" to read the first line
of the file as a header which contains the column names. It is necessary to assign a
variable name to the data frame created by read.table. Copy the compGenomes data
above into a text file, then the command to import the file would look like this:
> compGenomes2 <- read.table("compGenomes.txt", header=TRUE)
Once your data analysis is done, write (for a single vector) or write.table (for a data
frame) will save output to a file:
> write (myResults, file = "myresults.txt", sep = "\t")
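A complete write-then-read round trip can be sketched as below. This is an illustration under some assumptions: it uses a temporary file (outFile, a name invented here) so that no working directory has to be set, and it keeps only two of the columns to stay short.

```r
compGenomes <- data.frame(
  organism = c("Human", "Mouse", "Fruit Fly", "Roundworm", "Yeast"),
  GeneCount = c(30000, 30000, 13061, 19099, 6034)
)

# Save the data frame as a tab-delimited text file
outFile <- tempfile(fileext = ".txt")   # hypothetical location for this sketch
write.table(compGenomes, file = outFile, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Read it back. sep = "\t" matters here: "Fruit Fly" contains a space, so
# splitting on any white space (the read.table default) would break that
# row into too many fields.
compGenomes2 <- read.table(outFile, header = TRUE, sep = "\t")
print(compGenomes2)
```

For your own files, replace outFile with the file name in quotes, after setting the working directory (the setwd ( ) command does the same job as the menu item).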
In order to be a functional computer language, R must be able to make IF decisions based
on mathematical or logical criteria. The IF command is implicit when any of the
comparison operators are used (>, <, >=, <=, ==, !=), which can be combined with the
logical operators AND, OR, NOT (&, |, !). A comparison can be used as part of an
expression to extract a subset of values from a matrix. The function which ( ) returns the
index numbers (rather than the values) from a vector or matrix that satisfy a condition.
> a = 7 # example value (assumed here so the comparisons can be run)
> a>1 # is a greater than 1?
[1] TRUE
> a == 7 | a !=4 # is a equal to 7? OR is a not equal to 4?
[1] TRUE
> a < 9 & a ==2 # is a less than 9? AND is a equal to 2?
[1] FALSE
> plus5 <- mat1[mat1 > 5] # put into plus5 all values from mat1 that are greater than 5
> plus5 # note that plus5 is a vector, not a matrix
[1]  7  8  9 10 11  6 12
> x = c(1,2,11,66,8,4)
> listx <- which (x >= 5) # put into listx indexes for values in x >= 5
> listx
[1] 3 4 5
> x[listx] # show values from x for indexes in listx
[1] 11 66  8
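Comparisons can also drive an explicit if/else decision, for instance inside a loop. The sketch below uses a made-up cutoff (threshold) and invented labels purely for illustration; it then shows the one-line ifelse ( ) form that gives the same answer.

```r
# Hypothetical cutoff and labels, chosen only for this example
geneCounts <- c(30000, 13061, 19099, 6034)
threshold <- 15000

# Explicit if/else inside a for loop
labels <- character(length(geneCounts))
for (i in seq_along(geneCounts)) {
  if (geneCounts[i] > threshold) {
    labels[i] <- "large"
  } else {
    labels[i] <- "small"
  }
}
print(labels)

# The vectorized form does the same thing without a loop
labels2 <- ifelse(geneCounts > threshold, "large", "small")
print(identical(labels, labels2))
```

In everyday R work the vectorized form is usually preferred, since it is shorter and faster on large vectors, but the loop makes the decision at each step easier to follow.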
R includes very sophisticated graphical output functions. Try the following commands:
> barplot(x, col="blue")
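A plot can also be labeled and saved to a file instead of being drawn on screen. The following is a minimal sketch; the file location (plotFile), the bar labels, and the titles are all invented for this example.

```r
x <- c(1, 2, 11, 66, 8, 4)

plotFile <- tempfile(fileext = ".pdf")    # hypothetical output file
pdf(plotFile, width = 6, height = 4)      # open a PDF graphics device
barplot(x,
        col = "blue",                     # fill color of the bars
        names.arg = paste("item", 1:6),   # made-up label under each bar
        main = "Values in x",             # plot title
        ylab = "Value")                   # y axis label
dev.off()                                 # close the device so the file is written

print(file.exists(plotFile))
```

Everything drawn between pdf ( ) and dev.off ( ) goes into the file, so several plotting commands can be combined into one figure.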
Scripts and Programming in R
Up to this point, this tutorial has used R only as an interactive tool – by typing commands
directly into the R Console and executing them immediately with the Return/Enter key.
Like Unix and Perl, R can be used as a scripting language by simply writing a series of
commands in a text file, then executing the file. This allows more complex programs to
be built, modified, and re-used. A series of R commands can be written in any text editor
(Notepad, TextEdit, Word) or using the built-in text editor from the R application. In any
case, the script is saved as a text file with the ".R" extension, then loaded into R using the
source( ) command or as a menu item (File > Source File …) from the R Console.
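The whole cycle of writing a script and running it with source ( ) can be sketched from within R itself. In this illustration the script is written to a temporary file (scriptFile, a name invented here); normally you would type it in a text editor and save it yourself.

```r
# Create a two-line script file (normally done in a text editor)
scriptFile <- tempfile(fileext = ".R")    # hypothetical script location
writeLines(c(
  "x <- c(1, 2, 11, 66, 8, 4)",
  "total <- sum(x)"
), scriptFile)

# Run every command in the file, in order
source(scriptFile)

# Variables created by the script are now available in the session
print(total)
```

Because the commands live in a file, the same analysis can be re-run later on new data by editing one line rather than retyping everything.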
The simplest programming task in R is to define a function, which can be any
combination of existing R operators and functions. For example, the function std ( )
(standard deviation) can be defined as the square root of the variance (two existing R functions):
> std = function (x) sqrt(var(x))
> data <- c(1,3,2,4,1,4,6)
Once this function has been defined in a session (or in a script) then you can apply the
function to any appropriate object (in this case a vector or matrix of numerical values).
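Applying the function, and checking it against R's built-in sd ( ), could look like the sketch below. The second function, cv (coefficient of variation), is a hypothetical addition meant only to show a multi-line function body with an if test.

```r
std <- function(x) sqrt(var(x))
data <- c(1, 3, 2, 4, 1, 4, 6)

print(std(data))
print(sd(data))                          # R's built-in standard deviation
print(all.equal(std(data), sd(data)))    # the two should agree

# A longer function uses braces { } around its body.
# 'cv' is a made-up example: standard deviation divided by the mean.
cv <- function(x) {
  if (length(x) < 2) {
    return(NA)                           # not enough values to compute a variance
  }
  std(x) / mean(x)                       # the last expression is the returned value
}
print(cv(data))
```

Note that cv ( ) calls std ( ), so functions you define can be used as building blocks for other functions in the same session or script.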