Spatial Statistics and Spatial Knowledge Discovery


First law of geography [Tobler]: Everything is related to everything else, but
near things are more related than distant things.

Drowning in Data yet Starving for Knowledge [Naisbitt -Rogers]




            Lecture 1 : Introduction to R
                    Pat Browne
  Introduction to programming in R
• R is a computer language and environment that
  allows users to program algorithms and use pre-
  written packages. R is a free software
  environment for statistical computing and
  graphics (including mapping).
• There are special R packages for handling and
  analyzing spatial data. For example, the sp
  package provides classes and methods for
  points, lines, polygons, and grids.
• R can extract spatial data from PostgreSQL.
  Also, R can be combined with SQL using PL/R.
                                 Installing R
• R for Windows can be downloaded from
  http://ftp.heanet.ie/mirrors/cran.r-project.org/bin/windows/base/R-2.12.1-win.exe

• See Lab1.doc for installation details.
                  Starting R
• We will look at the main features of R; see
  Lab1.doc for more details. This lecture also
  presents an introduction to programming.
• The basic components of current languages are:
  – Data types e.g. Integer, String, Polygon.
  – Variables to refer to data types e.g. a <- 2
  – Operations on those data types e.g. area(polygon)
  – Control structures e.g. sequence, iteration, and
    conditions.
  – Logic is an important part of programming, but it is
    often implicit and external to the language. Some
    languages like SQL are quite close to logic.
         Starting R: Variables
• Variables provide a means of accessing the data
  stored in computer memory. R provides a
  number of specialized data structures or objects
  (also called data types). These objects are
  referenced in your programs using variables.
Store: a <- 2 Access: a
Store: b <- "Pat" Access: b
• These statements assign the number 2 to the
  variable a and the string "Pat" to the variable b.
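A minimal sketch of storing and accessing values, reusing the variables a and b from this slide. The variable a.doubled is an added, hypothetical name used only to show a stored value being reused.

```r
# Store values in variables with the assignment operator <-
a <- 2
b <- "Pat"

# Access the stored values by naming the variable
print(a)
print(b)

# A stored value can be reused in a later expression
a.doubled <- a * 2
print(a.doubled)
```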
         Starting R: Data types
• A data type represents a constraint placed upon the
  interpretation of data in a type system, describing
  representation, interpretation, legal operations and
  structure of values.
• Data types are a way to limit the kind of data that can be
  used by a particular program or stored in a database
  table. Types restrict the data to a certain set of values
  (e.g. 1,2,3,..for Integers).
• Each data type also supports only certain operations
  (e.g. addition for Integers). R comes with a
  range of standard data types that can be used to
  represent strings, integers, real numbers, and dates, but
  R also has types that are especially suited to statistics
  such as vectors and tables.
           Starting R: Data types




The c() function combines its arguments into a vector.
In R the term mode is used to describe data types. There are 4 basic types or
modes: numeric, character, complex, and logical. These can be combined to
form collections, which are called objects in R.
Starting R: Data types (Objects)
Starting R: Finding data types
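One possible way to find the data type of an object is with the base functions mode() and class(). The sketch below applies them to one value of each basic mode; the variable names are illustrative assumptions, not taken from the slides.

```r
# One value for each of the four basic modes
n <- 1.4        # numeric
s <- "ABC"      # character
l <- TRUE       # logical
z <- 2 + 3i     # complex

# mode() reports the basic type of each object
modes <- c(mode(n), mode(s), mode(l), mode(z))
print(modes)

# A vector built with c() keeps the mode of its elements
v <- c(1, 2, 3)
print(mode(v))
print(class(v))
print(length(v))
```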
          Starting R: Data types
•   Numbers: 1, 1.4
•   Strings: "ABC" or "abc"
•   Vectors: ordered collections of values of one mode, e.g. c(1,2,3)
•   Arrays: vectors plus a dimension vector (dim)
•   Factors: for nominal & ordered categorical data
•   Data Frames: matrix-like, for data of different types
•   Tables
      – One Way Tables
      – Two Way Tables
    Starting R: Data types- Numbers
a <- 3
b <- sqrt(a*a+3)
• List of the defined variables:
> ls()
• We can add 1 to every element of a vector
> a <- c(1,2,3,4,5)
> a+1
• We can get the mean, variance, and standard deviation
  from a vector of numbers
> mean(a)
> var(a)
> sd(a)
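The commands above can be run as a short script. This sketch stores the intermediate results in added, hypothetical variables (a.plus.one, a.mean, a.var, a.sd) so that they can be inspected.

```r
a <- c(1, 2, 3, 4, 5)

# Arithmetic is applied element by element
a.plus.one <- a + 1
print(a.plus.one)   # 2 3 4 5 6

a.mean <- mean(a)   # 3
a.var  <- var(a)    # 2.5 (sample variance, divides by n - 1)
a.sd   <- sd(a)     # square root of the variance
print(c(a.mean, a.var, a.sd))
```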
    Starting R: Data types- Strings
> a <- "hello"
> a
[1] "hello"
> b <- c("hello","there")
> b[1]
> b[2]
  Starting R: Data types-Vector
• R operates on named data structures. The simplest such
  structure is the numeric vector, which is a single entity
  consisting of an ordered collection of numbers. To set up
  a vector named x use the R command
     > x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
     > x[2]
• The first command is an assignment statement using the function c(),
  which can take an arbitrary number of vector arguments
  and whose value is a vector obtained by concatenating its
  arguments end to end. The second command, x[2], extracts the
  second element of x.
• A number occurring by itself in an expression is taken as
  a vector of length one.
  Starting R: Data types-Arrays
• Arrays are vectors plus the dim attribute
  (dimension vector); matrices are arrays
  with a dim attribute of length 2. Array elements
  are stored in column-major order.
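A small sketch of what "vector plus dim attribute" and "column-major order" mean in practice. The variable names x and arr are assumptions made for this illustration.

```r
# Start with a plain vector and give it a dim attribute: 2 rows, 3 columns
x <- 1:6
dim(x) <- c(2, 3)
print(x)

# Column-major order: the first column is filled first (1, 2),
# then the second (3, 4), then the third (5, 6)
print(x[1, 2])      # 3

# An array with a dim attribute of length 2 is a matrix
print(is.matrix(x))

# A dim attribute of length 3 gives a three-dimensional array
arr <- array(1:24, dim = c(2, 3, 4))
print(dim(arr))
```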
 Starting R: Data types- Factor
• When looking at the impact of carbon
  dioxide on the growth rate of a tree you
  might try to observe how different trees
  grow when exposed to different preset
  concentrations of carbon dioxide. The
  concentration is then a factor, and its
  preset values are the levels of that factor.
    Starting R: Data types- Factor
• Load in the file trees91.csv.
• tree <-
    read.csv(file="C:\\yourDir\\trees91.csv",header=TRUE,sep=",",stringsAsFactors=TRUE)
• stringsAsFactors=TRUE makes text columns factors; since R 4.0.0 this is
  no longer the default.
• The summary operation prints out the possible values
  and the frequency with which they occur. Find the summary of the
  chamber identification label (CHBR)
• summary(tree$CHBR)
• Note that for a numeric column the output of the summary operation
  contains quartiles instead. A quartile is one of three points (the second
  being the median) that divide a sorted data set into four equal groups,
  each containing a quarter of the sampled values.
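The difference between summarising a factor and summarising a numeric column can be sketched without the data file. The data frame tree.demo below is a made-up stand-in (its values and the column name growth are assumptions, not the contents of trees91.csv).

```r
# Hypothetical stand-in data: a factor column and a numeric column
tree.demo <- data.frame(
  CHBR   = factor(c("A1", "A1", "A4", "A4", "A4")),
  growth = c(0.43, 0.40, 0.45, 0.82, 0.52)
)

# For a factor, summary() counts how often each level occurs
chbr.summary <- summary(tree.demo$CHBR)
print(chbr.summary)

# For a numeric vector, summary() reports minimum, quartiles, mean and maximum
growth.summary <- summary(tree.demo$growth)
print(growth.summary)
```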
 Starting R: Data types- Factor
• A nominal value is represented as a factor
  in R. The factor stores the nominal values
  as a vector of integers in the range
  [1...k] (where k is the number of
  unique values in the nominal variable),
  and an internal vector of character strings
  (the original values) mapped to these
  integers.
  Starting R: Data types- Factor
• Variable gender with 20 male entries and 30
  female entries
• gender <- c(rep("male",20), rep("female", 30))
• gender <- factor(gender)

• Stores gender as 20 2s and 30 1s, where
  1=female, 2=male internally (levels are ordered alphabetically)
• R now treats gender as a nominal variable
• summary(gender)
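The internal integer coding can be inspected directly. This is a sketch using the gender variable from the slide; codes is an added, hypothetical variable name.

```r
gender <- c(rep("male", 20), rep("female", 30))
gender <- factor(gender)

# Levels are sorted alphabetically, so "female" is level 1 and "male" is level 2
print(levels(gender))

# as.integer() exposes the internal integer codes
codes <- as.integer(gender)
print(table(codes))

# summary() of a factor counts the entries per level
print(summary(gender))
```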
  Starting R: Data types- Factor
• An ordered factor is used to represent an ordinal
  variable. Consider a variable rating coded as
  large, medium, small
rating <- c(rep("large",10), rep("medium", 10),rep("small", 10) )
rating <- ordered(rating)

• R codes rating to 1,2,3 and associates: 1=large,
  2=medium, 3=small internally
• R treats factors as nominal variables and
  ordered factors as ordinal variables in statistical
  procedures and graphical analyses.
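Because the default coding is alphabetical, large < medium < small is the order R uses here. If the intended order is by size, the levels argument of ordered() can state it explicitly. The sketch below contrasts the two; default.order and size.order are hypothetical names.

```r
rating <- c(rep("large", 10), rep("medium", 10), rep("small", 10))

# Default: levels are ordered alphabetically
default.order <- ordered(rating)
print(levels(default.order))

# Explicit level order, from smallest to largest
size.order <- ordered(rating, levels = c("small", "medium", "large"))
print(levels(size.order))

# Ordered factors can be compared: element 1 is "large", element 30 is "small"
print(size.order[1] > size.order[30])
```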
    Starting R: Data types- Factor
• A factor is a vector object used to specify a discrete
  classification (grouping) of the components of other
  vectors of the same length. R provides both ordered and
  unordered factors. The main application of factors is in
  model formulae. Suppose we have a sample of 30 tax accountants
  from all the states of Australia, and their state of origin is
  given by a character vector:
•   state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
    "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas", "sa",
    "nt", "wa", "vic", "qld", "nsw", "nsw", "wa", "sa", "act",
    "nsw", "vic", "vic", "act")
• A factor is created using the factor() function:
• statef <- factor(state)
• To find out the levels of a factor the function levels() can be used.
> levels(statef)
[1] "act" "nsw" "nt"  "qld" "sa"  "tas" "vic" "wa"
  Starting R: Data types- Factor
• Categorical data is often used to classify data into
  various levels or factors. For example, the smoking data
  could be part of a broader survey on student health
  issues. R has a special class for working with factors
  which is occasionally important to know as R will
  automatically adapt itself when it knows it has a factor.
> x=c("Yes","No","No","Yes","Yes")
> x
[1] "Yes" "No"  "No"  "Yes" "Yes"
> factor(x)
[1] Yes No  No  Yes Yes
Levels: No Yes
• Notice levels are printed.
 Starting R: Data types- Dataframe
• A dataframe is more general than a matrix, in
  that different columns can have different modes
  (numeric, character, factor, etc.). It is a bit like an
  SQL table.
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed")
• There are a variety of ways to identify the
  elements of a dataframe.
mydata[2:3]             # columns 2,3 of dataframe
mydata[c("ID","Color")] # columns ID,Color
mydata$ID               # column ID, accessed by name
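The access styles above can be tried in one short script. The result variables (by.position, by.name, ids, passed.rows) are hypothetical names, and the row filter on the Passed column is an added illustration.

```r
d <- c(1, 2, 3, 4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
mydata <- data.frame(d, e, f)
names(mydata) <- c("ID", "Color", "Passed")

by.position <- mydata[2:3]              # columns 2 and 3
by.name     <- mydata[c("ID", "Color")] # columns selected by name
ids         <- mydata$ID                # a single column as a vector

# Rows can be selected with a logical vector, a bit like a WHERE clause in SQL
passed.rows <- mydata[mydata$Passed, ]
print(passed.rows)
```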
  Starting R: Data types- Table
• One way tables are created with the table
  command. Its argument is a vector of
  factors, and it calculates the frequency
  with which each factor level occurs.
   Starting R: Data types- Table
> a <- factor(c("A","A","B","A","B","B","C","A","C"))
> results <- table(a)
> attributes(results)
> attributes(results)$dimnames$a
> attributes(results)$dim
> attributes(results)$class
> summary(results)
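A sketch of what the one way table for this vector contains. The proportions step and the variable name counts are added illustrations, not part of the slide.

```r
a <- factor(c("A", "A", "B", "A", "B", "B", "C", "A", "C"))
results <- table(a)

# The table holds one count per factor level
print(results)
counts <- as.vector(results)   # counts for A, B, C in level order
print(names(results))

# Dividing by the total turns counts into proportions
print(results / sum(results))
```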
  Starting R: Data types- Table
• Two way tables: giving the table command a
  second vector cross-tabulates the two. Say we
  have two questions: the first responses are
  Never, Sometimes, Always, the second are
  Yes, No, Maybe. The vectors a and b contain
  the responses for each measurement. In the vectors,
  respondents are represented by position.
  The third item in a is how the third person
  responded to the first question, and the
  third item in b is how the third person
  responded to the second question.
   Starting R: Data types- Table
> a <- c("Sometimes","Sometimes","Never","Always","Always",
         "Sometimes","Sometimes","Never")
> b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
> results <- table(a,b)

> results
           b
a           Maybe No Yes
  Always        2  0   0
  Never         0  1   1
  Sometimes     2  1   1

The table shows that two people who said Sometimes to the first question
also said Maybe to the second question.
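Individual cells of a two way table can be read by row and column name, where rows come from a (first question) and columns from b (second question). A minimal sketch; row.totals is a hypothetical name added for illustration.

```r
a <- c("Sometimes", "Sometimes", "Never", "Always", "Always",
       "Sometimes", "Sometimes", "Never")
b <- c("Maybe", "Maybe", "Yes", "Maybe", "Maybe", "No", "Yes", "No")
results <- table(a, b)

# Row = answer to the first question, column = answer to the second
print(results["Sometimes", "Maybe"])   # 2

# Row totals: how many people gave each answer to the first question
row.totals <- rowSums(results)
print(row.totals)
```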
    Starting R: Data types- Table
•   Two Way Tables

•   If you want to add rows to your table just add another vector to the argument of the table
    command. In the example below we have two questions. In the first question the responses are
    labeled "Never," "Sometimes," or "Always." In the second question the responses are labeled
    "Yes," "No," or "Maybe." The set of vectors "a," and "b," contain the response for each
    measurement. The third item in "a" is how the third person responded to the first question, and the
    third item in "b" is how the third person responded to the second question.

•   > a <- c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes","Never")
•   > b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
•   > results <- table(a,b)
•   > results
•          b
•   a       Maybe No Yes
•    Always        2 0 0
•    Never        0 1 1
•    Sometimes 2 1 1
•   >

•   The table command allows us to do a very quick calculation, and we can immediately see that two
    people who said "Maybe" to the first question also said "Sometimes" to the second question.
              Useful functions
length(object)              # number of elements or components
str(object)                 # structure of an object
class(object)               # class or type of an object
names(object)               # names
c(object, object, ...)      # combine objects into a vector
cbind(object, object, ...)  # combine objects as columns
rbind(object, object, ...)  # combine objects as rows
object                      # prints the object
ls()                        # list current objects
rm(object)                  # delete an object
newobject <- edit(object)   # edit a copy and save it as newobject
fix(object)                 # edit in place
   Starting R : Input-Output IO
• There are many ways to get data into R. We
  focus on just three:
  – Assignment
  – Reading a CSV File (writing later)
  – Loading data from PostgreSQL (later)
     Starting R : IO-Assignment
• Assignment (LHS <- RHS) allows the value of an expression on the
  RHS to be stored in a named object on the LHS. In R:
> a <- c(3,5,7,9)
• The above assignment uses the combine command (c
  means combine). This makes a vector called a. No
  output is produced yet. Now we can retrieve the contents
  of a just by typing it in.
> a
> a[3]
• The first command gives all of a; the second command gives
  the third element of a. [3] is called the index. Indexing starts
  at 1; index 0 returns an empty vector of the same type as a. Try:
> b <- c("one","two","three")
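A short sketch of indexing on the vectors a and b from this slide. The negative index and the vector of indices are added illustrations of standard R behaviour.

```r
a <- c(3, 5, 7, 9)

print(a[3])         # third element: 7
print(a[0])         # index 0 gives an empty numeric vector
print(a[c(1, 4)])   # several elements at once: 3 9
print(a[-1])        # a negative index drops an element: 5 7 9

b <- c("one", "two", "three")
print(b[2])
```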
      Starting R : IO-Assignment
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <-
   matrix(cells, nrow=2, ncol=2,
          byrow=TRUE,
          dimnames=list(rnames, cnames))



Type >attributes(mymatrix)
Type >help(array) to find more details on Arrays
         Starting R : Input: File
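• Place the file simple.csv in a directory (folder).
• Load the file into R using read.csv, storing the result in h:
> h <- read.csv(file="C:\\yourDir\\simple.csv",header=TRUE,sep=",")
• View the contents of h:
> h
• Now the contents of the file are stored in R as
  the object named h.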
• Type >names(h)
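Reading a CSV file can be tried without simple.csv by writing a small file first. In this sketch the file contents, the column names (trial, mass, velocity) and the temporary path are all made up for illustration; they are not the contents of simple.csv.

```r
# Write a small, hypothetical CSV file to a temporary location
csv.path <- tempfile(fileext = ".csv")
writeLines(c("trial,mass,velocity",
             "A,10,12",
             "A,11,14",
             "B,5,8"),
           csv.path)

# Read it back: the first line supplies the column names
h <- read.csv(file = csv.path, header = TRUE, sep = ",")
print(h)
print(names(h))
print(nrow(h))

# Remove the temporary file
unlink(csv.path)
```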
Starting R: Data types-Matrices
• All columns in a matrix must have the same data type
  (numeric, character, etc.) and the same length. The
  general format is:
mymatrix <-
matrix(vector, nrow=r, ncol=c,
         byrow=FALSE,
         dimnames=list(char_vector_rownames,
         char_vector_colnames))

• byrow=TRUE indicates that the matrix should be filled
  by rows. byrow=FALSE indicates that the matrix should
  be filled by columns (the default). dimnames provides
  optional labels for the columns and rows.
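The effect of byrow is easiest to see by filling the same values both ways. A minimal sketch; by.col and by.row are hypothetical names.

```r
# Filled by columns (the default): first column is 1, 2
by.col <- matrix(1:6, nrow = 2, ncol = 3)
print(by.col)

# Filled by rows, with row and column labels: first row is 1, 2, 3
by.row <- matrix(1:6, nrow = 2, ncol = 3,
                 byrow = TRUE,
                 dimnames = list(c("R1", "R2"), c("C1", "C2", "C3")))
print(by.row)

# The same position holds different values in the two matrices
print(by.col[1, 2])        # 3
print(by.row[1, 2])        # 2
print(by.row["R2", "C1"])  # 4
```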
  Review - vectors, lists, matrices,
            data frames
• To make vectors x, y, year, names
x <- c(2,3,7,9)
y <- c(9,7,3,2)
year <- 1990:1993
names <- c("payal", "shraddha", "kritika", "itida")
• Accessing the last element
y[length(y)]
• To make a list person
person <- list(name="payal", x=2, y=9, year=1990)
• Accessing elements of the list
person$name
person$x
   Review - vectors, lists, matrices,
             data frames
• To make a matrix, pasting together the columns year, x,
  y using column bind.
m <- cbind(year, x, y)
• To make a data frame, which is a list of vectors of the
  same length
D <- data.frame(names, year, x, y)
nrow(D)
• Accessing one of these vectors
D$names
• Accessing the last element of this vector
D$names[nrow(D)]
D$names[length(D$names)]
                      Sorting
• If the variable i is a vector of integers, then
  D[i,] picks up rows from the data frame D based on
  the values found in i. The order() function
  returns an integer vector giving the row order
  that sorts its argument.
D <- data.frame(x=c(1,2,3,1), y=c(7,19,2,2))
• Sort on x
indexes <- order(D$x)
D[indexes,]
• Print out the dataset sorted in reverse (descending) order by y
D[rev(order(D$y)),]
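The index vector returned by order() can be inspected before it is used. The sketch below also shows two standard variations that are not on the slide: the decreasing argument and sorting on a second column to break ties.

```r
D <- data.frame(x = c(1, 2, 3, 1), y = c(7, 19, 2, 2))

# order() returns row positions, not sorted values
indexes <- order(D$x)
print(indexes)                 # 1 4 2 3
sorted.x <- D[indexes, ]

# Descending sort on y using the decreasing argument
sorted.y.desc <- D[order(D$y, decreasing = TRUE), ]
print(sorted.y.desc)

# Sort on x, breaking ties in x by y
sorted.xy <- D[order(D$x, D$y), ]
print(sorted.xy)
```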
  Logical constants & variables
• TRUE and FALSE are logical constants
• T and F are logical variables
• T and F are not quite synonyms for TRUE
  and FALSE but variables that have the
  expected values by default, so they can be reassigned
• TRUE == TRUE
• T == T
• Both normally give the expected result (TRUE).
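Why T is only a variable can be sketched by reassigning it. The names t.after.assign and t.after.rm are hypothetical; the example restores the default afterwards by removing the user-defined T.

```r
# T is an ordinary variable, so it can be given another value
T <- FALSE
t.after.assign <- identical(T, TRUE)
print(t.after.assign)   # FALSE: T no longer means TRUE

# Removing the user-defined T makes the built-in value visible again
rm(T)
t.after.rm <- identical(T, TRUE)
print(t.after.rm)       # TRUE
```

For this reason TRUE and FALSE are the safer choice in scripts.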
           Missing Values : NA
• Not Available or Missing Values are represented as NA,
  which is a logical constant of length 1 that contains
  a missing value indicator.
• Examples

is.na(c(1, NA))         # FALSE TRUE
is.na(c(NA, NA))        # TRUE TRUE

is.na(paste(c(1, NA)))  # FALSE FALSE (paste converts NA to the string "NA")
xx <- c(0:4)
is.na(xx) <- c(2, 4)    # mark elements 2 and 4 as missing
xx                      # 0 NA 2 NA 4
   Writing your own functions.
• R comes with a built-in median function.
• Usage: median(x, na.rm = FALSE)
• x: an object for which a method has been
  defined, or a numeric vector containing the
  values whose median is to be computed.
• na.rm: a logical value indicating whether
  NA values should be stripped before the
  computation proceeds.
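A small sketch of the na.rm argument in use. The vector x and the result names are made up for illustration.

```r
x <- c(4, NA, 1, 3)

# With the default na.rm = FALSE a missing value makes the result NA
with.na <- median(x)
print(with.na)

# With na.rm = TRUE the NA is stripped and the median of 1, 3, 4 is returned
without.na <- median(x, na.rm = TRUE)
print(without.na)   # 3
```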
                    Control - If
> if (T) print("Hello") else print("Good Bye")
[1] "Hello"
> if (F) print("Hello") else print("Good Bye")
[1] "Good Bye"
               Control - Sequence
a <- c(1,2,3,4,5)
b <- c(2,3,4,5)
odd.even <- length(a) %% 2
if (odd.even == 0)
             (sort(a)[length(a)/2] +
              sort(a)[1 + length(a)/2])/2 else
       sort(a)[ceiling(length(a)/2)]

If we want to find the median of b we have to type the whole thing again,
including the odd/even test, which must now use the length of b.
> odd.even <- length(b) %% 2
> if (odd.even == 0) (sort(b)[length(b)/2] + sort(b)[1 + length(b)/2])/2 else
     sort(b)[ceiling(length(b)/2)]
It would be better to write a function.
        User Written - Functions
a <- c(1,2,3,4,5)
b <- c(2,3,4,5)
mymedian <- function(x){
    odd.even <- length(x) %% 2
    if (odd.even == 0)
        (sort(x)[length(x)/2] +
         sort(x)[1 + length(x)/2])/2 else
        sort(x)[ceiling(length(x)/2)]
}
Now we can call, run, execute or invoke mymedian on any vector.
> mymedian(a)
> mymedian(b)
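One way to check a user-written function is to compare it with the built-in one. This sketch repeats the definition of mymedian so that it runs on its own, then compares it with median() on the vectors a and b.

```r
# User-written median, as defined above
mymedian <- function(x) {
  odd.even <- length(x) %% 2
  if (odd.even == 0) {
    # even length: average the two middle values
    (sort(x)[length(x) / 2] + sort(x)[1 + length(x) / 2]) / 2
  } else {
    # odd length: take the middle value
    sort(x)[ceiling(length(x) / 2)]
  }
}

a <- c(1, 2, 3, 4, 5)
b <- c(2, 3, 4, 5)

print(mymedian(a))   # 3
print(mymedian(b))   # 3.5

# The results agree with the built-in median()
print(mymedian(a) == median(a))
print(mymedian(b) == median(b))
```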