Spatial Statistics and Spatial Knowledge Discovery


First law of geography [Tobler]: Everything is related to everything else, but
near things are more related than distant things.

Drowning in Data yet Starving for Knowledge [Naisbitt -Rogers]




        Lecture 3 : More Basic Statistics
                     with R
                  Pat Browne
       Population & Sample
• Statistics often involves selecting a
  random (or representative) subset of a
  population called a sample.
        Degrees of Freedom (df)
• Suppose five numbers must have a given mean. We have
  total freedom in selecting the first four numbers, but
  no choice in selecting the fifth. We therefore have four
  degrees of freedom when selecting five numbers. In
  general we have (n-1) DOF if we estimate the mean from
  a sample of size n.
• DOF is the sample size, n, minus the number of
  parameters, p, estimated from the data.
  Recall Permutations & Combinations
• P(n,r) = n! / (n-r)!
• Permutations (sequences) of a, b, and c taken 2
  at a time: 3!/1! = 6: <ab>,<ba>,<ac>,<ca>,<bc>,<cb>
• C(n,r) = n! / (r! (n-r)!)
• Combinations (sets) of a, b, and c taken 2 at a
  time: 3!/(2! 1!) = 3: {a,b},{a,c},{b,c}
• ab is a distinct permutation from ba, but they are
  the same combination.
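The counts above can be checked in R. This is a minimal sketch: choose, factorial and combn are base R functions, while perm is a hypothetical helper written here only for illustration.

```r
# P(n, r) = n! / (n - r)!  (perm is a hypothetical helper, not part of base R)
perm <- function(n, r) factorial(n) / factorial(n - r)

perm(3, 2)     # number of ordered selections of 2 items from 3
choose(3, 2)   # number of unordered selections of 2 items from 3

# List the combinations explicitly; each column is one set
combn(c("a", "b", "c"), 2)
```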
        Probability Calculations
• Conditional probability
• P(A|B) = P(A ∩ B)/P(B) (probability of A, given B)
• Test for independence

• P(A ∩ B) = P(A)P(B)

• Calculation of union

• P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
          Frequency Table
• One way of organizing raw data is to use a
  frequency table (or frequency distribution),
  which shows the number of times that an
  individual item occurs or the number of
  items that fall within a given range or
  interval.
            Frequency Table
#tenants        Frequency
    1               8
    2               14
    3               7
    4               12
    5               3
    6               1

[Figure: bar chart of Frequency against number of tenants; x-axis bins 1 to 6 and "More", y-axis 0 to 16]
 Histogram with class interval

TempRange         Frequency
    70                0
    75                3
    80                7
    85                7
    90                5
    95                8
    100               2
    105               0
    110               3

[Figure: histogram of Frequency against TempRange; x-axis 70 to 110, y-axis 0 to 10]
   Random variables and probability
           distributions.
• Suppose you toss a coin two times. There are
  four possible outcomes: HH, HT, TH, and TT.
  Let the variable X represent the number of
  heads that result from this experiment. The
  variable X can take on the values 0, 1, or 2. In
  this example, X is a random variable because
  its value is determined by the outcome of a
  statistical experiment.
   Random variables and probability
           distributions.
• A probability distribution is a table (or an
  equation) that links each outcome of a
  statistical experiment with its probability of
  occurrence. The table below, which associates
  each outcome (the number of heads) with its
  probability, is an example of a probability
  distribution.

Number of heads     0      1      2
Probability         0.25   0.50   0.25
                  Mean
• The arithmetic mean is the sum of the
  values in a data set divided by the number
  of elements in that data set.

$$\bar{x} = \frac{\sum x_i}{n}$$

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$   where f denotes frequency
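As an illustration of the frequency form of the mean, the sketch below applies it to the tenants frequency table from the Frequency Table slide. The variable names are chosen here for illustration; weighted.mean and rep are base R functions.

```r
tenants   <- 1:6                      # values x_i
frequency <- c(8, 14, 7, 12, 3, 1)    # frequencies f_i

# Mean of a frequency distribution: sum(f_i * x_i) / sum(f_i)
mean_freq <- sum(frequency * tenants) / sum(frequency)
mean_freq

# The same result from base R's weighted.mean
weighted.mean(tenants, frequency)

# Or by expanding the table back into raw data
mean(rep(tenants, times = frequency))
```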
 Variance & Standard Deviation
• List A: 12,10,9,9,10
• List B: 7,10,14,11,8
• The mean (x̄) of both A and B is 10, but the
  values of A are more closely clustered
  around the mean than those in B (there is
  greater dispersion or spread in B). We
  use the standard deviation to measure this
  spread.
 Variance & Standard Deviation
• The variance is never negative and is zero only
  when all values are equal.

$$\text{variance} = \frac{\sum (x_i - \bar{x})^2}{n} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n}$$

Alternatively

$$\text{variance} = \frac{x_1^2 + x_2^2 + \dots + x_n^2}{n} - \bar{x}^2 = \frac{\sum x_i^2}{n} - \bar{x}^2$$

$$\text{standard deviation} = \sqrt{\text{variance}}$$
Variance of a frequency distribution

( ) x tx ii )
 x 2 f 2 ( 2
f 2 x ) f x
1 x( )
 1
   2
      t x
         x 
         (
    f2 ...  
           
    1 
      f
    ff 
     2...
       t    
            f

Alternatively


         
  f  f 
   xf
    1
     2
     1x...    x
            x f 2
                 2
                 2   tt
                       2
                            ii
                              2
               x              2

      f
    1 2 
    f f ...
          t   
              fi
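To make the spread of List A and List B concrete, the sketch below computes the variance with divisor n, as in the formulas on these slides. The helper name pop_var is an assumption made for this example; note that base R's var() divides by n - 1 instead.

```r
a <- c(12, 10, 9, 9, 10)   # List A
b <- c(7, 10, 14, 11, 8)   # List B

# Variance with divisor n: the mean squared deviation from the mean
# (pop_var is a hypothetical helper name)
pop_var <- function(x) mean((x - mean(x))^2)

var_a <- pop_var(a)
var_b <- pop_var(b)
c(var_a = var_a, var_b = var_b)
c(sd_a = sqrt(var_a), sd_b = sqrt(var_b))

# Alternative form: mean of the squares minus the square of the mean
alt_b <- mean(b^2) - mean(b)^2
alt_b

# Base R's var() uses divisor n - 1, so it is slightly larger
var(b)
```

Both lists have mean 10, but the variance of B is five times that of A, which matches the visual impression of greater spread.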
                Median
• The median is the middle value. If the
  elements are sorted the median is:
• For odd n: Median = valueAt[(n+1)/2]
• For even n: Median = average(valueAt[n/2], valueAt[n/2+1])
                  Mode
• The mode is the class or class value which
  occurs most frequently. We can have
  bimodal or multimodal collections of data.
Trials with 2 possible outcomes.
• Outcome = success or failure
• Let p be the probability of success, then q=1-p is
  the probability of failure.
• Often we are interested in the number of
  successes without considering their order.
• The probability of exactly k successes in n
  repeated trials is:

$$b(k, n, p) = \binom{n}{k} p^k q^{n-k}$$
               Bernoulli Trials: Example
   • John hits target: p=1/4
   • John fires 6 times, n=6
   • What is the probability John hits the target at least
     once?

   Probability that John does not hit the target (no successes, all failures):

$$P(0) = \binom{6}{0} \left(\frac{1}{4}\right)^0 \left(\frac{3}{4}\right)^6 = \frac{729}{4096} \approx 0.18$$

   There is only 1 way to pick 0 from 6. 0 to the power 0 is undefined, anything else to the power of zero is 1.

   Probability that John hits the target at least once:

$$P(X > 0) = 1 - \frac{729}{4096} = \frac{3367}{4096} \approx 0.82$$
           Bernoulli Trials: Example
   • Probability that Mary hits target: p=1/4
   • Mary fires 6 times, n=6
   • What is the probability Mary hits the target more
     than 4 times?

$$P(5) + P(6) = \binom{6}{5} \left(\frac{1}{4}\right)^5 \left(\frac{3}{4}\right)^1 + \left(\frac{1}{4}\right)^6 = \frac{19}{4096} \approx 0.0046$$


This could be written in R: 6*((1/4)^5)*((3/4)^1)+(1/4)^6
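The two Bernoulli examples can also be checked with R's built-in binomial functions. This is a minimal sketch; dbinom and pbinom are base R, and the variable names are chosen here for illustration.

```r
p <- 1/4   # probability of hitting the target
n <- 6     # number of shots

# John: probability of no hits, then of at least one hit
p_none         <- dbinom(0, size = n, prob = p)
p_at_least_one <- 1 - p_none
p_at_least_one

# Mary: probability of more than 4 hits, i.e. P(5) + P(6)
p_more_than_4 <- sum(dbinom(5:6, size = n, prob = p))
p_more_than_4

# The same upper tail using the cumulative distribution function
pbinom(4, size = n, prob = p, lower.tail = FALSE)
```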
         Tossing Dice in R
• The rep function repeats a value; here it gives six
  copies of 1/6, which is the probability of a fair die
  landing on any one of its faces.
  die <- 1:6
  p.die <- rep(1/6,6)
• The total probability sums to 1.
  sum(p.die)
               Tossing Dice in R
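die <- 1:6                      # faces of the die
p.die <- rep(1/6,6)             # equal probability for each face

# Throw the die 1000 times (with replacement) and count each face
s <- table(sample(die, size=1000, prob=p.die, replace=TRUE))

# Bar chart of the counts, labelled with percentages
barX <- barplot(s, ylim=c(0,200))
lbls <- sprintf("%0.1f%%", s/sum(s)*100)
text(x=barX, y=s+10, labels=lbls)

Copy the above code and run it in R several times.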
                Tossing Dice in R
Represent the die as a vector with values 1 to 6.
> die <- 1:6
Throw the die 10 times; note the replacement.
> sample(die, size=10, prob=p.die, replace=TRUE)
 [1] 1 1 1 2 1 6 6 2 5 1
Calculate the expected value.
> sum(die*p.die)
[1] 3.5
If we sample twice we usually get distinct samples.
> sam1 <- sample(die, size=10, prob=p.die, replace=TRUE)
> sam2 <- sample(die, size=10, prob=p.die, replace=TRUE)
                  Tossing Dice in R
• R code to throw a die 1000 times and make a bar chart
  of the values.
s <- table(sample(die, size=1000, prob=p.die, replace=TRUE))
lbls <- sprintf("%0.1f%%", s/sum(s)*100)
barX <- barplot(s, ylim=c(0,200))
text(x=barX, y=s+10, labels=lbls)



Print s and sum(s).
> s
 1   2   3   4   5   6
160 155 170 173 164 178
> sum(s)
[1] 1000
                Tossing Dice in R
• Expected value of a discrete random
  variable X is the weighted average of the
  values in the range of X.
• For a die it is:
•   1*(1/6)+2*(1/6)+3*(1/6)+4*(1/6)+5*(1/6)+6*(1/6) = 3.5

• Or more simply:
• (1+2+3+4+5+6)/6 = 3.5
          Random Variable
• A random variable X on a finite sample
  space S is a function from S to the real
  numbers; its range (image) is a set S’.
• Let S be the sample space of outcomes from
  tossing two coins. Then mapping a is:
• S={HH,HT,TH,TT} (assume HT≠TH)
• Xa(HH)=1, Xa(HT)=2, Xa(TH)=3, Xa(TT)=4
• The range (image) of Xa is:
• S’={1,2,3,4}
          Random Variable
• Let S be sample space of outcomes from
  tossing two coins, where we are interested
  in the number of heads. Mapping b is:
• S={HH,HT,TH,TT}
• Xb(HH)=2, Xb(HT)=1, Xb(TH)=1, Xb(TT)=0
• The range (image) of Xb is:
• S’’={0,1,2}
          Random Variable
• A random variable is a function that maps
  a finite sample space to numeric
  values. The set of values forms a finite
  probability space of real numbers, where
  probabilities are assigned to the new
  space according to the following rule:
  pi = P(xi) = sum of probabilities of points
  in S whose image is xi.
           Random Variable
• The function assigning pi to xi can be
  given as a table called the distribution of
  the random variable.
• In an equiprobable space:
  pi = P(xi) = (number of points in S whose image is xi) / (number of points in S)
  (i = 1,2,3...n) gives the distribution of X.
          Random Variable
• The equiprobable space generated by
  tossing a pair of fair dice consists of 36
  ordered pairs:
• S={(1,1),(1,2),(1,3)...(6,6)}
• Let X be the random variable which
  assigns to each element of S the sum of
  the two integers: 2,3,4,5,6,7,8,9,10,11,12
                Random Variable
   • Continuing with the sum of the two dice.
   • There is only one point whose image is 2, giving
     P(2)=1/36.
   • There are two points whose image is 3, giving
     P(3)=2/36. (<1,2>≠<2,1>, but their sums are equal)
   • Below is the distribution of X.

xi   2     3     4     5     6     7     8     9     10    11    12
pi   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36    (sum = 36/36)
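The distribution of the sum of two dice can be reproduced by enumerating all 36 ordered pairs. This is a minimal sketch using base R's outer and table; the variable names are chosen for illustration.

```r
die  <- 1:6
sums <- outer(die, die, "+")          # 6 x 6 matrix: sum for every ordered pair

# Count how many points of S map to each sum, then divide by 36
dist <- table(sums) / length(sums)
dist

sum(dist)                             # the probabilities sum to 1
```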
     Example: Random Variable
 • A box contains 9 good items and 3 defective items
   (total 12 items). Three items are selected at
   random from the box. Let X be the random variable
   that counts the number of defective items in a
   sample. X can have values 0-3.

$$p_i = \frac{\binom{3}{x_i} \binom{9}{3 - x_i}}{\binom{12}{3}}$$

 • Below is the distribution of X.
xi       0         1         2        3
pi       84/220    108/220   27/220   1/220     (sum = 220/220)
     Example: Random Variable


• There are choose(12,3) = 220 different samples of size 3.
• There are choose(9,3) = 84 samples of size 3
  with 0 defective.
• There are choose(9,2)*3 = 108 samples of size 3
  with 1 defective.
• There are choose(3,2)*9 = 27 samples of size 3
  with 2 defective.
• There is 1 sample of size 3 with 3 defective.
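The counts for the box of items can be verified with choose, and the resulting probabilities compared with R's hypergeometric density dhyper. This is a sketch under the slide's setup (3 defective, 9 good, 3 drawn); the variable names are chosen for illustration.

```r
x <- 0:3                                   # possible numbers of defective items

# Ways to pick x defective from 3 and (3 - x) good from 9
counts <- choose(3, x) * choose(9, 3 - x)
counts

p <- counts / choose(12, 3)                # divide by the number of samples of size 3
p

# Base R: dhyper(x, m, n, k) with m = 3 defective, n = 9 good, k = 3 drawn
dhyper(x, m = 3, n = 9, k = 3)
```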
Functions of a Random Variable
• If X is a random variable then so is
  Y=f(X).
• P(yk) = sum of the probabilities of the xi such that
  yk=f(xi)
      Expectation and variance of a
            random variable
• Let X be a discrete random variable over sample
  space S.
• X takes values x1,x2,x3,... xt with respective
  probabilities p1,p2,p3,... pt
• An experiment which generates S is repeated n
  times and the numbers x1,x2,x3,... xt occur
  with frequency f1,f2,f3,... ft (∑fi = n)
• If n is large then one expects

$$\frac{f_1}{n} \approx p_1, \quad \frac{f_2}{n} \approx p_2, \quad \dots, \quad \frac{f_t}{n} \approx p_t$$
    Expectation of a random variable
• So $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$ becomes

$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_t x_t}{n} = \frac{f_1}{n} x_1 + \frac{f_2}{n} x_2 + \dots + \frac{f_t}{n} x_t \approx x_1 p_1 + x_2 p_2 + \dots + x_t p_t$$

• The final formula is the population mean, expectation,
  or expected value of X, denoted μ or E(X).
     Variance of a random variable
    • The variance of X is denoted as σ² or Var(X).

$$\text{variance} = \frac{f_1 (x_1 - \bar{x})^2 + f_2 (x_2 - \bar{x})^2 + \dots + f_t (x_t - \bar{x})^2}{n} = \frac{f_1}{n} (x_1 - \bar{x})^2 + \frac{f_2}{n} (x_2 - \bar{x})^2 + \dots + \frac{f_t}{n} (x_t - \bar{x})^2 \approx (x_1 - \mu)^2 p_1 + (x_2 - \mu)^2 p_2 + \dots + (x_t - \mu)^2 p_t$$

    • The standard deviation is $\sqrt{\mathrm{Var}(X)}$.
     Expected value, Variance,
        Standard Deviation
• E(X) = μ = μx = ∑xi pi
• Var(X) = σ² = σx² = ∑(xi - μ)² pi
• SD(X) = σx = sqrt(Var(X))
Relation between population and
         sample mean.
• If we select a sample of size N at random from
  a population, then it is possible to show that
  the expected value of the sample mean m
  equals the population mean μ.
• This rule differs slightly for variance. The
  expected value of the sample variance (computed
  with divisor N) is (N-1)/N times the
  population variance; the factor is almost 1 for large N.
   Example : Random Variable & Expected
                  Value
   xi         0          1          2          3
   pi         84/220     108/220    27/220     1/220


• μ is the expected number of defective items in a
  sample of size 3.
• μ = E(X) =
  0(84/220)+1(108/220)+2(27/220)+3(1/220) = 132/220 = ?
• Var(X) =
  0²(84/220)+1²(108/220)+2²(27/220)+3²(1/220) - μ² = ?
• SD(X) = sqrt(Var(X)) = ?
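The three quantities asked for on this slide can be evaluated in R from the distribution table. This is a minimal sketch; the variable names are chosen for illustration.

```r
x <- 0:3                          # number of defective items
p <- c(84, 108, 27, 1) / 220      # probabilities from the distribution table

mu <- sum(x * p)                  # E(X)
v  <- sum(x^2 * p) - mu^2         # Var(X) = E(X^2) - mu^2
s  <- sqrt(v)                     # SD(X)

c(mean = mu, variance = v, sd = s)
```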
                     Fair Game1?
• If a prime number appears on a fair die the player
  wins that value. If a non-prime appears the player
  loses that value. Is the game fair? (A game is fair if E(X)=0.)
• S={1,2,3,4,5,6}

              xi   2     3     5     -1    -4    -6
              pi   1/6   1/6   1/6   1/6   1/6   1/6

• E(X) = 2(1/6)+3(1/6)+5(1/6)+(-1)(1/6)+(-4)(1/6)+(-6)(1/6) = -1/6
• The game is not fair: it is unfavourable to the player because E(X)<0.

• Note: 1 is not prime, 2 is prime
               Fair Game2?
• A player tosses two fair coins. The player
  wins €2 if two heads occur, and wins €1 if
  one head occurs. The player loses €3 if no
  heads occur. Find the expected value of the
  game. How would you test whether or not the
  game is fair? Is the game fair? Show the
  sample space and distribution.
               Fair Game2?
• Sample Space S = {HH,HT,TH,TT} each point
  has probability ¼.
• X(HH) = 2, X(HT)=X(TH)=1, X(TT)= -3



• E(X) = 2(1/4)+1(2/4)-3(1/4) = 0.25
• Game is fair if E(X)=0
• Game favours player because E(X)>0
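Both fair-game questions reduce to the same calculation: the sum of each payoff times its probability. The sketch below applies it to the two games; the variable names are chosen for illustration.

```r
# Fair Game 1: win the value on a prime (2, 3, 5), lose it otherwise (1, 4, 6)
x1 <- c(2, 3, 5, -1, -4, -6)
p1 <- rep(1/6, 6)
e1 <- sum(x1 * p1)
e1          # negative: the game is unfavourable to the player

# Fair Game 2: two heads win 2, one head wins 1, no heads lose 3
x2 <- c(2, 1, -3)
p2 <- c(1/4, 2/4, 1/4)   # HT and TH both give one head
e2 <- sum(x2 * p2)
e2          # positive: the game favours the player
```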
         Distribution Example
• Five cards are numbered 1 to 5. Two
  cards are drawn at random. Let X denote
  the sum of the numbers drawn. Find (a)
  the distribution of X and (b) the mean,
  variance, and standard deviation.
• There are choose(5,2) = 10 ways of
  drawing two cards at random.
                 Distribution Example
• Ten equiprobable sample points with their
  corresponding X-values are

points   1,2  1,3  1,4  1,5  2,3  2,4  2,5  3,4  3,5  4,5
xi       3    4    5    6    5    6    7    7    8    9
         Distribution Example(3)
• The distribution is obtained by adding the
  probabilities of the points with the same sum:

xi   3    4    5    6    7    8    9
pi   0.1  0.1  0.2  0.2  0.2  0.1  0.1
            Distribution Example(4)
 • The distribution is:

xi     3    4    5    6    7    8    9
pi     0.1  0.1  0.2  0.2  0.2  0.1  0.1

•    The mean is: 3(0.1) + ... + 9(0.1) = 6
•    E(X²) is: 3²(0.1) + ... + 9²(0.1) = 39
•    The variance is 39 – 6² = 3
•    The SD is sqrt(3) ≈ 1.7
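The card example can be checked by enumerating the ten possible pairs. This is a minimal sketch using base R's combn; the variable names are chosen for illustration.

```r
pairs <- combn(5, 2)             # each column is one pair of cards
x     <- colSums(pairs)          # sum of the numbers on each pair

# Distribution of X: relative frequency of each sum
dist <- table(x) / ncol(pairs)
dist

vals <- as.numeric(names(dist))  # the distinct sums 3, 4, ..., 9
p    <- as.numeric(dist)

mu  <- sum(vals * p)             # mean
ex2 <- sum(vals^2 * p)           # E(X^2)
v   <- ex2 - mu^2                # variance
c(mean = mu, EX2 = ex2, variance = v, sd = sqrt(v))
```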
Identically Distributed Variables
Identically distributed random variables have the same probability distribution.
            Binomial Distribution
• A random variable Xn is defined on a sample space
  S. We count the number of successful outcomes of
  n repeated trials of a success or failure type
  experiment. The distribution of Xn is:
k       0      1                  2                    ..   n

P(k)    q^n    C(n,1) p q^(n-1)   C(n,2) p^2 q^(n-2)   ..   p^n
• Where probability of success in a trial is: p = 1 – q
         Binomial Distribution
• E(Xn ) = np
• Var(Xn)=npq
• SD(Xn)=sqrt(Var(Xn))


k       0      1                  2                    ..   n

P(k)    q^n    C(n,1) p q^(n-1)   C(n,2) p^2 q^(n-2)   ..   p^n
         Binomial Distribution
• If a fair die is tossed 180 times the expected
  number of 6’s is:
   μ=E(X)=np=180(1/6)=30

• The standard deviation is:

$$\sigma = \sqrt{npq} = \sqrt{180 \cdot \frac{1}{6} \cdot \frac{5}{6}} = \sqrt{25} = 5$$
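The binomial mean and standard deviation for the 180 tosses can be computed directly and compared with a simulation. This is a sketch: rbinom is base R, while the seed and the number of simulated experiments are arbitrary choices made for this example.

```r
n <- 180      # number of tosses
p <- 1/6      # probability of a six
q <- 1 - p

mu    <- n * p             # expected number of sixes
sigma <- sqrt(n * p * q)   # standard deviation
c(mean = mu, sd = sigma)

# Simulate 10000 experiments of 180 tosses each (seed chosen arbitrarily)
set.seed(1)
sixes <- rbinom(10000, size = n, prob = p)
c(sim_mean = mean(sixes), sim_sd = sd(sixes))
```

The simulated values should be close to, but not exactly equal to, the theoretical ones.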
Normal Distribution
The expected value is the mean of
a sampling distribution of a statistic.
• The number of heads after a fair coin is
  tossed 6 times.
• E(X) = (0×1.5%)+(1×9.3%)+(2×23.4%)+(3×31.2%)+
         (4×23.4%)+(5×9.3%)+(6×1.5%) ≈ 3
    L7: Review: Permutations &
          Combinations
• The number of distinguishable
  permutations of the word TITLE.



• Number of 2-permutations of the word
  HOGS.
• List the 2-combinations of the word
  HOGS.
Machine Learning
Correct and Incorrect
   Interpretations
Data and a Linear Model (see Lab1)

[Figure: moving the line to get a best fit]

[Figure: changing the slope of the line to get a best fit]



   R can calculate the maximum likelihood estimate of
   the intercept and slope giving: y = 4.8 + (0.6 * x)
There are two main types of statistical variable: categorical and continuous. The type of
data determines the types of statistics and graphs that can be used.

Categorical
Nominal: Mutually exclusive categories: male/female, dead/alive, smoker/non-smoker,
bus/car/train. Tends to be unordered or have no logical hierarchy.

Ordinal: Can be ranked in a meaningful order. Distance between values is not
relevant as there is no distance information: race positions (1st, 2nd, 3rd), grouped
amounts (1-5, 6-10, 11-15 per day). Unlike nominal data, ordinal values can be
compared against each other.

Continuous
Interval: Meaningful distance information. Intervals are equidistant, e.g. Fahrenheit
scale, Celsius scale. Addition or subtraction allowed, but not multiplication or
division.

Ratio: Similar to interval data but has a true zero point: height, weight, speed, time,
Kelvin scale. Multiplication and division are allowed.
There is a hierarchy of data “quality”. Ratio is the highest level of data, nominal is
the lowest.
    Measurements, Observations,
         Variables, Values
Measurement
  - How we get our data

Observations
  - Person or thing measured (rows)

Statistical Variables
  - Characteristic being measured (columns)

Values
  - Realised measurements / datum

ID   Gender   Height (cm)
1      2        168.7
2      1        172.0
3      1        176.5
4      1        160.5
5      2        174.0
6      1        168.6
7      2        160.0
8      2        163.0
9      1        175.0
10     2        161.4
          Descriptive Statistics
• A good statistical model should…
   - be simpler than the original data
   - make the most of the data
   - communicate accurately without distortion
• Mean is a measure of central tendency
• Median is the central value when values are sorted.
• Standard Deviation is a measure of dispersion.
• When the distribution of values is skewed, the mean can
  be an unreliable measure of central tendency, and the
  median becomes the preferred reporting method.
       Descriptive Statistics
• The mean is sensitive to sample size.
            Descriptive Statistics
[Figure: three frequency plots; x-axis: values or normalized values; y-axis: frequency]
               Descriptive Statistics
[Figure: three distribution plots; x-axis: values or normalized values]
          Normal Distribution in R
• The height of one hundred people was
  measured in centimetres, with
  mean = 170, sd=8.
• We can program this in R:
ht <- seq(150,190,0.1)
# Note type is "l" for line
plot(ht,dnorm(ht,170,8), type="l",ylab="Probability density",xlab="height")
      Normal Distribution in R
> plot(ht,pnorm(ht,170,8), type="l",ylab="Cumulative Distribution Function",xlab="height")

> plot(ht,dnorm(ht,170,8), type="l",ylab="Probability density",xlab="height")
                         Z
• What is the probability that a randomly
  selected individual will be:
  – Taller than a particular height
  – Shorter than a particular height
  – Between two heights
• We answer these questions using R's
  pnorm function. We first convert a height y
  to a z value, where z = (y - ȳ)/s, with ȳ the mean and s the standard deviation.
Z
  Standard Normal Distribution
• Find the probability that someone is less
  than 160cm
  Z = (160-170)/8 = -1.25, pnorm(-1.25) ≈ 0.106
• Find the probability that someone is
  greater than 185cm
  Z = (185-170)/8 = 1.875, 1-pnorm(1.875) ≈ 0.03
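The same probabilities can be obtained without converting to z by passing the mean and standard deviation to pnorm. This is a minimal sketch; the two heights used for the "between" case (165cm and 180cm) are hypothetical values chosen for illustration.

```r
mu <- 170   # mean height in cm
s  <- 8     # standard deviation in cm

# Shorter than 160cm: identical to pnorm(-1.25)
p_short <- pnorm(160, mean = mu, sd = s)
p_short

# Taller than 185cm: identical to 1 - pnorm(1.875)
p_tall <- 1 - pnorm(185, mean = mu, sd = s)
p_tall

# Between two heights (165 and 180 are hypothetical example values)
p_between <- pnorm(180, mean = mu, sd = s) - pnorm(165, mean = mu, sd = s)
p_between
```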
                            T-Test
• The t-test assesses whether the means of two groups are
  statistically different from each other.




• If there is less than a 5% chance (p-value<0.05) of getting differences
  at least as large as those observed when the null hypothesis (no difference
  between the group means) is true, we reject the null hypothesis and
  say we found a statistically significant difference between the two
  groups.
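A two-sample t-test can be run in R with the base function t.test. The sketch below uses simulated data: the group names, sizes, means and seed are all hypothetical values chosen for illustration, not data from the lecture.

```r
# Two simulated groups of heights (all values here are hypothetical)
set.seed(42)
group_a <- rnorm(30, mean = 170, sd = 8)
group_b <- rnorm(30, mean = 175, sd = 8)

# By default t.test performs Welch's two-sample t-test
result <- t.test(group_a, group_b)
result

# Reject the null hypothesis of equal means when the p-value is below 0.05
result$p.value < 0.05
```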
T-Test
Correlation
                    Correlation



The correlation coefficient is equal to the slope of the
regression line when both the X and Y variables have been
converted to z-scores. Where z is the standardized score: z = (x - x̄)/s
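The relationship between correlation and the slope on z-scores can be checked numerically. This is a sketch with simulated data: the seed, sample size and the coefficients used to generate y are hypothetical choices for illustration.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(1)
x <- rnorm(50)
y <- 4.8 + 0.6 * x + rnorm(50)

# Convert both variables to z-scores
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Slope of the regression of zy on zx
slope <- coef(lm(zy ~ zx))[["zx"]]

c(slope = slope, correlation = cor(x, y))
```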
        Confidence Intervals
• An upper and a lower limit around the sample mean
• Are used to generalise the mean from a
  sample to a wider population
• If a study was conducted 100 times, about 95 of the
  resulting 95% confidence intervals would contain the
  population mean
• Confidence intervals are wider if the
  sample is small and if the data is varied.
         Confidence Intervals
• A survey was conducted on the rate of work-related
  stress in a 12 month period (per 100,000
  employed).
• The mean was 780 per 100,000 employed.
• The 95% confidence limits are 700 to 860 people.
• We can be 95% confident that the population mean
  number of people who self-reported work-related
  stress in the 12 months lies between these
  values.
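A 95% confidence interval for a mean can be computed in R from the sample mean, the standard error and a t quantile. The sketch below uses a simulated sample (the data, seed and sample size are hypothetical, not the survey above) and compares the manual interval with the one reported by t.test.

```r
# Simulated sample (hypothetical values, for illustration only)
set.seed(7)
x <- rnorm(40, mean = 170, sd = 8)

n  <- length(x)
se <- sd(x) / sqrt(n)                      # standard error of the mean

# 95% confidence interval: mean +/- t quantile * standard error
ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se
ci

# t.test reports the same interval
as.numeric(t.test(x)$conf.int)
```

A smaller sample or more varied data increases the standard error and therefore widens the interval.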
Confidence Intervals
    simpleR : Using R for Introductory
       Statistics, by John Verzani
•   Univariate Data
•   Bivariate Data
•   Linear regression
•   Random Data
•   Simulations
•   Exploratory Data Analysis.
•   Confidence Interval Estimation
•   Hypothesis Testing
•   Two-sample tests
•   Regression Analysis
•   Multiple Linear Regression
•   Analysis of Variance

				