POSTED ON: 10/11/2012
Spatial Statistics and Spatial Knowledge Discovery

First law of geography [Tobler]: Everything is related to everything, but nearby things are more related than distant things.
Drowning in Data yet Starving for Knowledge [Naisbitt - Rogers]

Lecture 3: More Basic Statistics with R
Pat Browne

Population & Sample
• Statistics often involves selecting a random (or representative) subset of a population, called a sample.

Degrees of Freedom (df)
• Suppose five numbers must have a given mean. We have total freedom in selecting the first four numbers, but no choice in selecting the fifth. We have four degrees of freedom when selecting five numbers. In general we have (n-1) DOF if we estimate the mean from a sample of size n.
• DOF is the sample size, n, minus the number of parameters, p, estimated from the data.

Recall Permutations & Combinations
• P(n,r) = n! / (n-r)!
• Permutations (sequences) of a, b, and c taken 2 at a time: 3!/1! = 6, namely <ab>, <ba>, <ac>, <ca>, <bc>, <cb>
• C(n,r) = n! / (r! (n-r)!)
• Combinations (sets) of a, b, and c taken 2 at a time: 3!/(2! 1!) = 3, namely {a,b}, {a,c}, {b,c}
• ab is a distinct permutation from ba, but they are the same combination.

Probability Calculations
• Conditional probability
• P(A|B) = P(A ∩ B)/P(B) (probability of A, given B)
• Test for independence
• P(A ∩ B) = P(A)P(B)
• Calculation of union
• P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Frequency Table
• One way of organizing raw data is to use a frequency table (or frequency distribution), which shows the number of times that an individual item occurs or the number of items that fall within a given range or interval.

Frequency Table
# tenants   Frequency
1           8
2           14
3           7
4           12
5           3
6           1
[Figure: bar chart of frequency against number of tenants]

Histogram with class interval
TempRange   Frequency
70          0
75          3
80          7
85          7
90          5
95          8
100         2
105         0
110         3
[Figure: histogram of frequency against temperature range]

Random variables and probability distributions.
• Suppose you toss a coin two times. There are four possible outcomes: HH, HT, TH, and TT.
Let the variable X represent the number of heads that result from this experiment. The variable X can take on the values 0, 1, or 2. In this example, X is a random variable because its value is determined by the outcome of a statistical experiment.

Random variables and probability distributions.
• A probability distribution is a table (or an equation) that links each outcome of a statistical experiment with its probability of occurrence. The table below associates each outcome (the number of heads) with its probability. This is an example of a probability distribution.

Number of heads   Probability
0                 0.25
1                 0.50
2                 0.25

Mean
• The arithmetic mean is the sum of the values in a data set divided by the number of elements in that data set.
  $\bar{x} = \frac{\sum x_i}{n}$
  $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$, where f denotes frequency

Variance & Standard Deviation
• List A: 12,10,9,9,10
• List B: 7,10,14,11,8
• The mean ($\bar{x}$) of both A and B is 10, but the values of A are more closely clustered around the mean than those in B (there is greater dispersion or spread in B). We use the standard deviation to measure this spread.

Variance & Standard Deviation
• The variance is never negative and is zero only when all values are equal.
  $\text{variance} = \frac{\sum (x_i - \bar{x})^2}{n} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n}$
  Alternatively
  $\text{variance} = \frac{\sum x_i^2}{n} - \bar{x}^2 = \frac{x_1^2 + x_2^2 + \dots + x_n^2}{n} - \bar{x}^2$
  $\text{standard deviation} = \sqrt{\text{variance}}$

Variance of a frequency distribution
  $\text{variance} = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i} = \frac{f_1 (x_1 - \bar{x})^2 + f_2 (x_2 - \bar{x})^2 + \dots + f_t (x_t - \bar{x})^2}{f_1 + f_2 + \dots + f_t}$
  Alternatively
  $\text{variance} = \frac{\sum f_i x_i^2}{\sum f_i} - \bar{x}^2 = \frac{f_1 x_1^2 + f_2 x_2^2 + \dots + f_t x_t^2}{f_1 + f_2 + \dots + f_t} - \bar{x}^2$

Median
• The median is the middle value. If the elements are sorted the median is:
• Median = valueAt[(n+1)/2] for odd n
• Median = average(valueAt[n/2], valueAt[n/2+1]) for even n

Mode
• The mode is the class or class value which occurs most frequently. We can have bimodal or multimodal collections of data.

Trials with 2 possible outcomes.
• Outcome = success or failure
• Let p be the probability of success, then q=1-p is the probability of failure.
• Often we are interested in the number of successes without considering their order.
• The probability of exactly k successes in n repeated trials is:
  $b(k,n,p) = \binom{n}{k} p^k q^{n-k}$

Bernoulli Trials: Example
• John hits the target with probability p = 1/4.
• John fires 6 times, n = 6.
• What is the probability John hits the target at least once?
• Probability that John does not hit the target (no successes, all failures):
  $P(0) = \binom{6}{0} (1/4)^0 (3/4)^6 = 729/4096$
  There is only 1 way to pick 0 from 6. 0 to the power 0 is undefined; anything else to the power of zero is 1.
• Probability that John hits the target at least once:
  $P(X > 0) = 1 - 729/4096 = 3367/4096 \approx 0.82$

Bernoulli Trials: Example
• Probability that Mary hits the target: p = 1/4.
• Mary fires 6 times, n = 6.
• What is the probability Mary hits the target more than 4 times?
  $P(5) + P(6) = \binom{6}{5} (1/4)^5 (3/4)^1 + \binom{6}{6} (1/4)^6 = 19/4096 \approx 0.0046$
• This could be written in R:
  6*((1/4)^5)*((3/4)^1)+(1/4)^6

Tossing Dice in R
• The rep function generates repeats; here six copies of one sixth, which is the probability of a die landing on any one of its faces.
• die <- 1:6
• p.die <- rep(1/6,6)
• The total probability sums to 1.
• sum(p.die)

Tossing Dice in R
die <- 1:6
p.die <- rep(1/6, 6)
s <- table(sample(die, size=1000, prob=p.die, replace=TRUE))
barX <- barplot(s, ylim=c(0,200))
lbls <- sprintf("%0.1f%%", s/sum(s)*100)
text(x=barX, y=s+10, labels=lbls)
Copy the above code and run it in R several times.

Tossing Dice in R
Represent the die as a vector with values 1 to 6.
> die <- 1:6
Throw the die 10 times; note that sampling is with replacement.
> sample(die, size=10, prob=p.die, replace=TRUE)
[1] 1 1 1 2 1 6 6 2 5 1
Calculate the expected value.
> sum(die*p.die)
[1] 3.5
If we sample twice we usually get distinct samples.
> sam1 <- sample(die, size=10, prob=p.die, replace=TRUE)
> sam2 <- sample(die, size=10, prob=p.die, replace=TRUE)

Tossing Dice in R
• R code to throw 1000 dice and make a bar chart of their values.
s <- table(sample(die, size=1000, prob=p.die, replace=TRUE))
lbls <- sprintf("%0.1f%%", s/sum(s)*100)
barX <- barplot(s, ylim=c(0,200))
text(x=barX, y=s+10, labels=lbls)
Print s and sum(s).
> s
  1   2   3   4   5   6
160 155 170 173 164 178
> sum(s)
[1] 1000

Tossing Dice in R
• The expected value of a discrete random variable X is the average of the values in the range of X, weighted by their probabilities.
• For a die it is:
• 1*(1/6)+2*(1/6)+3*(1/6)+4*(1/6)+5*(1/6)+6*(1/6) = 3.5
• Or more simply:
• (1+2+3+4+5+6)/6 = 3.5

Random Variable
• A random variable X on a finite sample space S is a function from S to the real numbers R. Its range (image) is written S’.
• Let S be the sample space of outcomes from tossing two coins. Then mapping a is:
• S={HH,HT,TH,TT} (assume HT≠TH)
• Xa(HH)=1, Xa(HT)=2, Xa(TH)=3, Xa(TT)=4
• The range (image) of Xa is:
• S’={1,2,3,4}

Random Variable
• Let S be the sample space of outcomes from tossing two coins, where we are interested in the number of heads. Mapping b is:
• S={HH,HT,TH,TT}
• Xb(HH)=2, Xb(HT)=1, Xb(TH)=1, Xb(TT)=0
• The range (image) of Xb is:
• S’’={0,1,2}

Random Variable
• A random variable is a function that maps a finite sample space to numeric values. These values form a finite probability space of real numbers, where probabilities are assigned to the new space according to the following rule: pi = P(xi) = sum of the probabilities of the points in S whose image is xi.

Random Variable
• The function assigning pi to xi can be given as a table called the distribution of the random variable.
• In an equiprobable space, pi = P(xi) = (number of points in S whose image is xi) / (number of points in S), for i = 1,2,3,...,n, gives the distribution of X.

Random Variable
• The equiprobable space generated by tossing a pair of fair dice consists of 36 ordered pairs:
• S={(1,1),(1,2),(1,3)...(6,6)}
• Let X be the random variable which assigns to each element of S the sum of the two integers: 2,3,4,5,6,7,8,9,10,11,12

Random Variable
• Continuing with the sum of the two dice.
• There is only one point whose image is 2, giving P(2)=1/36.
• There are two points whose image is 3, giving P(3)=2/36. (<1,2>≠<2,1>, but their sums are equal.)
• Below is the distribution of X.

xi   2     3     4     5     6     7     8     9     10    11    12
pi   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36   (sum = 36/36)

Example: Random Variable
• A box contains 9 good items and 3 defective items (total 12 items). Three items are selected at random from the box. Let X be the random variable that counts the number of defective items in a sample. X can have values 0-3.
  $p_i = \binom{3}{x_i} \binom{9}{3 - x_i} \Big/ \binom{12}{3}$
• Below is the distribution of X.

xi   0        1         2        3
pi   84/220   108/220   27/220   1/220   (sum = 220/220)

Example: Random Variable
• There are choose(12,3) = 1320/6 = 220 different samples of size 3.
• There are choose(9,3) = 84 samples of size 3 with 0 defective.
• There are choose(9,2)*3 = 108 samples of size 3 with 1 defective.
• There are choose(3,2)*9 = 27 samples of size 3 with 2 defective.
• There is 1 sample of size 3 with 3 defective.

Functions of a Random Variable
• If X is a random variable then so is Y=f(X).
• P(yk) = sum of the probabilities of the xi such that yk=f(xi)

Expectation and variance of a random variable
• Let X be a discrete random variable over sample space S.
• X takes values x1,x2,x3,... xt with respective probabilities p1,p2,p3,... pt
• An experiment which generates S is repeated n times and the numbers x1,x2,x3,... xt occur with frequencies f1,f2,f3,... ft ($\sum f_i = n$)
• If n is large then one expects $\frac{f_1}{n} \approx p_1, \frac{f_2}{n} \approx p_2, \dots, \frac{f_t}{n} \approx p_t$

Expectation of a random variable
• So
  $\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_t x_t}{n} = \frac{f_1}{n} x_1 + \frac{f_2}{n} x_2 + \dots + \frac{f_t}{n} x_t$
  becomes
  $x_1 p_1 + x_2 p_2 + \dots + x_t p_t$
• The final formula is the population mean, expectation, or expected value of X, denoted μ or E(X).

Variance of a random variable
• The variance of X is denoted σ² or Var(X).
  $\text{variance} = \frac{f_1 (x_1 - \bar{x})^2 + f_2 (x_2 - \bar{x})^2 + \dots + f_t (x_t - \bar{x})^2}{n} = \frac{f_1}{n} (x_1 - \bar{x})^2 + \frac{f_2}{n} (x_2 - \bar{x})^2 + \dots + \frac{f_t}{n} (x_t - \bar{x})^2$
  becomes
  $\text{Var}(X) = (x_1 - \mu)^2 p_1 + (x_2 - \mu)^2 p_2 + \dots + (x_t - \mu)^2 p_t$
• The standard deviation is $\sigma = \sqrt{\text{Var}(X)}$

Expected value, Variance, Standard Deviation
• $E(X) = \mu = \mu_x = \sum x_i p_i$
• $\text{Var}(X) = \sigma^2 = \sigma_x^2 = \sum (x_i - \mu)^2 p_i$
• $SD(X) = \sigma_x = \sqrt{\text{Var}(X)}$

Relation between population and sample mean.
• If we select a sample of size N at random from a population, then it is possible to show that the expected value of the sample mean m equals the population mean μ.
• This rule differs slightly for variance. The expected value of the sample variance (computed with divisor N) is (N-1)/N times the population variance; this factor is almost 1 for large N.

Example: Random Variable & Expected Value

xi   0        1         2        3
pi   84/220   108/220   27/220   1/220

• μ is the expected number of defective items in a sample of size 3 from the box of 9 good and 3 defective items.
• μ = E(X) = 0(84/220) + 1(108/220) + 2(27/220) + 3(1/220) = 165/220 = ?
• Var(X) = 0²(84/220) + 1²(108/220) + 2²(27/220) + 3²(1/220) - μ² = ?
• SD(X) = sqrt(Var(X)) = ?

Fair Game1?
• If a prime number appears on a fair die the player wins that value. If a non-prime appears the player loses that value. Is the game fair? (A game is fair if E(X)=0.)
• S={1,2,3,4,5,6}

xi   2     3     5     -1    -4    -6
pi   1/6   1/6   1/6   1/6   1/6   1/6

• E(X) = 2(1/6)+3(1/6)+5(1/6)+(-1)(1/6)+(-4)(1/6)+(-6)(1/6) = -1/6
• Since E(X) < 0, the game is not fair; it is unfavourable to the player.
• Note: 1 is not prime, 2 is prime

Fair Game2?
• A player tosses two fair coins.
The player wins €2 if two heads occur, and wins €1 if one head occurs. The player loses €3 if no heads occur. Find the expected value of the game. How would you test whether or not the game is fair? Is the game fair? Show the sample space and distribution.

Fair Game2?
• Sample space S = {HH,HT,TH,TT}; each point has probability 1/4.
• X(HH) = 2, X(HT) = X(TH) = 1, X(TT) = -3
• E(X) = 2(1/4) + 1(2/4) - 3(1/4) = 0.25
• The game is fair if E(X)=0
• The game favours the player because E(X)>0

Distribution Example
• Five cards are numbered 1 to 5. Two cards are drawn at random. Let X denote the sum of the numbers drawn. Find (a) the distribution of X and (b) the mean, variance, and standard deviation.
• There are choose(5,2) = 10 ways of drawing two cards at random.

Distribution Example
• The ten equiprobable sample points with their corresponding X-values are:

points   1,2  1,3  1,4  1,5  2,3  2,4  2,5  3,4  3,5  4,5
xi       3    4    5    6    5    6    7    7    8    9

Distribution Example
• Collecting equal sums, the distribution is:

xi   3     4     5     6     7     8     9
pi   0.1   0.1   0.2   0.2   0.2   0.1   0.1

• The mean is: 3(0.1) + ... + 9(0.1) = 6
• E(X²) is: 3²(0.1) + ... + 9²(0.1) = 39
• The variance is 39 - 6² = 3
• The SD is sqrt(3) ≈ 1.7

Identically Distributed Variables
• Random variables with the same probability distribution.

Binomial Distribution
• A random variable Xn is defined on a sample space S. We count the number of successful outcomes of n repeated trials of a success-or-failure type experiment. The distribution of Xn is:

k      0       1                         2                           ...   n
P(k)   $q^n$   $\binom{n}{1} p q^{n-1}$   $\binom{n}{2} p^2 q^{n-2}$   ...   $p^n$

• Where the probability of success in a trial is: p = 1 - q

Binomial Distribution
• E(Xn) = np
• Var(Xn) = npq
• SD(Xn) = sqrt(Var(Xn))

Binomial Distribution
• If a fair die is tossed 180 times the expected number of 6’s is: μ = E(X) = np = 180(1/6) = 30
• The standard deviation is: $\sigma = \sqrt{npq} = \sqrt{180 (1/6)(5/6)} = \sqrt{25} = 5$

Normal Distribution
• The expected value is the mean of a sampling distribution of a statistic.
• The number of heads after a fair coin is tossed 6 times:
• E(X) = (0 x 1.5%) + (1 x 9.3%) + (2 x 23.4%) + (3 x 31.2%) + (4 x 23.4%) + (5 x 9.3%) + (6 x 1.5%) ≈ 3 (exactly np = 6(1/2) = 3; the percentages are rounded)

L7: Review: Permutations & Combinations
• The number of distinguishable permutations of the word TITLE.
• Number of 2-permutations of the word HOGS.
• List the 2-combinations of the word HOGS.

Machine Learning

Correct and Incorrect Interpretations

Data and a Linear Model (see Lab1)

Moving the line to get a best fit

Changing the slope of the line to get a best fit
• R can calculate the maximum likelihood estimate of the intercept and slope, giving: y = 4.8 + (0.6 * x)

Two types of data
• Categorical and Continuous. The type of data determines which statistics and graphs are appropriate.

Two main types of statistical variable:
Categorical
• Nominal: Mutually exclusive categories: male/female, dead/alive, smoker/non-smoker, bus/car/train. Tends to be unordered or have no logical hierarchy.
• Ordinal: Can be ranked in a meaningful order. Distance between values is not relevant as there is no distance information: race positions (1st, 2nd, 3rd), grouped amounts (1-5, 6-10, 11-15 per day). Unlike nominal data, ordinal values can be compared with each other by rank.
Continuous
• Interval: Meaningful distance information. Intervals are equidistant, e.g. Fahrenheit scale, Celsius scale. Addition and subtraction are allowed, but not multiplication or division.
• Ratio: Similar to interval data but has a true zero point: height, weight, speed, time, Kelvin scale. Multiplication and division are allowed.
• There is a hierarchy of data “quality”. Ratio is the highest level of data, nominal is the lowest.
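The four measurement levels above map fairly naturally onto R's built-in types. The following is a minimal sketch of one way each level might be represented; the values and the variable names (transport, position, temp_c, height_cm) are made up for illustration and are not taken from the lecture data.

```r
# Hypothetical example values; none of these come from the lecture data.

# Nominal: unordered categories -> factor
transport <- factor(c("bus", "car", "train", "car"))

# Ordinal: ranked categories -> ordered factor
position <- factor(c("2nd", "1st", "3rd"),
                   levels = c("1st", "2nd", "3rd"),
                   ordered = TRUE)

# Interval and ratio: numeric vectors
temp_c    <- c(12.5, 18.0, 21.5)   # interval: differences are meaningful
height_cm <- c(150, 165, 180)      # ratio: true zero, so ratios are meaningful

# Ordered factors support rank comparisons; plain factors do not.
position[1] < position[3]    # does "2nd" rank before "3rd"?
is.ordered(transport)        # a nominal factor carries no order

# Counting is the natural summary for categorical data;
# differences and ratios make sense for continuous data.
table(transport)
diff(temp_c)
height_cm[3] / height_cm[1]
```

Keeping ordinal data as an ordered factor rather than as plain numbers is one way to stop R from silently computing means or differences that the measurement level does not support.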
Measurements, Observations, Variables, Values
• Measurement: how we get our data
• Observations: person or thing measured (rows)
• Statistical Variables: characteristic being measured (columns)
• Values: realised measurements / datum

ID   Gender   Height (cm)
1    2        168.7
2    1        172.0
3    1        176.5
4    1        160.5
5    2        174.0
6    1        168.6
7    2        160.0
8    2        163.0
9    1        175.0
10   2        161.4

Descriptive Statistics
• A good statistical model should:
  - be simpler than the original data
  - make the most of the data
  - communicate accurately without distortion
• Mean is a measure of central tendency.
• Median is the central value when values are sorted.
• Standard Deviation is a measure of dispersion.
• When the distribution of values is skewed, the mean can be an unreliable measure of central tendency, and the median becomes the preferred reporting method.

Descriptive Statistics
• The mean is sensitive to sample size.

Descriptive Statistics
[Figure: three frequency plots; x-axis: values or normalized values]

Descriptive Statistics
[Figure: three distribution plots; x-axis: values or normalized values]

Normal Distribution in R
• The height of one hundred people was measured in centimetres, with mean = 170, sd = 8.
• We can program this in R:
ht <- seq(150, 190, 0.1)
# Note: type is "l" for line
plot(ht, dnorm(ht, 170, 8), type="l", ylab="Probability density", xlab="height")

Normal Distribution in R
> plot(ht, pnorm(ht, 170, 8), type="l", ylab="Cumulative Distribution Function", xlab="height")
> plot(ht, dnorm(ht, 170, 8), type="l", ylab="Probability density", xlab="height")

Z
• What is the probability that a randomly selected individual will be:
  - Taller than a particular height
  - Shorter than a particular height
  - Between two heights
• We answer these questions using R's pnorm function.
We first convert a height to a z value, where: $z = \frac{y - \bar{y}}{s}$

Standard Normal Distribution (Z)
• Find the probability that someone is less than 160cm:
  z = (160-170)/8 = -1.25, pnorm(-1.25) ≈ 0.11
• Find the probability that someone is greater than 185cm:
  z = (185-170)/8 = 1.875, 1-pnorm(1.875) ≈ 0.03

T-Test
• The t-test assesses whether the means of two groups are statistically different from each other.
• If there is a less than 5% chance (p-value < 0.05) of observing differences at least this large when the null hypothesis of no difference is true, we reject the null hypothesis and say we found a statistically significant difference between the two groups.

T-Test

Correlation

Correlation
• The correlation coefficient is equal to the slope of the regression line when both the X and Y variables have been converted to z-scores, where z is the standardized score: $z = \frac{x - \bar{x}}{s}$

Confidence Intervals
• A range given by a value above and a value below the sample mean.
• Are used to generalise the mean from a sample to a wider population.
• For a 95% confidence interval: if a study was conducted 100 times, about 95 of the resulting intervals would be expected to contain the population mean.
• Confidence intervals are wider if the sample is small and if the data is varied.

Confidence Intervals
• A survey was conducted on the rate of work-related stress in a 12 month period (per 100,000 employed).
• The mean was 780 per 100,000 employed.
• The 95% confidence limits are 700 to 860 people.
• This means that if the survey were repeated many times, about 95% of the intervals computed this way would contain the true mean number of people self-reporting work-related stress in the 12 months.

Confidence Intervals

simpleR: Using R for Introductory Statistics, by John Verzani
• Univariate Data
• Bivariate Data
• Linear regression
• Random Data
• Simulations
• Exploratory Data Analysis
• Confidence Interval Estimation
• Hypothesis Testing
• Two-sample tests
• Regression Analysis
• Multiple Linear Regression
• Analysis of Variance

Correct and Incorrect Interpretations

Data and a Linear Model (see Lab1)

Moving the line to get a best fit

Changing the slope of the line to get a best fit
• R can calculate the maximum likelihood estimate of the intercept and slope, giving: y = 4.8 + (0.6 * x)
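A fitted line such as y = 4.8 + (0.6 * x) can be obtained with R's lm function, which fits by least squares (this coincides with the maximum likelihood estimate when the errors are assumed to be normally distributed). The sketch below is illustrative only: the Lab1 data is not reproduced here, so the data is made up and constructed so that the fit returns intercept 4.8 and slope 0.6, and the names x, y, noise, and fit are assumptions.

```r
# Made-up data, NOT the Lab1 data: points scattered around y = 4.8 + 0.6 * x.
x <- 1:5
# Noise chosen to sum to zero and to be uncorrelated with x,
# so the least-squares fit recovers the underlying line.
noise <- c(0.5, -1.0, 0.5, 0, 0)
y <- 4.8 + 0.6 * x + noise

# Fit the linear model y = intercept + slope * x.
fit <- lm(y ~ x)
coef(fit)   # estimated intercept and slope

# Plot the data and overlay the fitted line.
plot(x, y, xlab = "x", ylab = "y")
abline(fit)
```

With real data the noise will not be so conveniently arranged, and the estimates will only approximate the underlying intercept and slope; summary(fit) reports their standard errors.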