# Introduction to Spatial Data Analysis

```
Introduction to Spatial Data Analysis

Ulysses Diva
Student Seminar Spring 2005
Department of Statistics
University of Connecticut
Storrs, CT 06269
March 20, 2005
Outline of Talk

Background
Types of Spatial Data
Classical Approach to Spatial Modeling
Point-Level Data
Areal Data
Hierarchical Bayesian Approach
My Current Research
Remarks

4/22/2005                       Introduction to Spatial Statistics   2
Background

What is spatial statistics?
Refers to a very broad collection of methods and techniques of
visualization, exploration and analysis applied to data with spatial
structure.
Spatiotemporal – when time is involved
Why is it useful?
A lot of data contain geographic information
Interest in studying response patterns over a particular region
Ignoring the spatial structure can lead to spurious results.
A Few Examples
Meteorology – weather patterns over a country/region
Environmental Science – pollutant concentrations over an area
Epidemiology – disease monitoring

Basic Notation

Data/Response: Y(s)
Covariates: X(s)
Location Index: s ∈ D ⊂ ℜ^k
k = 1: one-dimensional (ex. time series)
k = 2: two-dimensional (ex. geographic area)
k = 3: three-dimensional (ex. spatiotemporal, Long*Lat*Alt)
Sample Points/Locations: s_i, i = 1, …, n

Types of Spatial Data

Point-Level Data
Y(s) is a random vector
s varies continuously over D (fixed); depends on the coordinate system
ex. Amount of snowfall at fixed locations (coordinates) across CT
Areal Data
D is fixed but partitioned into a finite number of units/sections/blocks
with well-defined boundaries
Data are written Y(B_i), i = 1, …, n, to distinguish from continuous coordinates
ex. Number of Republican votes from each state last November
Point Pattern Data
Y(s) fixed (indicating occurrence of event, may contain covariate info)
D is random
ex. Locations of a particular species of trees in a forest region

Sample Plot: Point-Level Data

Fig. 1. Map of PM2.5 (particulate matter less than 2.5 microns in
diameter) sampling sites over Illinois, Indiana, and Ohio; plotting
character indicates range of average monitored PM2.5 level (in ppb)
over year 2001.
Source: Banerjee et al. (2004)

Sample Plot: Areal Data

Fig. 2. Map of percent of surveyed population with household income
below 200% of the federal poverty limit, regional survey units in Hennepin
County, MN.
Source: Banerjee et al. (2004)

Sample Plot: Combination of Point-Level and Areal Data
Fig. 3. Zip code boundaries in the Atlanta metropolitan area and 8-hour
maximum ozone levels (ppm) at 10 monitoring sites for July 15, 1995.
Source: Banerjee et al. (2004)

Modeling Point-Level Data
(Classical Approach)
Fundamental concept: (Spatial) Stochastic Process
{Y(s): s ∈ D}, D is a fixed subset of k-dimensional (Euclidean) space
Data only on {s_1, s_2, …, s_n} (partial realization)
Objectives:
Make inferences about the process (parameters)
Predict response at new locations
Note: Point-level data is also known as point-referenced or
geostatistical data

Modeling Point-Level Data
Common Approach

Assume that the covariance between the random variables at
two locations depends on the distance between them
Exponential model for the covariance function
Cov(Y(s_i), Y(s_{i'})) ≡ C(d_{ii'}) = σ^2 exp(−φ d_{ii'}) + τ^2 I(i = i')
d_{ii'} = distance between s_i and s_{i'}
σ^2 (>0) – partial sill parameter
φ (>0) – decay parameter (1/φ – range parameter)
τ^2 (>0) – nugget effect
τ^2 + σ^2 is called the sill
Several other covariance models are available
Covariogram – plot of covariance versus distance

Modeling Point-Level Data
Common Approach

Assume Gaussian process for the random variable
(Y(s_i), i = 1, …, n) ≡ Y | µ, θ ~ N_n(µ, Σ(θ))
N_n is the n-dim normal distribution
µ – (stationary) mean (= 0)
θ = (τ^2, σ^2, φ)^T
Σ(θ) – variance-covariance matrix
Stationarity
Strictly Stationary – the joint distribution is invariant to shifts (lags) of the locations
Weakly Stationary – constant mean and Cov(Y(s), Y(s+h)) = C(h), ∀ h ∈ ℜ^k
Intrinsically Stationary – E[Y(s+h) – Y(s)]^2 = Var[Y(s+h) – Y(s)] = 2γ(h)
γ(h) – semivariogram; 2γ(h) – variogram
Isotropic – if γ(h) depends only on ||h|| and not on h itself

Modeling Point-Level Data
Common Approach

Exploratory Data Analysis
Basic principle – Spatial process is composed of a global component
(first-order behavior; the mean) and a local component (second-order
behavior; the error covariance function)
Graphical approaches
Estimated/empirical variogram, computed by discretizing the inter-site
distances into (uniform-sized) bins
Classical Prediction – Kriging
Named by Matheron (1963) in honor of D.G. Krige
Goal: Given observations Y = {Y(s_i), i = 1, …, n} from a random field,
predict Y(s_0), where s_0 is an unobserved location
Approach: Consider the linear model Y = Xβ + ε, ε ~ N_n(0, Σ)
MMSE Estimate = WLS Estimate (complicated!)
EM algorithm is used if covariate info, x_0, is also not available
REML possesses “optimal” properties

Modeling Point-Level Data
Example

Fig. 4. Left panel: Sampled sites over the Atlantic Ocean for the 1990 scallops data.
Right panel: Same data with contour plots of log catch.
Source: Banerjee et al. (2004)

Modeling Point-Level Data
Example

Fig. 5. Ordinary (left) and robust (right) empirical variograms for the scallops data.
Source: Banerjee et al. (2004)

Modeling Areal Data
(Classical Approach)
Recall: Data are in grids/blocks
Inferential Issues
Is there spatial pattern? How strong is it?
Units “near” to each other should be more similar
How much (spatial) smoothing? Extreme cases:
No smoothing → observed data points (not much insight)
Maximal smoothing → only one single value for the units (lost info)
What data values are expected if we have a new unit (or set of units)?
Modifiable Areal Unit Problem (MAUP)
Ex. From zip codes to census blocks
Algorithmic approaches vs. model-based approaches

Modeling Areal Data
Foundation

Data: Y_i ≡ Y(B_i), i = 1, …, n
The Proximity Matrix, W = {w_{ij}}
Spatially connects two units
Common choices
Binary: w_{ij} = 1 if i and j share a common boundary or vertex, 0 o.w.
Decreasing function of inter-centroidal distance – reflects “distance”
between units
Need not be symmetric (ex. row-standardized weights w_{ij}/w_{i+}, where
w_{i+} = Σ_j w_{ij})
May be viewed as weights: more weight to areas that are “closer”
Neighbors
Discretize distances into bins: (0 < d_1 < d_2 < d_3 < …)
First-order: w_{ij}^{(1)} = 1 if the distance between i and j is within (0, d_1]
k-th order: w_{ij}^{(k)} = 1 if the distance is within (d_{k-1}, d_k]

Modeling Areal Data
Measures of Spatial Association

Moran’s I
I = [n Σ_i Σ_j w_{ij} (Y_i – Ybar)(Y_j – Ybar)] / [(Σ_{i≠j} w_{ij}) Σ_i (Y_i – Ybar)^2]
Larger |value| → stronger association; not strictly supported on [-1, 1]
Geary’s C
C = [(n-1) Σ_i Σ_j w_{ij} (Y_i – Y_j)^2] / [2 (Σ_{i≠j} w_{ij}) Σ_i (Y_i – Ybar)^2]
C ≥ 0; null value = 1; small values in (0, 1) → positive spatial association
For significance testing use Monte Carlo Approach
Correlogram
In I or C, replace w_{ij} with w_{ij}^{(r)} → I^{(r)} or C^{(r)}
Plot I^{(r)} (or C^{(r)}) vs. r → similar to autocorrelation plot
Smoothing
Replace Y_i by the (W-) weighted average of the other Y’s
Use a convex combination of Y_i and the weighted average of the other Y’s

Modeling Areal Data
Common Models

Conditionally Autoregressive (CAR) Models
Assume: Y_i | y_j, j ≠ i ~ N(Σ_j b_{ij} y_j, τ_i^2), i = 1, …, n    (*)
By Brook’s Lemma: p(y_1, …, y_n) ∝ exp{-0.5 y' D^{-1}(I – B) y}
where B = {b_{ij}}, D = diag{τ_i^2} and Σ_y^{-1} = D^{-1}(I – B)
Do we have a joint multivariate normal for Y?
Σ_y^{-1} must be symmetric → set b_{ij} = w_{ij}/w_{i+} and τ_i^2 = τ^2/w_{i+}
so that we have p(y_1, …, y_n) ∝ exp{-y'(D_w – W)y / (2τ^2)}, D_w = diag{w_{i+}}
But (D_w – W)1 = 0
i.e. Σ_y^{-1} is singular (the joint distribution is improper) → impose a constraint
Intrinsically Autoregressive (IAR): set Σ_i Y_i = 0 (centered; still improper)
For propriety:
Use Σ_y^{-1} = D_w – ρW, ρ ∈ (1/λ_(1), 1/λ_(n)), where λ_(1) < … < λ_(n) are
the ordered eigenvalues of D_w^{-1/2} W D_w^{-1/2}
Use W* = diag{1/w_{i+}} W, so that Σ_y^{-1} = M^{-1}(I – αW*) is p.d.,
where M is diagonal and |α| < 1

Modeling Areal Data
Common Models

Conditionally Autoregressive (CAR) Models
From (*), Y = BY + ε ⇔ (I – B)Y = ε
i.e. Y induces a distribution on ε
If p(y) is proper we can readily interpret the entries in Σ_y^{-1}
Extensions
When covariate information is available → same covariance structure
When we have a vector of dependent random effects → MCAR
When we have non-Gaussian data (exponential family)
For details
Banerjee et al. (2004)
Besag (1974)

Modeling Areal Data
Common Models

Simultaneous Autoregressive (SAR) Models
Let ε induce a distribution on Y
Suppose ε = {ε_i} where the ε_i’s are independent innovations
ε ~ N_n(0, D*) where D* = diag{σ_i^2}
Y_i = Σ_j b_{ij} Y_j + ε_i for i = 1, …, n; let B = {b_{ij}}
If (I – B) is full rank, Y ~ N_n(0, (I – B)^{-1} D* ((I – B)^{-1})')
Cov(ε, Y) = D* ((I – B)^{-1})'
To make (I – B) full rank
Use B = ρW, ρ ∈ (1/λ_(1), 1/λ_(n)), where λ_(i) are the ordered eigenvalues of W
Use W* = {w_{ij}/w_{i+}} (row-standardized W), so that B = αW* and |α| < 1
Remarks
Well suited to efficient likelihood-based fitting, but less convenient for MCMC and Gibbs sampling
Commonly used in Econometrics

Modeling Areal Data
Example
The above two models are usually placed on the spatial frailties (random effects)
rather than on the response variables themselves.
Ex. Fig. 6. Posterior mean spatial frailties, Iowa cancer data (Banerjee et al., 2004)

Hierarchical Bayesian Modeling

Can be applied to any type of spatial data
All that you need
Data: Y(s) with covariates X(s)
Likelihood for the data: f(Y(s)|X(s), θ), usually exponential family
Prior on θ: π(θ), usually θ = (β, σ^2, τ^2, φ)
Posterior distributions: π(θ|Y(s), X(s)) ∝ f(Y(s)| X(s), θ) π(θ)
Just make sure that the posterior distribution is proper to be able to do
Bayesian inference
Do model checking and comparison
Done!!!
In practice, though, setting up the computational part is not easy for most models.

Illustration of Bayesian Approach:
Modeling Spatially Correlated Survival Data for
Individuals with Multiple Cancer
Joint work with Dr. Dey and Prof. Sudipto Banerjee (UMN)
Data Source: Surveillance, Epidemiology and End Results
(SEER) Registry Data (1973-2001)
Variables of interest
Patient Level: Patient ID, County, Gender (2 levels), Marital Status (3
levels), Number of Primaries, Survival status at study cutoff (Dead or
Alive)
Patient-Cancer Level: Cancer type, Age at diagnosis, Survival Time (in
months) at cutoff
Modeling Framework
Cox Proportional Hazards Model
Semi-parametric (mixture of betas) specification of baseline hazard
Full MCAR for the spatial frailties

Modeling Spatially Correlated Survival Data for
Individuals with Multiple Cancer

Proportional Hazards Model:
h(t_{ijk}) = h_0(t_{ijk}) exp(x_{ijk}^T β + u_{(i,j)} + ν_k + φ_{ik})
i = 1, …, I (= 99 counties)
j = 1, …, n_i (patients in county i)
k ∈ {Colon, Rectum}
h_0(t) ≡ h_0(t|η) is the baseline hazard (modeled as mixture of betas,
Gelfand and Mallick, 1995)
β: regression coefficients
u_{(i,j)}: patient-specific frailty
ν_k: cancer-specific frailty
φ_{ik}: spatial frailty for county i associated with the kth cancer type

Modeling Spatially Correlated Survival Data for
Individuals with Multiple Cancer

Likelihood Function

Normal prior for the patient and cancer frailties

MCAR prior for the spatial frailties

Modeling Spatially Correlated Survival Data for
Individuals with Multiple Cancer

Dirichlet prior for the mixing weights
η ~ Dirichlet(θ1)
Inverted Wishart prior for Λ
Uniform(0, 1) prior for α
All full-conditional distributions are log-concave

Remarks

Spatial data arise very often in practice
Ignoring spatial information may lead to spurious results
There are a lot of established modeling techniques for spatial
data but they are mostly limited by computational complexity
and interpretability
Ex. Moran’s I and Geary’s C
The Bayesian approach is very straightforward and can
“easily” be extended to accommodate more complex models
Care has to be taken in interpreting the results
Ex. Interpretation of the spatial autocorrelation parameter can vary
across models.

Introduction to Spatial Data Statistics

Thank you.

```
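
The measures of spatial association from the areal data slides (Moran's I and Geary's C, with a Monte Carlo significance test) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not code from the talk: the function names `morans_i`, `gearys_c` and `permutation_pvalue`, the four-unit toy data, and the binary proximity matrix are all assumptions made for the example. It also assumes W has a zero diagonal, and it uses the standard form of Geary's C with the factor 2 in the denominator, which is what makes the null value equal to 1.

```python
import numpy as np


def morans_i(y, W):
    """Moran's I for values y and proximity matrix W (zero diagonal assumed)."""
    y = np.asarray(y, dtype=float)
    W = np.asarray(W, dtype=float)
    z = y - y.mean()                      # centered values Y_i - Ybar
    return y.size * (z @ W @ z) / (W.sum() * (z @ z))


def gearys_c(y, W):
    """Geary's C for values y and proximity matrix W (zero diagonal assumed)."""
    y = np.asarray(y, dtype=float)
    W = np.asarray(W, dtype=float)
    z = y - y.mean()
    diff2 = (y[:, None] - y[None, :]) ** 2   # all squared differences (Y_i - Y_j)^2
    return (y.size - 1) * (W * diff2).sum() / (2.0 * W.sum() * (z @ z))


def permutation_pvalue(stat, y, W, n_perm=999, seed=0, lower_tail=False):
    """Monte Carlo p-value: shuffle y over the units and recompute the statistic.

    Use lower_tail=False for Moran's I (large values mean positive association)
    and lower_tail=True for Geary's C (small values mean positive association).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = stat(y, W)
    sims = np.array([stat(rng.permutation(y), W) for _ in range(n_perm)])
    extreme = np.sum(sims <= observed) if lower_tail else np.sum(sims >= observed)
    return (extreme + 1) / (n_perm + 1)


# Toy example (hypothetical data): four units in a row, neighbors share a boundary.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

print("Moran's I:", morans_i(y, W))   # 1/3 for this toy example
print("Geary's C:", gearys_c(y, W))   # 0.3 for this toy example
print("Monte Carlo p-value (I):", permutation_pvalue(morans_i, y, W, n_perm=199))
```

Both statistics point the same way on the toy data: I is above its null expectation and C is below 1, as expected for a smooth trend across neighboring units. With real areal data, W would typically come from the map's adjacency structure, and the number of permutations would be larger.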