Introduction to Spatial Data Analysis

Ulysses Diva
Student Seminar, Spring 2005
Department of Statistics, University of Connecticut
Storrs, CT 06269
March 20, 2005

Outline of Talk
  Background
  Types of Spatial Data
  Classical Approach to Spatial Modeling
    Point-Level Data
    Areal Data
  Hierarchical Bayesian Approach
  My Current Research
  Remarks

4/22/2005 Introduction to Spatial Statistics

Background
  What is spatial statistics?
    A very broad collection of methods and techniques for the visualization, exploration, and analysis of data with spatial structure.
    Spatiotemporal – when time is also involved
  Why is it useful?
    A lot of data contain geographic information
    Interest in studying response patterns over a particular region
    Ignoring the spatial structure can lead to spurious results
  A Few Examples
    Meteorology – weather patterns over a country/region
    Environmental Science – pollutant concentrations over an area
    Epidemiology – disease monitoring

Basic Notation
  Data/Response: $Y(s)$
  Covariates: $X(s)$
  Location Index: $s \in D \subset \mathbb{R}^k$
    $k = 1$: uni-dimensional (ex. time series)
    $k = 2$: two-dimensional (ex. geographic area)
    $k = 3$: three-dimensional (ex. spatiotemporal, Long*Lat*Alt)
  Sample Points/Locations: $s_i$, $i = 1, \ldots, n$

Types of Spatial Data
  Point-Level Data
    $Y(s)$ is a random vector
    $s$ varies continuously over $D$ (fixed); depends on the coordinate system
    ex. Amount of snowfall at fixed locations (coordinates) across CT
  Areal Data
    $D$ is fixed but partitioned into a finite number of units/sections/blocks with well-defined boundaries
    $Y(B_i)$, $i = 1, \ldots, n$, to distinguish from continuous coordinates
    ex. Number of Republican votes from each state last November
  Point Pattern Data
    $Y(s)$ is fixed (indicating occurrence of an event; may contain covariate info)
    $D$ is random
    ex. Locations of a particular species of trees in a forest region

Sample Plot: Point-Level Data
Fig. 1.
Map of PM2.5 (particulate matter less than 2.5 microns in diameter) sampling sites over Illinois, Indiana, and Ohio; plotting character indicates range of average monitored PM2.5 level (in ppb) over year 2001. Source: Banerjee et al. (2004)

Sample Plot: Areal Data
Fig. 2. Map of percent of surveyed population with household income below 200% of the federal poverty limit, regional survey units in Hennepin County, MN. Source: Banerjee et al. (2004)

Sample Plot: Combination of Point-Level and Areal Data
Fig. 3. Zip code boundaries in the Atlanta metropolitan area and 8-hour maximum ozone levels (ppm) at 10 monitoring sites for July 15, 1995. Source: Banerjee et al. (2004)

Modeling Point-Level Data (Classical Approach)
  Fundamental concept: (Spatial) Stochastic Process
    $\{Y(s) : s \in D\}$, where $D$ is a fixed subset of $k$-dimensional (Euclidean) space
    Data are observed only at $\{s_1, s_2, \ldots, s_n\}$ (a partial realization)
  Objectives:
    Make inferences about the process (parameters)
    Predict the response at new locations
  Note: Point-level data are also known as point-referenced or geostatistical data

Modeling Point-Level Data: Common Approach
  Assume that the covariance between the random variables at two locations depends on the distance between them
  Exponential model for the covariance function:
    $\mathrm{Cov}(Y(s_i), Y(s_{i'})) \equiv C(d_{ii'}) = \sigma^2 e^{-\phi d_{ii'}} + \tau^2 I(i = i')$.
    $d_{ii'}$ = distance between $s_i$ and $s_{i'}$
    $\sigma^2$ ($> 0$) – partial sill parameter
    $\phi$ ($> 0$) – decay parameter ($1/\phi$ – range parameter)
    $\tau^2$ ($> 0$) – nugget effect
    $\tau^2 + \sigma^2$ is called the sill
  Several other covariance models are available
  Covariogram – plot of covariance versus distance

Modeling Point-Level Data: Common Approach
  Assume a Gaussian process for the random variable:
    $(Y(s_i),\ i = 1, \ldots, n) \equiv Y \mid \mu, \theta \sim N_n(\mu, \Sigma(\theta))$
    $N_n$ is the $n$-dimensional normal distribution
    $\mu$ – (stationary) mean (= 0)
    $\theta = (\tau^2, \sigma^2, \phi)^T$
    $\Sigma(\theta)$ – variance-covariance matrix
  Stationarity
    Strictly Stationary – the distribution is lag-invariant
    Weakly Stationary – $\mathrm{Cov}(Y(s), Y(s+h)) = C(h)$, $\forall\, h \in \mathbb{R}^k$
    Intrinsically Stationary – $E[Y(s+h) - Y(s)]^2 = \mathrm{Var}[Y(s+h) - Y(s)] = 2\gamma(h)$
      $\gamma(h)$ – semi-variogram; $2\gamma(h)$ – variogram
    Isotropic – if $\gamma(h)$ depends only on $\|h\|$ and not on $h$ itself

Modeling Point-Level Data: Common Approach
  Exploratory Data Analysis
    Basic principle – a spatial process is composed of a global component (first-order behavior; mean) and a local component (second-order behavior; error covariance function)
    Graphical approaches
    Estimated/empirical variogram and discretizing the coordinates into uniform-sized bins
  Classical Prediction – Kriging
    Named by Matheron (1963) in honor of D.G. Krige
    Goal: Given a random field $Y = \{Y(s_i),\ i = 1, \ldots, n\}$, predict $Y(s_0)$, where $s_0$ is an unobserved location
    Approach: Consider the linear model $Y = X\beta + \varepsilon$, $\varepsilon \sim N_n(0, \Sigma)$
      MMSE Estimate = WLS Estimate (complicated!)
      EM algorithm is used if covariate info, $x_0$, is also not available
      REML possesses “optimal” properties

Modeling Point-Level Data: Example
Fig. 4. Left panel: Sampled sites over the Atlantic Ocean for the 1990 scallops data. Right panel: Same data with contour plots of log catch. Source: Banerjee et al. (2004)

Modeling Point-Level Data: Example
Fig. 5.
Ordinary (left) and robust (right) empirical variograms for the scallops data. Source: Banerjee et al. (2004)

Modeling Areal Data (Classical Approach)
  Recall: Data are in grids/blocks
  Inferential Issues
    Is there a spatial pattern? How strong is it?
      Units “near” each other should be more similar
    How much (spatial) smoothing? Extreme cases:
      No smoothing: the observed data points themselves (not much insight)
      Maximal smoothing: one single value for all units (lost info)
    What data values are expected if we have a new unit (or set of units)?
      Modifiable Areal Unit Problem (MAUP)
      Ex. From zip codes to census blocks
  Algorithmic approaches vs. model-based approaches

Modeling Areal Data: Foundation
  Data: $Y_i \equiv Y(B_i)$, $i = 1, \ldots, n$
  The Proximity Matrix, $W = \{w_{ij}\}$
    Spatially connects two units
    Common choices
      Binary: $w_{ij} = 1$ if $i$ and $j$ share a common boundary or vertex, 0 otherwise
      Decreasing function of inter-centroidal distance – reflects “distance” between units
    Need not be symmetric (ex. standardized by $\sum_j w_{ij} = w_{i+}$)
    May be viewed as weights: more weight to areas that are “closer”
  Neighbors
    Discretize distances into bins: $0 < d_1 < d_2 < d_3 < \cdots$
    First-order: $w_{ij}^{(1)} = 1$ if the distance is within $(0, d_1]$
    $k$-th order: $w_{ij}^{(k)} = 1$ if the distance is within $(d_{k-1}, d_k]$

Modeling Areal Data: Measures of Spatial Association
  Moran’s I
    $I = \dfrac{n \sum_i \sum_j w_{ij} (Y_i - \bar{Y})(Y_j - \bar{Y})}{\left(\sum_{i \ne j} w_{ij}\right) \sum_i (Y_i - \bar{Y})^2}$
    Larger |value| indicates stronger association; not strictly supported on $[-1, 1]$
  Geary’s C
    $C = \dfrac{(n - 1) \sum_i \sum_j w_{ij} (Y_i - Y_j)^2}{2 \left(\sum_{i \ne j} w_{ij}\right) \sum_i (Y_i - \bar{Y})^2}$
    $C \ge 0$; null value = 1; small values (in $(0, 1)$) indicate positive spatial association
  For significance testing, use a Monte Carlo approach
  Correlogram
    In $I$ or $C$, replace $w_{ij}$ with $w_{ij}^{(r)}$ to obtain $I^{(r)}$ or $C^{(r)}$
    Plot $I^{(r)}$ (or $C^{(r)}$) vs.
$r$, similar to an autocorrelation plot
  Smoothing
    Replace $Y_i$ by a ($W$-)weighted average of the other $Y$’s
    Use a convex combination of $Y_i$ and the weighted average of the other $Y$’s

Modeling Areal Data: Common Models
  Conditionally Autoregressive (CAR) Models
    Assume: $Y_i \mid y_j, j \ne i \sim N\left(\sum_j b_{ij} y_j,\ \tau_i^2\right)$, $i = 1, \ldots, n$   (*)
    By Brook’s Lemma: $p(y_1, \ldots, y_n) \propto \exp\{-\tfrac{1}{2}\, y^T D^{-1} (I - B)\, y\}$,
      where $B = \{b_{ij}\}$, $D = \mathrm{diag}\{\tau_i^2\}$, and $\Sigma_y^{-1} = D^{-1}(I - B)$
    Do we have a joint multivariate normal for $Y$?
      $\Sigma_y^{-1}$ must be symmetric: set $b_{ij} = w_{ij}/w_{i+}$ and $\tau_i^2 = \tau^2/w_{i+}$, so that
        $p(y_1, \ldots, y_n) \propto \exp\{-y^T (D_w - W)\, y / (2\tau^2)\}$, where $D_w = \mathrm{diag}\{w_{i+}\}$
      But $(D_w - W)\mathbf{1} = 0$, i.e. $\Sigma_y^{-1}$ is singular (improper), so a constraint is needed
      Intrinsically Autoregressive (IAR): set $\sum_i Y_i = 0$ (centered; still improper)
    For propriety:
      Use $\Sigma_y^{-1} = D_w - \rho W$, $\rho \in (1/\lambda_{(1)}, 1/\lambda_{(n)})$, where $\lambda_{(i)}$ are the ordered eigenvalues of $D_w^{-1/2} W D_w^{-1/2}$
      Use $W^* = \mathrm{diag}\{1/w_{i+}\} W$, so that $\Sigma_y^{-1} = M^{-1}(I - \alpha W^*)$ is p.d., where $M$ is diagonal and $|\alpha| < 1$

Modeling Areal Data: Common Models
  Conditionally Autoregressive (CAR) Models
    From (*), $Y = BY + \varepsilon \Leftrightarrow (I - B)Y = \varepsilon$, i.e. $Y$ induces a distribution on $\varepsilon$
    If $p(y)$ is proper, we can readily interpret the entries in $\Sigma_y^{-1}$
  Extensions
    When covariate information is available: same covariance structure
    When we have a vector of dependent random effects: MCAR
    When we have non-Gaussian data (exponential family)
  For details: Banerjee et al.
(2004); Besag (1974)

Modeling Areal Data: Common Models
  Simultaneous Autoregressive (SAR) Models
    Let $\varepsilon$ induce a distribution on $Y$
    Suppose $\varepsilon = \{\varepsilon_i\}$, where the $\varepsilon_i$’s are independent innovations:
      $\varepsilon \sim N_n(0, D^*)$, where $D^* = \mathrm{diag}\{\sigma_i^2\}$
      $Y_i = \sum_j b_{ij} Y_j + \varepsilon_i$ for $i = 1, \ldots, n$; let $B = \{b_{ij}\}$
    If $(I - B)$ is full rank, $Y \sim N_n\left(0,\ (I - B)^{-1} D^* ((I - B)^{-1})^T\right)$
      $\mathrm{Cov}(\varepsilon, Y) = D^* (I - B)^{-1}$
    To make $(I - B)$ full rank:
      Use $B = \rho W$, $\rho \in (1/\lambda_{(1)}, 1/\lambda_{(n)})$, where $\lambda_{(i)}$ are the ordered eigenvalues of $W$
      Use $W^* = \mathrm{diag}\{1/w_{i+}\} W$, so that $B = \alpha W^*$ with $|\alpha| < 1$
  Remarks
    Likelihood approaches are very efficient, but MCMC and Gibbs sampling are not
    Commonly used in econometrics

Modeling Areal Data: Example
  The above two models are usually placed on the spatial frailties (random effects) rather than on the response variables themselves.
  Ex. Fig. 6. Posterior mean spatial frailties, Iowa cancer data (Banerjee et al., 2004)

Hierarchical Bayesian Modeling
  Can be applied to any type of spatial data
  All that you need:
    Data: $Y(s)$ with covariates $X(s)$
    Likelihood for the data: $f(Y(s) \mid X(s), \theta)$, usually exponential family
    Prior on $\theta$: $\pi(\theta)$, usually $\theta = (\beta, \sigma^2, \tau^2, \phi)$
    Posterior distribution: $\pi(\theta \mid Y(s), X(s)) \propto f(Y(s) \mid X(s), \theta)\, \pi(\theta)$
  Just make sure that the posterior distribution is proper to be able to do Bayesian inference
  Do model checking and comparison
  Done!!!
  It is not that easy to set up the computational part for most models.

Illustration of Bayesian Approach: Modeling Spatially Correlated Survival Data for Individuals with Multiple Cancer
  Joint work with Dr. Dey and Prof.
Sudipto Banerjee (UMN)
  Data Source: Surveillance, Epidemiology and End Results (SEER) Registry Data (1973-2001)
  Variables of interest
    Patient Level: Patient ID, County, Gender (2 levels), Marital Status (3 levels), Number of Primaries, Survival status at study cutoff (Dead or Alive)
    Patient-Cancer Level: Cancer type, Age at diagnosis, Survival Time (in months) at cutoff
  Modeling Framework
    Cox Proportional Hazards Model
    Semi-parametric (mixture of betas) specification of baseline hazard
    Full MCAR for the spatial frailties

Modeling Spatially Correlated Survival Data for Individuals with Multiple Cancer
  Proportional Hazards Model:
    $h(t_{ijk}) = h_0(t_{ijk}) \exp\left(x_{ijk}^T \beta + u(i,j) + \nu_k + \phi_{ik}\right)$
    $i = 1, \ldots, I$ (= 99 counties)
    $j = 1, \ldots, n_i$ (patients in county $i$)
    $k \in \{\text{Colon}, \text{Rectum}\}$
    $h_0(t) \equiv h_0(t \mid \eta)$ is the baseline hazard (modeled as a mixture of betas; Gelfand and Mallick, 1995)
    $\beta$ – regression coefficients
    $u(i,j)$ – patient-specific frailty
    $\nu_k$ – cancer-specific frailty
    $\phi_{ik}$ – spatial frailty associated with the $k$th cancer type

Modeling Spatially Correlated Survival Data for Individuals with Multiple Cancer
  Likelihood Function
  Normal prior for the patient and cancer frailties
  MCAR prior for the spatial frailties

Modeling Spatially Correlated Survival Data for Individuals with Multiple Cancer
  Dirichlet prior for the mixing weights: η ~ Dirichlet(θ1)
  Inverted Wishart prior for Λ
  Uniform(0, 1) prior for α
  All full-conditional distributions are log-concave

Remarks
  Spatial data arise very often in practice
  Ignoring spatial information may lead to spurious results
  There are a lot of established modeling techniques for spatial data but they are mostly limited by computational complexity and
interpretability
    Ex. Moran’s I and Geary’s C
  The Bayesian approach is very straightforward and can “easily” be extended to accommodate more complex models
  Care has to be taken in interpreting the results
    Ex. Interpretation of the spatial autocorrelation parameter can vary across models.

Introduction to Spatial Data Statistics
Thank you.
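Worked illustration of the measures of spatial association from the areal-data part of the talk. This is a minimal sketch in plain Python, not code from the talk: the five-unit map, its binary adjacency matrix, and the response values are hypothetical, and the helper names (morans_i, gearys_c, permutation_pvalue) are made up for the example. Geary's C is computed with the usual factor of 2 in the denominator, which is what makes its null value equal to 1. The Monte Carlo step is assumed to be a simple permutation test that shuffles the responses over the units.

```python
import random


def morans_i(y, w):
    """Moran's I for responses y and proximity matrix w (list of lists)."""
    n = len(y)
    ybar = sum(y) / n
    dev = [v - ybar for v in y]
    w_sum = sum(w[i][j] for i in range(n) for j in range(n) if i != j)
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return n * num / (w_sum * sum(d * d for d in dev))


def gearys_c(y, w):
    """Geary's C, with the factor 2 in the denominator (null value 1)."""
    n = len(y)
    ybar = sum(y) / n
    w_sum = sum(w[i][j] for i in range(n) for j in range(n) if i != j)
    num = sum(w[i][j] * (y[i] - y[j]) ** 2 for i in range(n) for j in range(n))
    return (n - 1) * num / (2 * w_sum * sum((v - ybar) ** 2 for v in y))


def permutation_pvalue(stat, y, w, n_perm=999, seed=0):
    """One-sided Monte Carlo p-value: share of shuffled statistics >= observed."""
    rng = random.Random(seed)
    observed = stat(y, w)
    count = 0
    for _ in range(n_perm):
        perm = y[:]
        rng.shuffle(perm)  # break any spatial pattern by relabeling the units
        if stat(perm, w) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)


# Hypothetical example: five units in a row, neighbors share a boundary.
w = [[1 if abs(i - j) == 1 else 0 for j in range(5)] for i in range(5)]
y = [1.0, 2.0, 3.0, 4.0, 5.0]  # smooth trend, so neighbors are similar

moran = morans_i(y, w)   # 0.5 for this example
geary = gearys_c(y, w)   # 0.2 for this example
p_value = permutation_pvalue(morans_i, y, w)
print(moran, geary, p_value)
```

For this smooth trend, I is positive and C falls in (0, 1), both pointing to positive spatial association, which matches the reading of the two statistics given in the talk.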

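The CAR discussion in the talk notes that the precision matrix D_w − W has rows summing to zero (so the joint distribution is improper) and that D_w − ρW restores propriety for suitable ρ. The following is a minimal sketch of that argument in plain Python. The four-unit map, the value ρ = 0.9, the numerical tolerance, and the helper names (car_precision, is_positive_definite) are all assumptions made for illustration; positive definiteness is checked here by attempting a Cholesky factorization, which is one convenient test and not necessarily what the author used.

```python
import math


def car_precision(w, rho):
    """Return D_w - rho * W, where D_w = diag(row sums of W)."""
    n = len(w)
    return [
        [(sum(w[i]) if i == j else 0.0) - rho * w[i][j] for j in range(n)]
        for i in range(n)
    ]


def is_positive_definite(a, tol=1e-12):
    """Try a Cholesky factorization; every pivot must exceed tol."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= tol:  # zero or negative pivot: not p.d.
                    return False
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


# Hypothetical map: four units in a row with binary adjacency.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

iar = car_precision(W, rho=1.0)          # D_w - W
row_sums = [sum(row) for row in iar]     # (D_w - W) 1, all zeros
iar_is_pd = is_positive_definite(iar)    # False: singular precision
car_is_pd = is_positive_definite(car_precision(W, rho=0.9))  # True
print(row_sums, iar_is_pd, car_is_pd)
```

The zero row sums show the singularity directly, while shrinking the off-diagonal entries by ρ = 0.9 makes the matrix strictly diagonally dominant and therefore positive definite in this example.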
