Introduction to Spatial Data Analysis




              Ulysses Diva
      Student Seminar Spring 2005
          Department of Statistics
         University of Connecticut
            Storrs, CT 06269
             March 20, 2005
                               Outline of Talk

    Background
    Types of Spatial Data
    Classical Approach to Spatial Modeling
            Point-Level Data
            Areal Data
    Hierarchical Bayesian Approach
    My Current Research
    Remarks




4/22/2005                       Introduction to Spatial Statistics   2
                                     Background

    What is spatial statistics?
            Refers to a very broad collection of methods and techniques of
            visualization, exploration and analysis applied to data with spatial
            structure.
            Spatiotemporal – when time is involved
    Why is it useful?
            A lot of data contain geographic information
            Interest in studying response patterns over a particular region
            Ignoring the spatial structure may lead to spurious results.
    A Few Examples
            Meteorology – weather patterns over a country/region
            Environmental Science – pollutant concentrations over an area
            Epidemiology – disease monitoring
                              Basic Notation

    Data/Response: Y(s)
    Covariates: X(s)
    Location Index: s ∈ D ⊂ ℜ^k
            k = 1 → one-dimensional (ex. Time series)
            k = 2 → two-dimensional (ex. Geographic area)
            k = 3 → three-dimensional (ex. Spatiotemporal, Long*Lat*Alt)
    Sample Points/Locations: s_i, i = 1, …, n




                             Types of Spatial Data

    Point-Level Data
            Y(s) is a random vector
            s varies continuously over D (fixed); depends on the coordinate system
            ex. Amount of snowfall at fixed locations (coordinates) across CT
    Areal Data
            D is fixed but partitioned into a finite number of units/sections/blocks
            with well-defined boundaries
            Notation: Y(B_i), i = 1, …, n, to distinguish areal units from continuous coordinates
            ex. Number of Republican votes from each state last November
    Point Pattern Data
            Y(s) fixed (indicating occurrence of event, may contain covariate info)
            D is random
            ex. Locations of a particular species of trees in a forest region

                  Sample Plot: Point-Level Data

    Fig. 1. Map of PM2.5 (particulate matter less than 2.5 microns in diameter)
    sampling sites over Illinois, Indiana, and Ohio; plotting character indicates
    range of average monitored PM2.5 level (in ppb) over year 2001.
    Source: Banerjee, et al. (2004)



                      Sample Plot: Areal Data




    Fig. 2. Map of percent of surveyed population with household income
    below 200% of the federal poverty limit, regional survey units in Hennepin
    County, MN.
    Source: Banerjee, et al. (2004)
            Sample Plot: Combination of Point-Level
                         and Areal Data
    Fig. 3. Zip code boundaries in the Atlanta metropolitan area and 8-hour
    maximum ozone levels (ppm) at 10 monitoring sites for July 15, 1995.
    Source: Banerjee, et al. (2004)




                        Modeling Point-Level Data
                          (Classical Approach)
    Fundamental concept: (Spatial) Stochastic Process
            {Y(s): s ∈ D}, where D is a fixed subset of k-dimensional (Euclidean) space
            Data are observed only at {s1, s2, …, sn} (a partial realization)
    Objectives:
            Make inferences about the process (parameters)
            Predict response at new locations
    Note: Point-level data is also known as point-referenced or
    geostatistical data




                               Modeling Point-Level Data
                             Common Approach

    Assume that the covariance between the random variables at
    two locations depends on the distance between them
    Exponential model for the covariance function
      $\mathrm{Cov}(Y(s_i), Y(s_{i'})) \equiv C(d_{ii'}) = \sigma^2 e^{-\phi d_{ii'}} + \tau^2 I(i = i')$
            d_ii’ = distance between s_i and s_i’
            σ² (> 0) – partial sill parameter
            φ (> 0) – decay parameter (1/φ – range parameter)
            τ² (> 0) – nugget effect
            τ² + σ² is called the sill
    Several other covariance models are available
    Covariogram – plot of covariance versus distance
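
The exponential covariance model above can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance between sites; the function name `exponential_covariance`, the coordinates, and the parameter values are made up for the example and are not from the talk.

```python
import numpy as np

def exponential_covariance(coords, sigma2, phi, tau2):
    """Matrix with entries sigma2 * exp(-phi * d_ii') + tau2 * I(i = i')."""
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances d_ii' between all locations.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Partial sill times exponential decay, plus the nugget on the diagonal.
    return sigma2 * np.exp(-phi * dist) + tau2 * np.eye(len(coords))

# Three illustrative locations in the plane (made-up coordinates).
sites = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
Sigma = exponential_covariance(sites, sigma2=1.0, phi=0.5, tau2=0.1)
print(Sigma)
```

The diagonal of `Sigma` equals the sill τ² + σ², and the off-diagonal entries decay toward zero as the distance between sites grows.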


                                 Modeling Point-Level Data
                              Common Approach

    Assume a Gaussian process for the random variable
            $(Y(s_i),\ i = 1, \ldots, n) \equiv Y \mid \mu, \theta \sim N_n(\mu, \Sigma(\theta))$
            N_n is the n-dimensional normal distribution
            µ – (stationary) mean (= 0)
            θ = (τ², σ², φ)^T
            Σ(θ) – variance-covariance matrix
    Stationarity
            Strictly Stationary – the joint distribution is lag-invariant
            Weakly Stationary – constant mean and Cov(Y(s), Y(s+h)) = C(h), ∀ h ∈ ℜ^k
            Intrinsically Stationary – E[Y(s+h) – Y(s)]² = V[Y(s+h) – Y(s)] = 2γ(h)
               γ(h) – semi-variogram; 2γ(h) – variogram
    Isotropic – if γ(h) depends only on ||h|| and not on h itself

                                  Modeling Point-Level Data
                                Common Approach

    Exploratory Data Analysis
            Basic principle – a spatial process is composed of a global component (first-order
            behavior; mean) and a local component (second-order behavior; error covariance function)
            Graphical approaches
            Estimated/empirical variogram, computed by grouping the inter-site distances into
            uniform-sized bins
    Classical Prediction – Kriging
            Named by Matheron (1963) in honor of D.G. Krige
            Goal: Given observations Y = {Y(si), i = 1,…,n} of a random field, predict Y(s0),
            where s0 is an unobserved location
            Approach: Consider the linear model Y = Xβ + ε, ε ~ Nn(0, Σ)
            MMSE Estimate = WLS Estimate (complicated!)
               EM algorithm is used if covariate info, x0, is also not available
               REML possesses “optimal” properties
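
As a rough illustration of the kriging predictor under the linear model Y = Xβ + ε, the sketch below plugs a generalized least squares estimate of β into the predictor x0'β + γ'Σ⁻¹(y − Xβ). It assumes the covariance parameters are known and uses the exponential covariance with a nugget. The function name `krige`, the data, and the parameter values are illustrative assumptions. In practice θ would be estimated, for example by REML.

```python
import numpy as np

def krige(coords, y, X, s0, x0, sigma2, phi, tau2):
    """Predict Y(s0) under Y = X beta + eps, eps ~ N(0, Sigma), with known theta."""
    coords = np.asarray(coords, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    s0 = np.asarray(s0, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    # Sigma: exponential covariance plus nugget on the diagonal.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    Sigma = sigma2 * np.exp(-phi * dist) + tau2 * np.eye(len(y))
    # gamma: covariance between Y(s0) and each observed Y(s_i). s0 is treated
    # as a new site, so the nugget indicator I(i = i') contributes nothing.
    d0 = np.sqrt(((coords - s0) ** 2).sum(axis=1))
    gamma = sigma2 * np.exp(-phi * d0)
    # Generalized least squares estimate of beta.
    Sinv_X = np.linalg.solve(Sigma, X)
    Sinv_y = np.linalg.solve(Sigma, y)
    beta_hat = np.linalg.solve(X.T @ Sinv_X, X.T @ Sinv_y)
    # Regression mean at s0 plus a spatially weighted correction from residuals.
    resid = y - X @ beta_hat
    pred = x0 @ beta_hat + gamma @ np.linalg.solve(Sigma, resid)
    return pred, beta_hat

# Made-up data: four sites on a unit square, intercept-only mean.
coords = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 1.5, 3.0]
X = np.ones((4, 1))
s0, x0 = [0.5, 0.5], [1.0]
pred, beta_hat = krige(coords, y, X, s0, x0, sigma2=1.0, phi=0.5, tau2=0.1)
print(pred, beta_hat)
```

Because the new site here is equidistant from all four observed sites, the prediction reduces to the sample mean; moving `s0` toward one corner pulls the prediction toward that corner's value.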

                               Modeling Point-Level Data
                                       Example




    Fig. 4. Left panel: Sampled sites over the Atlantic Ocean for the 1990 scallops data.
    Right panel: Same data with contour plots of log catch.
    Source: Banerjee et al.(2004)
                                Modeling Point-Level Data
                                        Example




    Fig. 5. Ordinary (left) and robust (right) empirical variograms for the scallops data.
    Source: Banerjee et al.(2004)
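
For readers who want to see how an ordinary empirical variogram is computed, here is a minimal sketch of the method-of-moments estimator: within each distance bin, average the squared differences over the pairs of sites and divide by two. The data below are made up and are not the scallops data. The function name `empirical_semivariogram` and the bin edges are assumptions, and the robust version is not implemented.

```python
import numpy as np

def empirical_semivariogram(coords, y, bin_edges):
    """Method-of-moments estimate of gamma(d) within each distance bin."""
    coords = np.asarray(coords, dtype=float)
    y = np.asarray(y, dtype=float)
    i, j = np.triu_indices(len(y), k=1)            # each pair of sites once
    dist = np.sqrt(((coords[i] - coords[j]) ** 2).sum(axis=1))
    sqdiff = (y[i] - y[j]) ** 2
    n_bins = len(bin_edges) - 1
    gamma_hat = np.full(n_bins, np.nan)            # NaN marks an empty bin
    counts = np.zeros(n_bins, dtype=int)
    for k in range(n_bins):
        in_bin = (dist > bin_edges[k]) & (dist <= bin_edges[k + 1])
        counts[k] = in_bin.sum()
        if counts[k] > 0:
            # Half the average squared difference over the pairs in the bin.
            gamma_hat[k] = sqdiff[in_bin].sum() / (2 * counts[k])
    return gamma_hat, counts

# Made-up example: four sites on a line with increasing responses.
coords = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
y = [1.0, 2.0, 4.0, 7.0]
gamma_hat, counts = empirical_semivariogram(coords, y, bin_edges=[0.0, 1.5, 3.5])
print(gamma_hat, counts)
```

Plotting `gamma_hat` against the bin midpoints gives the kind of display shown in the left panel of Fig. 5.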

                              Modeling Areal Data
                              (Classical Approach)
    Recall: Data are in grids/blocks
    Inferential Issues
            Is there spatial pattern? How strong is it?
               Units “near” each other should be more similar
            How much (spatial) smoothing? Extreme cases:
               No smoothing → the observed data points (not much insight)
               Maximal smoothing → one single value for all units (information lost)
            What data values are expected if we have a new unit (or set of units)?
               Modifiable Areal Unit Problem (MAUP)
               Ex. From zip codes to census blocks
    Algorithmic approaches vs. model based approaches


                                     Modeling Areal Data
                                       Foundation

    Data: Yi ≡ Y(Bi), i = 1, …, n
    The Proximity Matrix, W = {wij}
            Spatially connects two units
            Common choices
               Binary: w_ij = 1 if i and j share a common boundary or vertex, 0 o.w.
               Decreasing function of inter-centroidal distance – reflects “distance”
               between units
               Need not be symmetric (ex. rows standardized by w_i+ = Σ_j w_ij)
            May be viewed as weights: more weight to areas that are “closer”
    Neighbors
            Discretize distances into bins: (0 < d1 < d2 < d3 < …)
            First-order → w_ij^(1) = 1 if the distance between units i and j is within (0, d1]
            k-th order → w_ij^(k) = 1 if it is within (d_{k-1}, d_k]
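
A small sketch of distance-based proximity matrices, assuming each areal unit is summarized by its centroid. The helper names `binary_proximity` and `row_standardize` and the centroid coordinates are illustrative assumptions. A boundary-sharing definition of neighbors would need polygon data instead.

```python
import numpy as np

def binary_proximity(centroids, d_lower, d_upper):
    """w_ij = 1 if the inter-centroid distance lies in (d_lower, d_upper], else 0."""
    c = np.asarray(centroids, dtype=float)
    diff = c[:, None, :] - c[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    W = ((dist > d_lower) & (dist <= d_upper)).astype(float)
    np.fill_diagonal(W, 0.0)   # a unit is not its own neighbor
    return W

def row_standardize(W):
    """Divide each row by w_i+ so rows sum to 1 (the result is generally asymmetric)."""
    row_sums = W.sum(axis=1, keepdims=True)
    return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

# Made-up centroids of four units arranged along a line.
centroids = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
W1 = binary_proximity(centroids, 0.0, 1.0)   # first-order neighbors, (0, d1]
W2 = binary_proximity(centroids, 1.0, 2.0)   # second-order neighbors, (d1, d2]
W1_std = row_standardize(W1)
print(W1)
print(W1_std)
```

Note that `W1` is symmetric while `W1_std` is not: an end unit puts all its weight on its single neighbor, whereas that neighbor splits its weight between two units.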
                                    Modeling Areal Data
                    Measures of Spatial Association

    Moran’s I
            I = [n Σ_i Σ_j w_ij (Y_i – Ybar)(Y_j – Ybar)] / [(Σ_{i≠j} w_ij) Σ_i (Y_i – Ybar)²]
            Larger |value| → stronger association; not strictly supported on [-1, 1]
    Geary’s C
            C = [(n-1) Σ_i Σ_j w_ij (Y_i – Y_j)²] / [2 (Σ_{i≠j} w_ij) Σ_i (Y_i – Ybar)²]
            C ≥ 0; null value = 1; small values in (0, 1) → positive spatial association
    For significance testing use Monte Carlo Approach
    Correlogram
            In I or C, replace w_ij with w_ij^(r) → I^(r) or C^(r)
            Plot I^(r) (or C^(r)) vs. r → similar to an autocorrelation plot
    Smoothing
            Replace Yi by (W-) weighted average of the other Y’s
            Use convex combo of Yi and weighted average of the other Y’s
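
The two association measures and a Monte Carlo significance test can be sketched as follows. This assumes W has a zero diagonal, so that the sum of all its entries equals the sum over i ≠ j, and it uses the standard form of Geary's C with a factor of 2 in the denominator so that the null value is 1. The function names and the data are illustrative; the permutation test shown is one-sided for positive association via Moran's I.

```python
import numpy as np

def morans_i(y, W):
    """Moran's I for responses y and proximity matrix W (zero diagonal assumed)."""
    y = np.asarray(y, dtype=float)
    z = y - y.mean()
    return len(y) * (z @ W @ z) / (W.sum() * (z @ z))

def gearys_c(y, W):
    """Geary's C for responses y and proximity matrix W (zero diagonal assumed)."""
    y = np.asarray(y, dtype=float)
    z = y - y.mean()
    sq = (y[:, None] - y[None, :]) ** 2        # all pairwise squared differences
    return (len(y) - 1) * (W * sq).sum() / (2 * W.sum() * (z @ z))

def moran_permutation_test(y, W, n_perm=999, seed=0):
    """Monte Carlo p-value: share of random relabelings with I >= observed I."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = morans_i(y, W)
    perms = np.array([morans_i(rng.permutation(y), W) for _ in range(n_perm)])
    p_value = (1 + (perms >= observed).sum()) / (n_perm + 1)
    return observed, p_value

# Made-up example: four units in a chain with a steadily increasing response.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = [1.0, 2.0, 3.0, 4.0]
I_obs, p_value = moran_permutation_test(y, W)
C_obs = gearys_c(y, W)
print(I_obs, C_obs, p_value)
```

Here I is positive and C is below 1, both pointing to positive spatial association. With only four units the permutation distribution is very coarse, so the p-value is for illustration only.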

                                      Modeling Areal Data
                                  Common Models

    Conditionally Autoregressive (CAR) Models
            Assume: Y_i | y_j, j ≠ i ~ N(Σ_j b_ij y_j, τ_i²), i = 1, …, n    (*)
            By Brook’s Lemma: p(y1, …, yn) ∝ exp{-0.5 y’D^{-1}(I – B)y},
            where B = {b_ij}, D = diag{τ_i²} and Σ_y^{-1} = D^{-1}(I – B)
            Do we have a joint multivariate normal for Y?
               Σ_y^{-1} must be symmetric → set b_ij = w_ij/w_i+ and τ_i² = τ²/w_i+,
               so that p(y1, …, yn) ∝ exp{-y’(D_w – W)y/(2τ²)}, with D_w = diag{w_i+}
               But (D_w – W)1 = 0,
               i.e. Σ_y^{-1} is singular (the joint distribution is improper) → impose a constraint
               Intrinsically Autoregressive (IAR): set Σ_i Y_i = 0 (centered; still improper)
            For propriety:
               Use Σ_y^{-1} = D_w – ρW, ρ ∈ (1/λ(1), 1/λ(n)), where λ(1) ≤ … ≤ λ(n) are the
               ordered eigenvalues of D_w^{-1/2} W D_w^{-1/2}
               Use W* = diag{1/w_i+}W, so that Σ_y^{-1} = M^{-1}(I – αW*) is p.d.,
               where M is diagonal and |α| < 1
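
The propriety issue can be checked numerically. The sketch below builds the CAR precision matrix (D_w – ρW)/τ² for a small made-up chain of units and confirms that ρ = 1 gives a singular matrix while a value of ρ strictly inside the admissible interval gives a positive definite one. The bounds are computed from the eigenvalues of the scaled matrix D_w^{-1/2} W D_w^{-1/2}, which are the eigenvalues that govern positive definiteness of D_w – ρW. Function names are illustrative, and every unit is assumed to have at least one neighbor.

```python
import numpy as np

def car_precision(W, rho, tau2=1.0):
    """Precision matrix (D_w - rho * W) / tau2, with D_w = diag(w_i+)."""
    D_w = np.diag(W.sum(axis=1))
    return (D_w - rho * W) / tau2

def rho_bounds(W):
    """Interval (1/lambda_min, 1/lambda_max) from D_w^{-1/2} W D_w^{-1/2}."""
    d = W.sum(axis=1)                       # w_i+, assumed strictly positive
    scaled = W / np.sqrt(np.outer(d, d))
    lam = np.linalg.eigvalsh(scaled)        # eigenvalues in ascending order
    return 1.0 / lam[0], 1.0 / lam[-1]

def is_positive_definite(Q, tol=1e-10):
    """True if all eigenvalues of the symmetric matrix Q exceed tol."""
    return bool(np.all(np.linalg.eigvalsh(Q) > tol))

# Made-up binary proximity matrix: four units in a chain.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
lower, upper = rho_bounds(W)
Q_intrinsic = car_precision(W, rho=1.0)     # D_w - W: rows sum to zero
Q_proper = car_precision(W, rho=0.9)
print(lower, upper)
print(is_positive_definite(Q_intrinsic), is_positive_definite(Q_proper))
```

The upper bound is 1 for any such W, which is why values of ρ just below 1 are commonly used to obtain a proper model that stays close to the intrinsic one.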

                                        Modeling Areal Data
                                  Common Models

    Conditionally Autoregressive (CAR) Models
            From (*), Y = BY + ε ⇔ (I – B)Y = ε,
            i.e. Y induces a distribution on ε
            If p(y) is proper, we can readily interpret the entries in Σ_y^{-1}
            Extensions
               When covariate information is available → same covariance structure
               When we have a vector of dependent random effects → multivariate CAR (MCAR)
               When we have non-Gaussian data (exponential family)
            For details
               Banerjee et al. (2004)
               Besag (1974)




                                       Modeling Areal Data
                                   Common Models

    Simultaneous Autoregressive (SAR) Models
            Let ε induce a distribution on Y
            Suppose ε = {εi} where εi’s are independent innovations
               ε ~ N_n(0, D*), where D* = diag{σ_i²}
               Y_i = Σ_j b_ij Y_j + ε_i for i = 1, …, n; let B = {b_ij}
               If (I – B) is full rank, Y ~ N_n(0, (I – B)^{-1} D* ((I – B)^{-1})’)
               Cov(ε, Y) = D* ((I – B)^{-1})’
            To make (I – B) full rank
               Use B = ρW, ρ ∈ (1/λ(1), 1/λ(n)), where λ(1) ≤ … ≤ λ(n) are the ordered eigenvalues of W
               Use W* = diag{1/w_i+}W (row-normalized W), so that B = αW* with |α| < 1
            Remarks
               Well suited to efficient likelihood-based fitting, but less convenient for MCMC and Gibbs sampling
               Commonly used in Econometrics
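
To make the SAR construction concrete, the sketch below forms B = αW* from a row-normalized proximity matrix, computes the implied covariance of Y, and draws a few realizations by solving (I – B)Y = ε. The chain-shaped W, α = 0.5, D* = I, and the function names are illustrative assumptions.

```python
import numpy as np

def sar_covariance(B, D_star):
    """Covariance (I - B)^{-1} D* ((I - B)^{-1})' of Y when (I - B) Y = eps."""
    n = B.shape[0]
    A_inv = np.linalg.inv(np.eye(n) - B)
    return A_inv @ D_star @ A_inv.T

def simulate_sar(B, D_star, n_draws, seed=0):
    """Draw Y = (I - B)^{-1} eps with independent eps_i ~ N(0, sigma_i^2)."""
    rng = np.random.default_rng(seed)
    n = B.shape[0]
    # Each row of eps is one vector of independent innovations.
    eps = rng.normal(size=(n_draws, n)) * np.sqrt(np.diag(D_star))
    return np.linalg.solve(np.eye(n) - B, eps.T).T

# Made-up binary proximity matrix: four units in a chain.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W_star = W / W.sum(axis=1, keepdims=True)   # rows sum to 1
B = 0.5 * W_star                            # alpha = 0.5, so |alpha| < 1
D_star = np.eye(4)
Sigma = sar_covariance(B, D_star)
draws = simulate_sar(B, D_star, n_draws=5)
print(Sigma)
```

Because the innovations are independent, simulation only needs one linear solve per draw, which is part of why likelihood-based fitting of SAR models is efficient.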


                                    Modeling Areal Data
                                         Example
      The above two models are usually placed on the spatial frailties (random effects)
      rather than on the response variables themselves.
      Ex. Fig. 6. Posterior mean spatial frailties, Iowa cancer data (Banerjee et al., 2004)




                    Hierarchical Bayesian Modeling

    Can be applied to any type of spatial data
    All that you need
            Data: Y(s) with covariates X(s)
            Likelihood for the data: f(Y(s)|X(s), θ), usually exponential family
            Prior on θ: π(θ), usually θ = (β, σ2, τ2, φ)
            Posterior distributions: π(θ|Y(s), X(s)) ∝ f(Y(s)| X(s), θ) π(θ)
            Just make sure that the posterior distribution is proper to be able to do
            Bayesian inference
            Do model checking and comparison
            Done!!!
            In practice, it is not that easy to set up the computational part for most models.
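
As one hedged illustration of the recipe above, the sketch below evaluates an unnormalized log posterior for θ = (β, σ², τ², φ) under a Gaussian likelihood with exponential covariance. The priors are assumptions chosen only to keep the example short (normal on β, inverse gamma on σ² and τ², uniform on φ); they are proper, which is one simple way to guarantee a proper posterior. The function name `log_posterior`, the bound `phi_max`, and the data are made up.

```python
import numpy as np

def log_posterior(theta, y, X, dist, phi_max=10.0):
    """Unnormalized log pi(theta | y, X), theta = (beta, sigma2, tau2, phi)."""
    theta = np.asarray(theta, dtype=float)
    p = X.shape[1]
    beta = theta[:p]
    sigma2, tau2, phi = theta[p:]
    if sigma2 <= 0 or tau2 <= 0 or not (0 < phi < phi_max):
        return -np.inf                      # outside the prior support
    # Gaussian log likelihood (up to a constant) with exponential covariance.
    Sigma = sigma2 * np.exp(-phi * dist) + tau2 * np.eye(len(y))
    resid = y - X @ beta
    _, logdet = np.linalg.slogdet(Sigma)
    loglik = -0.5 * (logdet + resid @ np.linalg.solve(Sigma, resid))
    # Illustrative priors (assumptions): beta ~ N(0, 100 I),
    # sigma2 and tau2 ~ Inverse-Gamma(2, 1), phi ~ Uniform(0, phi_max).
    log_prior = -0.5 * (beta @ beta) / 100.0
    for v in (sigma2, tau2):
        log_prior += -3.0 * np.log(v) - 1.0 / v
    return loglik + log_prior

# Made-up data: three sites on a line, intercept-only mean.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
dist = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1))
y = np.array([1.2, 0.8, 1.5])
X = np.ones((3, 1))
value = log_posterior([1.0, 1.0, 0.5, 1.0], y, X, dist)
print(value)
```

A function like this is the main ingredient a generic sampler, such as random-walk Metropolis, would need; writing an efficient sampler around it is where most of the computational effort goes.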



                      Illustration of Bayesian Approach:
                 Modeling Spatially Correlated Survival Data for
                       Individuals with Multiple Cancer
    Joint work with Dr. Dey and Prof. Sudipto Banerjee (UMN)
    Data Source: Surveillance, Epidemiology and End Results
    (SEER) Registry Data (1973-2001)
    Variables of interest
            Patient Level: Patient ID, County, Gender (2 levels), Marital Status (3
            levels), Number of Primaries, Survival status at study cutoff (Dead or
            Alive)
            Patient-Cancer Level: Cancer type, Age at diagnosis, Survival Time (in
            months) at cutoff
    Modeling Framework
            Cox Proportional Hazards Model
            Semi-parametric (mixture of betas) specification of baseline hazard
            Full MCAR for the spatial frailties
                 Modeling Spatially Correlated Survival Data for
                       Individuals with Multiple Cancer

    Proportional Hazards Model:
    $h(t_{ijk}) = h_0(t_{ijk}) \exp(x_{ijk}^T \beta + u_{(i,j)} + \nu_k + \phi_{ik})$
            i = 1, …, I (= 99 counties)
            j = 1, …, n_i (patients in county i)
            k ∈ {Colon, Rectum}
            h0(t) ≡ h0(t | η): the baseline hazard (modeled as a mixture of betas,
            Gelfand and Mallick, 1995)
            β: regression coefficients
            u_(i,j): patient-specific frailty
            ν_k: cancer-specific frailty
            φ_ik: spatial frailty associated with the kth cancer type
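
A short sketch of how the frailties act multiplicatively on the hazard. The constant baseline below is a hypothetical stand-in used only to make the example runnable; the talk models the baseline as a mixture of betas, which is not implemented here. All covariate, coefficient, and frailty values are made up.

```python
import numpy as np

def hazard(t, x, beta, u, nu, phi, baseline):
    """h(t) = h0(t) * exp(x' beta + u + nu + phi)."""
    return baseline(t) * np.exp(np.dot(x, beta) + u + nu + phi)

def constant_baseline(t, rate=0.01):
    """Hypothetical constant baseline hazard (an assumption, not the talk's model)."""
    return rate

# Two otherwise identical patients with the same cancer type, living in
# counties whose spatial frailties are 0.2 and -0.1.
common = dict(x=[1.0, 0.0], beta=[0.3, -0.2], u=0.0, nu=0.1,
              baseline=constant_baseline)
h_a = hazard(12.0, phi=0.2, **common)
h_b = hazard(12.0, phi=-0.1, **common)
ratio = h_a / h_b
print(ratio)
```

The baseline and all shared terms cancel in the ratio, so the hazard ratio between the two counties is exp(0.2 − (−0.1)) = exp(0.3), whatever form the baseline takes.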



            Modeling Spatially Correlated Survival Data for
                  Individuals with Multiple Cancer

    Likelihood Function



    Normal prior for the patient and cancer frailties




    MCAR prior for the spatial frailties



                 Modeling Spatially Correlated Survival Data for
                       Individuals with Multiple Cancer

    Dirichlet prior for the mixing weights
            η ~ Dirichlet(θ1)
    Inverted Wishart prior for Λ
    Uniform(0, 1) prior for α
    All full-conditional distributions are log-concave




                                        Remarks

    Spatial data arise very often in practice
    Ignoring spatial information may lead to spurious results
    There are a lot of established modeling techniques for spatial
    data but they are mostly limited by computational complexity
    and interpretability
            Ex. Moran’s I and Geary’s C
    The Bayesian approach is very straightforward and can
    “easily” be extended to accommodate more complex models
    Care has to be taken in interpreting the results
            Ex. Interpretation of the spatial autocorrelation parameter can vary
            across models.


Introduction to Spatial Data Statistics


              Thank you.