# Exploratory Data Analysis: Charts, Histograms and Correlation

```
Spatial Exploratory Data Analysis (SEDA): Maps, charts and statistical relationships
Overview
• Spatial Exploratory Data Analysis
– Introduction
• EDA
• Spatial Analysis
• SEDA
– Maps
– Distributions
– Relationships

• Outline of 2nd computer workshop

• Requirements for 1 page report

• NOT a review of statistics!!!
Introduction: Doing research

• Research is a process. It is iterative. It is messy and full of uncertainty
and false steps.

• 'Scientific process' (inductive method – from empirical data to theory):
– what are you interested in knowing?
– how can this be formalised (hypotheses)?
– how can this be operationalised (testable model)?
– gather the most appropriate data
– validate the model (data analysis - this may or may not include formal
statistical testing)
– 'laws' (firm, generalisable conclusions) -> Theory
– in reality you will often use both induction and deduction

• Application of formal statistical techniques (Excel, Minitab, SPSS, SAS,
GeoDa,..) is just a small part of this process
Exploratory Data Analysis

• “Exploratory data analysis is an attitude, NOT a bundle of
techniques” (Tukey 1977).

• “Let data speak for themselves”

• “Get a feel for the data”

• Basically inductive (from data to hypotheses, theory,…)

• Characteristics: Concentration on graphical procedures
[Figure: stem-and-leaf diagram]
Spatial analysis
• In essence geographical problems are about human
activities which vary in space

• This does not mean that we should ignore things that do
not vary spatially!

•   But we are 'experts' in looking at phenomena spatially

•   We need to find out if there is or is not any spatial pattern

• The essence of testing a geographical hypothesis is to find
out whether or not there is any plausible reason why a
phenomenon varies in space
Spatial analysis

• Quantitative analysis can tell us if patterns we see are
(statistically) significant (not “right” or “wrong”, only
degrees of uncertainty)
• Is a trend in our sample 'real' or is it just a chance
occurrence?
• Positive and negative relationships are interesting
• The absence of a discernible spatial pattern (i.e. no statistically
significant pattern) is also interesting, because it will
guide you to inquire further
• Does where something happens influence why and/or how
it happens?
Exploratory Spatial Data Analysis

• Exploring spatial patterns through maps,
histograms, boxplots, scatterplots…
• Identify outliers
• Find “hotspots”
• Formulate hypotheses
• Look for statistical relationships between variables
• Search for spatial spillovers
Describing distributions

•   Mean
•   Standard deviation
•   Variance
•   Skewness
•   Kurtosis
•   Median
•   Quartiles, Percentiles
•   Inter-quartile Range (IQR)
•   Maximum, minimum values
Looking for outliers: Boxplot
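[Figure: boxplot of the variable "Percent students" for London wards.
Labels, top to bottom: number of observations; outliers; upper hinge
(1.5 times IQR); 75th percentile; median; 25th percentile; lower hinge
(1.5 times IQR). The IQR is the box between the 25th and 75th percentiles.]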
Looking for outliers: Histogram
Mapping
Distributions and mapping
[Figures: box map and standard deviation map of Percent White British]
Relationships: Choosing a statistic depends on the question

•   What is A like?
•   Is A similar to B?
•   Is A different from B?
•   How much is A better/worse/different than B?
•   Are A and B related?     [correlation]
•   Does A affect B?
•   Does A cause B?          [regression]
Are A and B related?
• For example, you might want to know if the level of illness
in an area is associated with poverty
• Is there a relationship between health and wealth?
• Do increasing poverty levels lead to increasing ill-health?
• Do the variables co-vary consistently across space?
• Easiest approach is to graph one variable against the
other and look for associations
• Association is seen in the pattern of points
• Simplest pattern to spot and analyse is a linear
relationship (i.e. resembles a straight line), although
relationships could be curvilinear
Is there a relationship? How strong? Does
one cause the other?
Identifying relationships with scatterplots
[Figure: example scatterplots. Panel titles: Strong positive; Strong
negative; Random; Positive; Negative; No relationship?]
Relationship between variables
(y = dependent; x = independent)

Long term unemployment share = Constant + 0.3162*no qualification share + Error term
[Figure: correlation between unemployment rate and long-term share of
unemployed (standardized data), annotated with the correlation coefficient]

Now, how to objectively measure the strength
and direction of these relationships?
What is correlation?

0.73

• Correlation statistics allow you to measure the strength
and direction of an association between two variables
• Correlation provides a single number (correlation
coefficient) that summarises the level of co-variation between
points (it is a standardised measure of covariance)
• If a relationship is found, variables are said to be
correlated
• Useful for description, but also inferential (significance)
Types of correlation
Data type   Nominal                 Ordinal             Interval/Ratio

Display     2-way table             2-way table         Scatterplot

Direction   Not applicable          Sign of Spearman    Sign of Pearson correlation
                                    correlation         (Spearman if not linear or normal)

Strength    Size of Cramer’s V      Size of Spearman    Size of Pearson correlation
            or lambda               correlation         (Spearman if not linear or normal)

Test        Pearson chi-square or   Test if Spearman    Test if Pearson r = 0
            Fisher’s exact          rho = 0             (test Spearman rho if non-normal)
Correlation
• Assumes a linear association between variables
• Pearson's correlation coefficient (known as r) is the most
commonly used correlation measure of linear relationships
between 2 variables
– (Spearman's rank correlation for non-linear ordered relationships)
• Statistic measuring relationships between variables of
interval (continuous) data (e.g. census)
• Census variables are interval data. The values are
continuous, ranging from 0 to a maximum
• Generally put the 'explanatory' (independent) variable on
the x-axis
• The variable you want to 'explain' (dependent) is on the y-axis
How to interpret Pearson’s correlation?

r is a measure of how tightly
the points cluster
around an imaginary
straight line through the
scatterplot

• r is a ‘dimensionless’ number and can only be between -1 and 1
– an r of 1 = perfect positive relationship
– an r of -1 = perfect negative relationship
– an r of 0 = no linear relationship
Rule of thumb for interpreting the magnitude of r
(the extent to which points cluster tightly around the straight line)

Negative range    Description     Positive range
0.00              None            0.00
-0.19 to -0.01    ‘Very weak’     0.01 to 0.19
-0.39 to -0.20    ‘Weak’          0.20 to 0.39
-0.69 to -0.40    ‘Modest’        0.40 to 0.69
-0.89 to -0.70    ‘Strong’        0.70 to 0.89
-0.99 to -0.90    ‘Very strong’   0.90 to 0.99
-1.00             Perfect         1.00
Significance testing
• Can test to see whether r is statistically significant
• Key is the size of r and the size of the sample
• Seeking to reject the null hypothesis (H0) that the correlation
coefficient is zero
• Pearson's correlation coefficient can be tested only if both
variables are normally distributed
• (If not – test Spearman's correlation coefficient)
• Look up r against a table of critical values for the given
degrees of freedom. If r is bigger (in absolute value), can reject H0
• A statistics package will report a p-value, a measure of
significance
• If the p-value is less than 0.05, the correlation is significantly
different from zero at the 5% significance level
• Can also use a t-statistic, again checking if the critical value is
exceeded
Correlation limitations
• With big sample sizes, almost everything is significantly related in
purely statistical terms
• Only works with linear relationships
• Correlation is not causation. A high r may mean any one of these:
–   A causes B
–   some other factor causes A and B
–   B causes A
–   it's just chance; another sample will be different
• Need to use your knowledge, experience and common sense as to the
likely underlying process. Is the relationship what you expect? Is it
plausible?
• Correlation is only concerned with the direction and strength of the
relationship between values of two variables
• Regression analysis determines the nature of that relationship and
enables us to make predictions from it
Statistics Sense
• Danny Dorling's 'Five Rules'
don't be - you are normal - just try to use a few more simple facts to
complex methods.”
• 1. Often there is little point in using statistics
• 2. If you do use statistics, make sure they can be understood
• 3. Do not overuse statistics in your work
• 4. If you find a complex statistic useful, explain it clearly
• 5. Recognise and harness the power of statistics in geography

• (Source: Chapter 21, “Using statistics to describe and explore data” in
Clifford and Valentine (2003) Key Methods in Geography)

• Danny Dorling's chapter in Clifford and Valentine (2003)
Key Methods in Geography (chapter 21, “Using statistics
to describe and explore data”)
• 2 good stats books without equations!
– Derek Rowntree (1981) Statistics without Tears: An Introduction for
Non-Mathematicians (Science MATHEMATICS L5 ROW)
– Michael Wood (2003) Making Sense of Statistics: A Non-Mathematical
Approach (Main)
• I strongly recommend David Ebdon, Statistics in
Geography (Science GEOGRAPHY D 62 EBD)
• Peter Rogerson (2001), Statistical Methods for Geography
(Science GEOGRAPHY D 62 ROG)
• Kenneth Berk and Patrick Carey, Data analysis with
Microsoft Excel (Bartlett ARCHITECTURE BA 4.2 BER)

```
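
The boxplot slide flags outliers as points lying more than 1.5 times the IQR beyond the quartiles. A minimal Python sketch of that rule follows. It is not the workshop's software: the function names and the sample "percent students" values are hypothetical, and quartile conventions differ between packages, so the exact fences may not match a given tool.

```python
import statistics


def boxplot_fences(values, k=1.5):
    """Return quartiles, IQR and the k * IQR fences used to flag outliers."""
    # 'inclusive' treats the data as the whole population of areas;
    # other packages may interpolate quartiles slightly differently.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": q1 - k * iqr,
        "upper_fence": q3 + k * iqr,
    }


def find_outliers(values, k=1.5):
    """List the observations that fall outside the fences."""
    stats = boxplot_fences(values, k)
    return [
        v for v in values
        if v < stats["lower_fence"] or v > stats["upper_fence"]
    ]


# Hypothetical 'percent students' values for ten wards (illustration only).
percent_students = [2, 3, 4, 4, 5, 5, 6, 7, 8, 30]
print(boxplot_fences(percent_students))
print(find_outliers(percent_students))
```

With these made-up values only the ward at 30 percent lies beyond the upper fence, which is the kind of observation a boxplot or box map would draw attention to.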
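
The slides describe Pearson's r as a standardised measure of covariance, give a rule of thumb for its magnitude, and mention a t-statistic for testing whether r differs from zero. The sketch below puts those three pieces together in plain Python. The function names and the five data points are assumptions made for illustration; the t formula is the standard one, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom.

```python
import math


def pearson_r(x, y):
    """Pearson's r: co-variation of x and y scaled by their own variation."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two equal-length samples of at least 3 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t-statistic for H0: correlation is zero (n - 2 degrees of freedom)."""
    if abs(r) >= 1:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1 - r * r))


def describe_strength(r):
    """Label |r| using the rule-of-thumb bands from the slides."""
    size = round(abs(r), 2)
    if size == 0:
        return "none"
    if size >= 1:
        return "perfect"
    bands = [(0.19, "very weak"), (0.39, "weak"), (0.69, "modest"),
             (0.89, "strong"), (0.99, "very strong")]
    for upper, label in bands:
        if size <= upper:
            return label
    return "very strong"


# Hypothetical paired observations for five areas (illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 3), describe_strength(r), round(t_statistic(r, len(x)), 3))
```

For these five points r is about 0.77 ('strong' on the rule of thumb), yet t is about 2.12, below the two-tailed 5% critical value of 3.182 for 3 degrees of freedom. This illustrates the point that significance depends on both the size of r and the size of the sample.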
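
The slides suggest Spearman's rank correlation when a relationship is ordered but not linear. One common way to compute it, sketched here under that assumption, is to rank each variable (ties get the average rank) and apply Pearson's r to the ranks. The helper names and the example data are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of observations tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson's r computed from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical curvilinear but perfectly ordered relationship (y = x cubed).
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(round(pearson_r(x, y), 3), round(spearman_rho(x, y), 3))
```

Here the ordering is perfect, so Spearman's rho is 1, while Pearson's r is lower (about 0.94) because the points do not lie on a straight line.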