Exploratory Data Analysis Charts Histograms and Correlation

Spatial Exploratory Data Analysis (SEDA):
Maps, charts and statistical relationships
• Spatial Exploratory Data Analysis
    – Introduction
        • EDA
        • Spatial Analysis
        • SEDA
    – Maps
    – Distributions
    – Relationships

• Outline of 2nd computer workshop

• Requirements for 1 page report

• NOT a review of statistics!!!
Introduction: Doing research

• Research is a process. It is iterative. It is messy and full of uncertainty
  and false steps.

• ‘Scientific process’ (inductive method – from empirical data to theory):
    – what are you interested in knowing?
    – how can this be formalised (hypotheses)?
    – how can this be operationalised (testable model)?
    – gather the most appropriate data
    – validate the model (data analysis - this may or may not include formal
      statistical testing)
    – ‘laws’ (firm, generalisable conclusions) -> Theory
    – in reality you will often use both induction and deduction

• The application of formal statistical techniques (Excel, Minitab, SPSS, SAS,
  GeoDa, ...) is just a small part of this process
Exploratory Data Analysis

• “Exploratory data analysis is an attitude, NOT a bundle of
  techniques” (Tukey 1977).

• “Let data speak for themselves”

• “Get a feel for the data”

• Basically inductive (from data to hypotheses, theory,…)

• Characteristics: Concentration on graphical procedures
Stem and Leaf Diagram
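
A stem-and-leaf diagram can be produced without any plotting software. The sketch below is one minimal way to do it, assuming non-negative whole-number values with the tens as the stem and the units as the leaf. The function name and the sample values are hypothetical, invented purely for illustration.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return text rows of a stem-and-leaf diagram (stem = tens, leaf = units).

    Assumes non-negative integers; this is an illustrative sketch only.
    """
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    rows = []
    # Print every stem between the smallest and largest, even empty ones,
    # so that gaps in the distribution stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        rows.append(f"{stem:>2} | {leaves}")
    return rows


# Hypothetical percentages, not taken from the lecture data.
sample = [12, 15, 21, 23, 23, 28, 34, 35, 51]
for row in stem_and_leaf(sample):
    print(row)
```

The empty stem for the 40s and the lone value in the 50s are exactly the kind of feature this display is meant to make visible at a glance.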
Spatial analysis
• In essence geographical problems are about human
  activities which vary in space

• This does not mean that we should ignore things that do
  not vary spatially!

•   But we are ‘experts’ in looking at phenomena spatially

•   We need to find out if there is or is not any spatial pattern

• The essence of testing a geographical hypothesis is to find
  out whether or not there is any plausible reason why the
  phenomenon varies in space
Spatial analysis

• Quantitative analysis can tell us if patterns we see are
  (statistically) significant (not “right” or “wrong”, only
  degrees of uncertainty)
• Is a trend in our sample ‘real’ or is it just chance?
• Positive and negative relationships are interesting
• A lack of discernible spatial causes (i.e. no statistically
  significant patterns) is also interesting because it will
  guide you to inquire further
• Does where something happens influence why and/or how
  it happens?
Exploratory Spatial Data Analysis

• Exploring spatial patterns through maps,
  histograms, boxplots, scatterplots…
• Identify outliers
• Find “hotspots”
• Formulate hypotheses
• Look for statistical relationships between variables
• Search for spatial spillovers
Describing distributions

•   Mean
•   Standard deviation
•   Variance
•   Skewness
•   Kurtosis
•   Median
•   Quartiles, Percentiles
•   Inter-quartile Range (IQR)
•   Maximum, minimum values
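
The measures listed above can all be computed with Python's standard library. The following is a minimal sketch, assuming a small list of numeric values. The function name `describe` and the sample values are hypothetical, and the skewness and kurtosis use the simple moment-based (population) formulas, which is only one of several conventions that statistics packages use.

```python
import statistics


def describe(values):
    """Summarise a distribution with the measures listed on the slide."""
    n = len(values)
    mean = statistics.mean(values)
    # Central moments, used for moment-based skewness and excess kurtosis.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    # Quartiles; the "inclusive" method treats the data as the full population.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean,
        "sd": statistics.stdev(values),           # sample standard deviation
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,             # excess kurtosis (normal = 0)
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "min": min(values),
        "max": max(values),
    }


# Hypothetical values, invented for illustration only.
stats = describe([2, 4, 4, 5, 7, 9, 25])
for name, value in stats.items():
    print(f"{name:>9}: {value:.3f}")
```

Note how the single large value pulls the mean above the median and gives a positive skewness, which is why the median and IQR are often preferred for skewed data.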
Looking for outliers: Boxplot
[Figure: boxplot of the variable Percent students across London wards, showing the number of observations, the median, the 25th and 75th percentiles (which bound the IQR), and the upper and lower hinges at 1.5 times the IQR]
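
The 1.5 times IQR rule behind the boxplot can be sketched in a few lines. This is a minimal illustration, assuming the usual convention that anything more than 1.5 IQRs below the 25th percentile or above the 75th percentile is flagged as an outlier. The function name and the values are hypothetical and are not the London ward data.

```python
import statistics


def iqr_outliers(values, hinge=1.5):
    """Return (outliers, (lower_fence, upper_fence)) using the hinge * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - hinge * iqr
    upper = q3 + hinge * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)


# Hypothetical "percent students" values, invented for illustration.
percent_students = [2, 4, 4, 5, 7, 9, 25]
outliers, fences = iqr_outliers(percent_students)
print("fences:", fences)
print("outliers:", outliers)
```

Software that offers a second hinge setting (for example 3.0) applies the same calculation with a larger multiplier, which flags only the most extreme observations.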
Looking for outliers: Histogram
Linked windows / brushing
Distributions and mapping
[Figures: box map and standard deviation map of Percent White British; maps of Percent Bangladeshi and Percent White British]
Relationships: Choosing a statistic depends
on the questions you ask

•   What is A like?
•   Is A similar to B?
•   Is A different from B?
•   How much is A better/worse/different than B?
•   Are A and B related?     [correlation]
•   Does A affect B?
•   Does A cause B?          [regression]
Are A and B related?
• For example, you might want to know if the level of illness
  in an area is associated with poverty
• Is there a relationship between health and wealth?
• Do increasing poverty levels lead to increasing ill-health?
• Do the variables co-vary consistently across space?
• Easiest approach is to graph one variable against the
  other and look for associations
• Association is seen in the pattern of points
• Simplest pattern to spot and analyse is a linear
  relationship (i.e. resembles a straight line), although
  relationships could be curvilinear
Is there a relationship? How strong? Does
one cause the other?
Identifying relationships with scatterplots
[Figure: scatterplot panels labelled strong positive, strong negative, random, negative, and no relationship?]
   Relationship between variables
   (y = dependent; x = independent)

Long term unemployment share = Constant + 0.3162*no qualification share + Error term
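
An equation of this form (dependent variable = constant + slope * independent variable + error) is what an ordinary least-squares fit produces. The sketch below shows the calculation for a single explanatory variable. The function name and the numbers are hypothetical: they are not the data behind the coefficient quoted above, and are chosen only so the arithmetic is easy to follow.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical shares (percent), invented for illustration.
no_qualification_share = [10, 20, 30, 40, 50]      # x = independent
long_term_unemployment_share = [12, 14, 19, 21, 24]  # y = dependent

intercept, slope = fit_line(no_qualification_share, long_term_unemployment_share)
print(f"y = {intercept:.2f} + {slope:.2f} * x")

# The error term is whatever the line fails to explain for each observation.
residuals = [
    y - (intercept + slope * x)
    for x, y in zip(no_qualification_share, long_term_unemployment_share)
]
print("residuals:", [round(e, 2) for e in residuals])
```

With an intercept in the model the residuals always sum to zero, so the interesting question is how large they are individually, not on average.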
Correlation between unemployment rate and long-
term share of unemployed (standardized data)


Now, how can we objectively measure the strength
and direction of these relationships?
What is correlation?


• Correlation statistics allow you to measure the strength
  and direction of an association between two variables
• Correlation provides a single number (the correlation
  coefficient) that summarises how closely the two variables
  vary together (it is a standardised measure of covariance)
• If a relationship is found, the variables are said to be
  correlated
• Useful for description, but also inferential (significance)
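
To make "a standardised measure of covariance" concrete, the sketch below computes Pearson's r as the covariance of two variables divided by the product of their standard deviations. The function name and the poverty and illness scores are hypothetical, invented for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson's r: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)


# Hypothetical area scores, invented for illustration.
poverty = [1, 2, 3, 4, 5]
illness = [2, 4, 5, 4, 5]

r = pearson_r(poverty, illness)
print(f"r = {r:.3f}")
```

Dividing by the standard deviations removes the units, so the same r results whether poverty is measured in pounds, percentages or ranks on a scale.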
Types of correlation
Data type   Nominal                               Ordinal                       Interval/Ratio
Display     2-way table                           2-way table                   Scatterplot
Direction   Not applicable                        Sign of Spearman correlation  Sign of Pearson correlation (Spearman if not linear)
Strength    Size of Cramer’s V or lambda          Size of Spearman correlation  Size of Pearson correlation (Spearman if not linear)
Test        Pearson chi-square or Fisher’s exact  Test if Spearman rho = 0      Test if Pearson r = 0 (test Spearman if non-normal)
• Assumes a linear association between variables
• Pearson’s correlation coefficient (known as r) is the most
  commonly used correlation measure of linear relationships
  between 2 variables
   – (Spearman’s rank correlation for non-linear ordered relationships)
• Statistic measuring relationships between variables of
  interval (continuous) data (e.g. census)
• Census variables are interval data. The values are
  continuous, ranging from 0 to a maximum
• Generally put the ‘explanatory’ (independent) variable on
  the x-axis
• The variable you want to ‘explain’ (dependent) goes on the
  y-axis
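
The difference between Pearson's and Spearman's coefficients shows up clearly when a relationship is ordered but not linear. In the sketch below, Spearman's coefficient is computed as Pearson's r applied to the ranks of the data, with tied values given their average rank. The function names and the data (y is simply x cubed) are hypothetical choices made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson's r from sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data: y always rises with x, but along a curve.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print(f"Pearson r    = {pearson_r(x, y):.3f}")
print(f"Spearman rho = {spearman_rho(x, y):.3f}")
```

Spearman's coefficient reaches its maximum because the ordering of the two variables agrees perfectly, while Pearson's r falls short of 1 because the points do not lie on a straight line.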
How to interpret Pearson’s correlation?

• r is a measure of how tightly the points cluster around an
  imaginary straight line drawn through them
• r is a ‘dimensionless’ number and can only lie between -1 and 1
   – an r of 1 = perfect positive relationship
   – an r of -1 = perfect negative relationship
   – an r of 0 = no linear relationship
Rule of thumb for interpreting the magnitude of r
(the extent to which points cluster tightly around the straight line)

Negative range    Description     Positive range
0.00              None            0.00
-0.19 to -0.01    ‘Very weak’     0.01 to 0.19
-0.39 to -0.20    ‘Weak’          0.20 to 0.39
-0.69 to -0.40    ‘Modest’        0.40 to 0.69
-0.89 to -0.70    ‘Strong’        0.70 to 0.89
-0.99 to -0.90    ‘Very strong’   0.90 to 0.99
-1.00             Perfect         1.00
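
The rule of thumb can be written as a small lookup. This sketch assumes r is rounded to two decimal places before it is classified, since the table is given in steps of 0.01. The function name is hypothetical.

```python
def describe_r(r):
    """Return (description, direction) for r using the rule-of-thumb table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = round(abs(r), 2)  # the table works in steps of 0.01
    if size == 0.0:
        label = "None"
    elif size < 0.20:
        label = "Very weak"
    elif size < 0.40:
        label = "Weak"
    elif size < 0.70:
        label = "Modest"
    elif size < 0.90:
        label = "Strong"
    elif size < 1.00:
        label = "Very strong"
    else:
        label = "Perfect"
    if size == 0.0:
        direction = "none"
    else:
        direction = "positive" if r > 0 else "negative"
    return label, direction


for value in (0.35, -0.75, 0.0, 0.95, -1.0):
    print(value, describe_r(value))
```

The labels are only a convention for describing magnitude; they say nothing about whether the correlation is statistically significant.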
Significance testing
• Can test to see whether the r is statistically significant
• Key is the size of r and the size of sample
• Seeking to reject the null hypothesis that the correlation
  coefficient is zero
• Pearson’s correlation coefficient can be tested only if both
  variables are normally distributed
• (If not – test Spearman’s correlation coefficient)
• Look up r against a table of critical values for the given
  degrees of freedom. If it is bigger, you can reject H0
• A statistics package will report a p-value: the probability of
  getting an r at least this far from zero by chance if H0 were true
• If the p-value is less than 0.05, the correlation is significantly
  different from zero at the 5% significance level
• Can also use a t-statistic, again checking whether the critical
  value is exceeded
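
The t-statistic route can be sketched as follows. The standard conversion is t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, which shows why both the size of r and the size of the sample matter. The sample size of 10 and the r values are hypothetical, and the critical value 2.306 is the two-tailed 5% value of Student's t for 8 degrees of freedom, hard-coded here in place of a lookup table.

```python
import math


def t_statistic(r, n):
    """Convert a correlation r from a sample of size n into a t-statistic.

    Uses n - 2 degrees of freedom; undefined when r is exactly 1 or -1.
    """
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Two-tailed 5% critical value of Student's t with 8 degrees of freedom
# (n = 10). For other sample sizes, look up the matching table entry.
T_CRITICAL_DF8 = 2.306

n = 10  # hypothetical sample size
for r in (0.70, 0.50):
    t = t_statistic(r, n)
    verdict = "reject H0" if abs(t) > T_CRITICAL_DF8 else "cannot reject H0"
    print(f"r = {r:.2f}, t = {t:.3f}: {verdict}")

# The same r becomes easier to distinguish from zero as the sample grows.
print(f"r = 0.50 with n = 100 gives t = {t_statistic(0.50, 100):.3f}")
```

With only 10 observations a ‘Modest’ r of 0.50 cannot be distinguished from zero, whereas the same r from 100 observations gives a much larger t.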
Correlation limitations
• With big sample sizes, almost everything is significantly related in
  purely statistical terms
• Only works with linear relationships
• Correlation is not causation. A high r may mean any one of these:
    –   A causes B
    –   some other factor causes A and B
    –   B causes A
    –   it’s just chance; another sample would be different
• Need to use your knowledge, experience and common-sense as to
  likely underlying process. Is the relationship what you expect? Is it
• Correlation is only concerned with the direction and strength of the
  relationship between values of two variables
• Regression analysis determines the nature of that relationship and
  enables us to make predictions from it
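
The ‘only works with linear relationships’ limitation is easy to demonstrate. In the sketch below, y is completely determined by x (it is x squared), yet Pearson's r is zero because the relationship is a symmetric curve and not a straight line. The function name and data are hypothetical, chosen for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson's r from sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: a perfect U-shaped (curvilinear) relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r_curve = pearson_r(x, y)
print(f"r for y = x squared: {r_curve:.3f}")

# Restricting attention to the right-hand half reveals a strong association.
r_half = pearson_r(x[3:], y[3:])
print(f"r for x >= 0 only:   {r_half:.3f}")
```

A correlation near zero therefore means ‘no linear relationship’, not ‘no relationship’, which is why it pays to look at the scatterplot before trusting the coefficient.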
Statistics Sense
• Danny Dorling’s ‘Five Rules’
• “If you have been concerned about your insecurities with statistics,
  don’t be - you are normal - just try to use a few more simple facts to
  strengthen your arguments and try to feel less intimidated about the
  complex methods.”
• 1. Often there is little point in using statistics
• 2. If you do use statistics, make sure they can be understood
• 3. Do not overuse statistics in your work
• 4. If you find a complex statistic useful, explain it clearly
• 5. Recognise and harness the power of statistics in geography

• (Source: Chapter 21, “Using statistics to describe and explore data” in
  Clifford and Valentine (2003) Key Methods in Geography)
Further reading

• Danny Dorling’s chapter in Clifford and Valentine (2003)
  Key Methods in Geography (chapter 21, “Using statistics
  to describe and explore data”)
• 2 good stats books without equations!
   – Derek Rowntree (1981) Statistics without Tears: An Introduction for
     Non-Mathematicians (Science MATHEMATICS L5
   – Michael Wood (2003) Making Sense of Statistics: A Non-
     Mathematical Approach (Main)
• I strongly recommend David Ebdon, Statistics in
  Geography (Science GEOGRAPHY D 62 EBD)
• Peter Rogerson (2001), Statistical Methods for Geography
  (Science GEOGRAPHY D 62 ROG)
• Kenneth Berk and Patrick Carey, Data analysis with
  Microsoft Excel (Bartlett ARCHITECTURE BA 4.2 BER)
