             Descriptive Statistics and
            Exploratory Data Analysis

  • Quantitative (continuous) variables
      • Scatterplots (two variables; use color or symbol to add 3rd variable)
      • Starplots
      • Correlation coefficient
  • Qualitative (categorical) variables
      • Contingency (two-way) tables
      • Joint, marginal, conditional distribution
      • Simpson’s paradox (confounding)
      • Interaction

Fall 2002                        Biostat 511                               50

      A scatterplot offers a convenient way of
      visualizing the relationship between pairs of
      quantitative variables.

     Many interesting features can be seen in a
     scatterplot, including the overall pattern (e.g.
     linear, nonlinear, periodic), the strength and direction
     of the relationship, and outliers (values that are
     far from the bulk of the data).

            Scatterplot showing nonlinear relationship



            Scatterplot showing daily rainfall amount (mm)
            at nearby stations in SW Australia. Note
            outliers (O). Are they data errors … or
            interesting science?!
            Presentation matters!

   - Important information can be seen in two
     dimensions that isn’t obvious in one dimension

       Use symbols or colors to add a third variable

              Plots for Multivariate data

     Star plots are used to display multivariate data

      • Each ray corresponds to a variable
      • Rays scaled from smallest to largest value in
        the data


   How can we summarize the “strength of
   association” between two variables in a


   When two variables are measured on a scale in
   which order is meaningful, you can calculate a
   correlation coefficient that measures the strength
   of the association between the two variables.

   There are two common correlation measures:

   1. Pearson Correlation Coefficient: Based on
   the actual data values. Measure of linear
   association. Natural when each variable has a
   normal distribution.
   2. Spearman Rank Correlation: Based on ranks
   of each variable (ranks assigned separately).
   Useful measure of the monotone association,
   which may not be linear.

            Pearson’s Correlation Coefficient

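   The correlation between two variables X and Y is:

   $$ R = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$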

   • No distinction between x and y.
   • The correlation is constrained: -1 ≤ R ≤ +1
   • | R | = 1 means “perfect linear relationship”
   • The correlation is a scale free measure
     (correlation doesn’t change if there is a linear
     change in units).
   • Pearson’s correlation only measures strength of
     linear relationship.
   • Pearson’s correlation is sensitive to outliers.
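
The properties listed above can be checked numerically. The following is a minimal sketch, assuming the usual definition of Pearson's correlation (cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name `pearson_r` and the height/weight values are made up for illustration and are not from the course data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (illustration only): heights in cm, weights in kg.
height_cm = [150, 160, 165, 172, 180, 188]
weight_kg = [52, 58, 63, 70, 77, 90]

r = pearson_r(height_cm, weight_kg)

# No distinction between x and y: swapping the arguments gives the same R.
r_swapped = pearson_r(weight_kg, height_cm)

# Scale free: a linear change of units (cm -> inches, kg -> lb) leaves R unchanged.
height_in = [h / 2.54 for h in height_cm]
weight_lb = [w * 2.2046 for w in weight_kg]
r_rescaled = pearson_r(height_in, weight_lb)

# Sensitive to outliers: one aberrant point (tall but very light) pulls R down.
r_outlier = pearson_r(height_cm + [190], weight_kg + [40])

print(round(r, 3), round(r_swapped, 3), round(r_rescaled, 3), round(r_outlier, 3))
```

With these made-up numbers, the first three values agree and the fourth is much smaller, which is the behavior the bullets describe.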

            Perfect positive correlation (R = 1)

            Perfect negative correlation (R = -1)

            Uncorrelated (R = 0) but dependent
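
A small numeric illustration of "uncorrelated but dependent", as a sketch: here y is completely determined by x (y = x squared), yet Pearson's R is zero because the relationship is not linear. The data and the helper name `pearson_r` are assumptions for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# x is symmetric about zero and y depends on x exactly, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print(r)
```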
            Pearson’s Correlation Coefficient

                Correlation = .8776

            Suppose we restrict the range of X …

                         Correlation = .5111

            • relationship between LSAT and GPA
              among law school students
            • relationship between height and
              basketball ability among NBA players
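
The effect of restricting the range of X can be reproduced with simulated data. This is a sketch under stated assumptions: X is standard normal, Y equals X plus independent noise, and the "restricted" group keeps only observations with X above 1. The seed, sample size, and cutoff are arbitrary choices, and the simulated correlations are not the .8776 and .5111 shown on the slide.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(511)  # arbitrary seed so the run is repeatable

# Simulated population: Y is linearly related to X plus noise.
x = [random.gauss(0, 1) for _ in range(5000)]
y = [a + random.gauss(0, 0.5) for a in x]
r_full = pearson_r(x, y)

# Restrict the range of X, e.g. only "admitted" units with high X.
kept = [(a, b) for a, b in zip(x, y) if a > 1.0]
r_restricted = pearson_r([a for a, _ in kept], [b for _, b in kept])

print(len(kept), round(r_full, 3), round(r_restricted, 3))
```

The underlying relationship is identical in both groups; only the spread of X differs, and the correlation in the restricted group is noticeably weaker.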

              Spearman Rank Correlation

        • A nonparametric analogue to Pearson’s
          correlation coefficient is Spearman’s rank
          correlation coefficient. Use Spearman’s
          correlation when the assumption of
          normality of X and Y is not met.
        • A measure of monotonic association (not
          necessarily linear)
        • Based on the ranked data
            • Rank each sample separately (1 … N)
            • Compute Pearson’s correlation on the ranks
        • -1 ≤ Rs ≤ 1
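
The recipe for Spearman's rank correlation can be sketched in a few lines: rank each variable separately, then apply Pearson's formula to the ranks. The helper names (`ranks`, `pearson_r`, `spearman_r`) and the data are assumptions for illustration; tied values receive the average of their ranks, which is one common convention.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank values 1..N; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman_r(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Monotone but strongly nonlinear relationship (illustrative data).
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(v) for v in x]

rs = spearman_r(x, y)
r = pearson_r(x, y)
print(round(rs, 3), round(r, 3))
```

Because y increases whenever x increases, the rank correlation reaches its upper bound, while Pearson's correlation on the raw values is smaller since the relationship is not linear.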

            Two-way (Contingency) Tables

 Now we turn our attention to relationships between
 pairs of qualitative (categorical, discrete) measures.

 Types of Categorical Data:

 Often we wish to assess whether two factors are
 related. To do so we construct an R x C table that
 cross-classifies the observations according to the
 two factors. Such a table is called a two-way or
 contingency table.

                    Two-way tables

  Example. Education versus willingness to
  participate in a study of a vaccine to prevent HIV
  infection if the study was to start tomorrow. Counts,
  percents and row and column totals are given.

    The table displays the joint distribution of
    education and willingness to participate.

                    Two-way tables

   The marginal distributions of a two-way table
   are simply the distributions of each measure
   summed over the other.
   E.g. Willingness to participate

                    Two-way tables

   A conditional distribution is the distribution of
   one measure conditional on (i.e. given) the value of
   the other measure.
   E.g. Willingness to participate among those with a
   college education.
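
The joint, marginal, and conditional distributions can all be computed from the same table of counts. The sketch below uses hypothetical counts; the actual table from the slides is not reproduced here, and the category labels and variable names are assumptions.

```python
# Hypothetical counts: rows are education, columns are willingness to participate.
counts = {
    "less than college": {"definitely not": 40, "probably not": 60,
                          "probably": 120, "definitely": 80},
    "college":           {"definitely not": 30, "probably not": 50,
                          "probably": 90, "definitely": 30},
    "graduate/prof":     {"definitely not": 20, "probably not": 30,
                          "probably": 40, "definitely": 10},
}
total = sum(sum(row.values()) for row in counts.values())

# Joint distribution: each cell count divided by the grand total.
joint = {edu: {w: c / total for w, c in row.items()}
         for edu, row in counts.items()}

# Marginal distribution of willingness: sum the joint over education.
marginal_willingness = {}
for row in counts.values():
    for w, c in row.items():
        marginal_willingness[w] = marginal_willingness.get(w, 0) + c / total

# Marginal distribution of education: row totals divided by the grand total.
marginal_education = {edu: sum(row.values()) / total
                      for edu, row in counts.items()}

# Conditional distribution of willingness given a college education:
# divide the "college" row by its own row total, not by the grand total.
college = counts["college"]
conditional_given_college = {w: c / sum(college.values())
                             for w, c in college.items()}

print(marginal_willingness)
print(conditional_given_college)
```

The only difference between a joint and a conditional proportion is the denominator: the grand total for the joint distribution, the row (or column) total for the conditional one.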

                    Two-way tables

 What proportion of individuals …
 • will definitely participate?
 • have less than college education?
 • will probably or definitely participate given less
   than college education?
 • who will probably or definitely participate
   have less than college education?
 • have a graduate/prof degree and will definitely
   not participate?

                  Three-way tables

  There are two phenomena that can confuse our
  interpretation of two-way tables. In each case a
  third measure is involved.

  Simpson’s Paradox - Also known as confounding
  in the epidemiology literature. MM refer to this as
  the “lurking variable” problem. Aggregating over a
  third (lurking) variable results in incorrect
  interpretation of the association between the two
  primary variables of interest.

  Interaction - Also known as effect modification in
  the epidemiology literature. The degree of
  association between the two primary variables
  depends on a third variable.
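
A minimal numeric sketch of interaction, using made-up counts: the association between exposure and disease (measured here by the risk ratio, which is one possible choice of measure) is different at each level of a third variable. All names and numbers are hypothetical.

```python
def risk_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)

# Hypothetical counts per level of the third variable:
# (cases among exposed, number exposed, cases among unexposed, number unexposed)
strata = {
    "third variable = level 1": (20, 200, 20, 200),
    "third variable = level 2": (60, 200, 20, 200),
}

rr_by_stratum = {name: risk_ratio(*cells) for name, cells in strata.items()}

for name, rr in rr_by_stratum.items():
    print(name, round(rr, 2))
```

In this made-up example the exposure is unrelated to disease at level 1 and triples the risk at level 2, so a single summary of "the" association would hide the dependence on the third variable.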

            Simpson’s Paradox (aka Confounding)

        “Condom Use increases the risk of STD”

                         BUT ...

    Explanation: Individuals with more partners are
    more likely to use condoms. But individuals with
    more partners are also more likely to get an STD.
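
The explanation can be made concrete with hypothetical counts (these are invented for illustration and are not the data behind the slide). Within each level of number of partners, condom users have the lower STD risk, yet the aggregated table shows the opposite because users are concentrated in the high-risk stratum.

```python
# Hypothetical data: (STD cases, number of people) by stratum and group.
data = {
    "few partners":  {"condom users": (2, 100),   "non-users": (27, 900)},
    "many partners": {"condom users": (180, 900), "non-users": (30, 100)},
}

def risk(cases, n):
    """Proportion of people with an STD."""
    return cases / n

# Risk within each stratum of the third (lurking) variable.
stratum_risks = {
    stratum: {group: risk(*cell) for group, cell in groups.items()}
    for stratum, groups in data.items()
}

# Risk after aggregating over the third variable.
aggregate = {}
for group in ("condom users", "non-users"):
    cases = sum(data[stratum][group][0] for stratum in data)
    n = sum(data[stratum][group][1] for stratum in data)
    aggregate[group] = risk(cases, n)

for stratum, risks in stratum_risks.items():
    print(stratum, risks)
print("aggregated", aggregate)
```

Comparing the stratum-specific rows with the aggregated row shows the reversal: the direction of the association changes once number of partners is ignored.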

            Interaction (aka Effect Modification)


  • Quantitative (continuous) variables
     Scatterplots - display relationship between two
     quantitative measures. Use colors or symbols to
     add a third (categorical) dimension.
     Starplots - display multivariate data.
     Correlation coefficient - summarizes the
     strength of the linear (Pearson’s) or monotonic
     (Spearman’s) relationship between two
     quantitative measures.

  • Qualitative (categorical) variables
     Contingency table – shows the joint distribution
     of the two variables, the marginal distributions of
     each variable and the conditional distribution of
     one variable for a fixed level of the other variable.
     Simpson’s paradox and interactions can occur if
     a third variable influences the association between
     the two variables of interest.

            Guidelines for Tables and Graphs

     • Tables
         • Good for showing exact values, small amounts of data
         • Guidelines
     • Graphs
        •   Good for showing qualitative trends, large amounts of data
        •   Guidelines for graphical integrity

                   Tables and Graphs

        • Compact presentation of data
        • Visual appeal; readers feel that they are
          “seeing the data”
        • Tables are better for showing exact
          numerical values, small amounts of data
          and/or multiple localized comparisons
        • Graphs are better for highlighting
          qualitative aspects of the data and
          displaying large amounts of data.

        Guidelines for Tables (Ehrenberg, 1977)

     1. Give marginal averages to provide a visual
        focus.
     2. Order rows/columns by marginal averages or
       some other measure of size.
     3. Put groups to be compared in rows (i.e.
        scanning down columns for comparisons)
     4. Round to 2 effective digits
     5. Use layout to facilitate comparisons
     6. Give brief verbal summaries to lead reader to
        patterns and exceptions.
     7. Clearly label rows and columns, give units,
       source (if appropriate), title.
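
Guidelines 1, 2 and 4 are mechanical enough to sketch in code. The example below is an illustration under assumptions: it treats "effective digits" as significant digits (a simplification of Ehrenberg's notion), and the table values, labels, and the helper name `round_sig` are invented.

```python
import math

def round_sig(value, digits=2):
    """Round a number to the given count of significant digits."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)

# Hypothetical table: one row per group, one column per year.
table = {
    "Region A": [412.7, 455.2, 498.9],
    "Region B": [1203.4, 1150.8, 1321.6],
    "Region C": [87.3, 92.1, 101.4],
}
n_cols = 3

# Guideline 1: marginal (row and column) averages.
row_avg = {name: sum(values) / len(values) for name, values in table.items()}
col_avg = [sum(table[name][j] for name in table) / len(table)
           for j in range(n_cols)]

# Guideline 2: order rows by their marginal average, largest first.
ordered = sorted(table, key=row_avg.get, reverse=True)

# Guideline 4: round every displayed number to 2 digits.
for name in ordered:
    print(name, [round_sig(v) for v in table[name]], round_sig(row_avg[name]))
print("Average", [round_sig(v) for v in col_avg])
```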

  Unemployment in Great Britain (source: Facts in
  Focus, CSO, 1974).

  Note use of marginal averages and rounding.
  Table has been reordered so the reader can
  scan down the column for a time trend.

                    Statistical Graphics

      “Modern data graphics can do much more
      than simply substitute for small statistical
      tables. At their best, graphics are instruments
      for reasoning about quantitative information.
      Often the most effective way to describe,
      explore, and summarize a set of numbers -
      even a very large set - is to look at pictures of
      those numbers.”

       Edward R. Tufte
       The Visual Display of Quantitative Information
       Graphics Press, 1983

                 Graphical Integrity

   1. The representation of numbers, as physically
      measured on the surface of the graphic, should
      be directly proportional to the numerical
      quantities represented (e.g. purchasing power).
   2. Clear, detailed and thorough labeling should be
      used to defeat graphical distortion and
      ambiguity. Write out explanations of the data on
      the graphic itself. Label important events in the
      data. (e.g. Minard’s graphic)
   3. Focus on the data, not the design, and maximize
      the data:ink ratio (counterexample: USA Today).
   4. The number of information-carrying (variable)
      dimensions depicted should not exceed the
      number of dimensions in the data (e.g. OPEC
      oil prices).
   5. Do not quote data out of context (e.g. traffic
      deaths).

       A less distorted view …

            Data density - Compare ...


     • Tables
                • Good for showing exact values, small
                  amounts of data
                • Guidelines

    • Graphs
            •     Good for showing qualitative trends, large
                  amounts of data
            •     Guidelines for graphical integrity

                   Designing Studies

    • Design issues
       • Types of studies
            a. Experimental studies - Control, randomization, replication
            b. Observational
       • Controls
       • Blinding
       • Hawthorne effect
       • Longitudinal/cross-sectional
       • Dropout
    • Population vs Sample
       • Bias
       • Variability

                     Experimental Design

            “Obtaining valid results from a test program
            calls for commitment to sound statistical
            design. In fact, proper experimental design
            is more important than sophisticated
            statistical analysis. Results of a well-
            planned experiment are often evident from
            simple graphical analyses. However, the
            world’s best statistical analysis cannot
            rescue a poorly planned experiment.”

   Gerald Hahn, Encyclopedia of Statistical Science,
   page 359, entry for Design of Experiments

                     Types of Studies

     Most scientific studies can be classified into one
     of two broad categories:
     1) Experimental Studies
            The investigator deliberately sets one or
            more factors to a specific level.
     2) Observational Studies
            The investigator collects data from an
            existing situation and does not
            (intentionally) interfere with the running of
            the system.

    Experimental Studies
    • Sources of (major) variability are controlled
      by the researcher
    • Randomization is often used to ensure that
      uncontrolled factors do not bias results
    • The experiment is replicated on many subjects
      (units) to reduce the effect of chance variation
    • Easier to make the case for causation

    • effect of pesticide exposure on hatching of
    • comparison of two treatments for preventing
      perinatal transmission of HIV

            Example: control of variability by pairing

  Hypothesis: Lotions A and B equally effective at
  softening skin

 Design 1: Ignore pairing, randomly assign half of the
 hands to each lotion. What is the distribution of the
 sample mean difference in softness, if the “true”
 difference is 3?

Design 2: Randomly assign lotion to one hand within each
pair. What is the distribution of the sample mean
difference in softness, if the true difference is 3?
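
The two questions can be explored by simulation. This is a sketch under assumptions that are not in the slides: each person's two hands share a person-level softness (standard deviation 10), each hand adds small independent noise (standard deviation 1), lotion A adds exactly 3 units relative to lotion B, and there are 20 people. All constants and names are hypothetical.

```python
import random
import statistics

random.seed(511)   # arbitrary seed so the run is repeatable
TRUE_DIFF = 3.0    # assumed true softness advantage of lotion A over B
N_PEOPLE = 20      # assumed number of people (pairs of hands)
PERSON_SD = 10.0   # assumed person-to-person variability
HAND_SD = 1.0      # assumed hand-to-hand variability within a person

def simulate_once():
    """Generate one set of hands and return the estimate from each design."""
    pairs = []
    for _ in range(N_PEOPLE):
        person_effect = random.gauss(50, PERSON_SD)
        pairs.append([person_effect + random.gauss(0, HAND_SD) for _ in range(2)])

    # Design 1: ignore pairing, randomly assign half of all hands to each lotion.
    hands = [h for pair in pairs for h in pair]
    random.shuffle(hands)
    half = len(hands) // 2
    mean_a = statistics.mean(h + TRUE_DIFF for h in hands[:half])
    mean_b = statistics.mean(hands[half:])
    estimate_unpaired = mean_a - mean_b

    # Design 2: within each pair, randomly pick which hand gets lotion A.
    diffs = []
    for pair in pairs:
        hand_a, hand_b = random.sample(pair, 2)
        diffs.append((hand_a + TRUE_DIFF) - hand_b)
    estimate_paired = statistics.mean(diffs)

    return estimate_unpaired, estimate_paired

results = [simulate_once() for _ in range(2000)]
design1 = [r[0] for r in results]
design2 = [r[1] for r in results]

print("Design 1: mean", round(statistics.mean(design1), 2),
      "sd", round(statistics.stdev(design1), 2))
print("Design 2: mean", round(statistics.mean(design2), 2),
      "sd", round(statistics.stdev(design2), 2))
```

Under these assumptions both designs are centered near the true difference, but the paired design has a much narrower sampling distribution because the person-to-person variability cancels within each pair.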

    Observational Studies
    • Sources of variability (in the outcome) are not
      controlled by the researcher
    • Adjustment for imbalances between groups, if
      possible, occurs at the analysis phase
    • Randomization usually not an option; samples
      are assumed to be “representative”
    • Can identify association, but usually difficult to
      infer causation

   • natural history of HIV infection
   • study of partners of individuals with gonorrhea
   • condom use and STD prevention
   • association between chess playing and reading
     skill in elementary school children

                 Other Study Design Issues

     • Selection of controls
     • Hawthorne effect
     • Longitudinal vs Cross-sectional

            Longitudinal vs Cross-sectional Studies

       • Longitudinal studies are more expensive and
         involve additional analytical complications.
       • Longitudinal studies allow one to study
         changes over time in individuals and
         populations (similar to idea of pairing or

                  [Figure: three panels of reading ability (vertical axis)
                  versus age (horizontal axis).]

                  Hypothetical data on the relationship between
                  reading ability and age.

                   Populations vs Samples

  So far we haven’t thought very hard about where
  our data come from. However, in almost all cases
  there is an implicit assumption that the conclusions
  we draw from our data analysis apply to some
  larger group than just the individuals we measured.

  Population                        Sample
  •set of all “units”               •a subset of “units”
  •real or hypothetical             •estimates/statistics

    e.g. population - all US households with a
         TV (~95 million)
         sample - Nielsen sample (~5000)

        The objective of statistics is to make valid
        inferences about the population from the
        sample.

    [Figure: repeated samples of size n drawn from a
    population of X’s (true proportion = p).]
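
The idea of repeated samples of size n from a population with true proportion p can be sketched by simulation. The value p = 0.3, the two sample sizes, the number of repeats, and the function names are assumptions for illustration.

```python
import random
import statistics

random.seed(511)  # arbitrary seed so the run is repeatable
P_TRUE = 0.3      # assumed true population proportion

def sample_proportion(n):
    """Draw one simple random sample of size n and return the sample proportion."""
    return sum(random.random() < P_TRUE for _ in range(n)) / n

def summarize(n, repeats=2000):
    """Mean and standard deviation of the estimate over repeated samples."""
    estimates = [sample_proportion(n) for _ in range(repeats)]
    return statistics.mean(estimates), statistics.stdev(estimates)

mean_small, sd_small = summarize(25)
mean_large, sd_large = summarize(400)

print("n = 25 :", round(mean_small, 3), round(sd_small, 3))
print("n = 400:", round(mean_large, 3), round(sd_large, 3))
```

For both sample sizes the estimates center on the true proportion, while the spread from sample to sample is smaller for the larger n.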


    In making such inferences, there are two ways
    we can go wrong …
    Bias
   • Do I expect that, on average, the estimate from
     my sample will equal the parameter of the
     population of interest? If so, the estimate is
     unbiased.
        e.g. Ann Landers survey
             Pap smear study

   • In general, statistical methods do not
     correct for bias
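
A simulation sketch of why a larger sample does not remove bias. It assumes a voluntary-response survey in which people who would answer "yes" are three times as likely to respond as those who would answer "no"; the response rates, the true proportion, and the function name are all hypothetical.

```python
import random

random.seed(511)       # arbitrary seed so the run is repeatable
P_TRUE = 0.3           # assumed true proportion of "yes" in the population
RESPOND_IF_YES = 0.6   # assumed response rate among "yes" people
RESPOND_IF_NO = 0.2    # assumed response rate among "no" people

def voluntary_response_estimate(n_contacted):
    """Proportion of "yes" among those who choose to respond."""
    responses = []
    for _ in range(n_contacted):
        says_yes = random.random() < P_TRUE
        respond_prob = RESPOND_IF_YES if says_yes else RESPOND_IF_NO
        if random.random() < respond_prob:
            responses.append(says_yes)
    return sum(responses) / len(responses)

small = voluntary_response_estimate(1_000)
large = voluntary_response_estimate(100_000)

print("true proportion:", P_TRUE)
print("estimate, 1,000 contacted:  ", round(small, 3))
print("estimate, 100,000 contacted:", round(large, 3))
```

Under these assumptions the expected proportion among responders is 0.18 / (0.18 + 0.14) = 0.5625, so both estimates sit well above the true 0.3. Contacting more people makes the estimate more stable but leaves it just as wrong.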

    (Sampling) Variability
   • If I repeat an experiment (draw a new
     sample), I don’t expect to get exactly the same
     results. The sample estimates are variable.
   • The aim of experimental design and statistical
     analysis is to quantify/control effects of
     variability.

                 Types of samples in medical
                    studies - a hierarchy

            1) Probability samples (e.g. simple
               random sample, stratified samples,
               multistage samples)
            2) Representative samples (no obvious
               bias, but …)
            3) Convenience samples (biases likely …)
            4) Anecdotal, Case reports
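
Two of the probability samples at the top of the hierarchy can be sketched with the standard library. The sampling frame (1,000 units split 800/200 between two hypothetical clinics), the sample size, and the helper name `stratified_sample` are assumptions; the stratified version uses proportional allocation, which is only one of several possible allocation schemes.

```python
import random

random.seed(511)  # arbitrary seed so the run is repeatable

# Hypothetical sampling frame: (unit id, stratum label).
frame = ([(i, "clinic A") for i in range(800)]
         + [(i, "clinic B") for i in range(800, 1000)])

# Simple random sample: every subset of 50 units is equally likely.
srs = random.sample(frame, 50)

def stratified_sample(frame, total_n):
    """Proportional allocation: a simple random sample within each stratum."""
    strata = {}
    for unit in frame:
        strata.setdefault(unit[1], []).append(unit)
    chosen = []
    for units in strata.values():
        n_stratum = round(total_n * len(units) / len(frame))
        chosen.extend(random.sample(units, n_stratum))
    return chosen

stratified = stratified_sample(frame, 50)

n_b_srs = sum(1 for unit in srs if unit[1] == "clinic B")
n_b_stratified = sum(1 for unit in stratified if unit[1] == "clinic B")
print("clinic B units in SRS:", n_b_srs)
print("clinic B units in stratified sample:", n_b_stratified)
```

In the simple random sample the number of clinic B units varies from draw to draw, whereas the stratified sample fixes it in advance at its population share.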

            Problems in Design/Data Collection

     • 33% reduction in blood pressure after treatment
       with medication in a sample of 60 hypertensive
     • Daytime telephone interview of voting
     • Higher proportion of “abnormal” values on tests
       performed in 1990 than in a comparable sample
       taken in 1980.


            1. Statistics plays a role from study
               conception to study reporting.
            2. Statistics is concerned with making
               valid inferences about populations
               from samples that are subject to
               various sources of variability.
            3. Different studies require different
               statistical approaches. You must
               understand the study design and
               sampling procedures before you can
               hope to interpret the data!!

