# s3.amazonaws.comcramster-resource2429_n_12517.ppt

Document Sample

```					             Descriptive Statistics and
Exploratory Data Analysis -
Bivariate

• Quantitative (continuous) variables
• Scatterplots (two variables; use color or symbol to add 3rd variable)
• Starplots
• Correlation coefficient
• Qualitative (categorical) variables
• Contingency (two-way) tables
• Joint, marginal, conditional distribution
• Interaction

Fall 2002                        Biostat 511                               50
Scatterplot

A scatterplot offers a convenient way of
visualizing the relationship between pairs of
quantitative variables.

Many interesting features can be seen in a
scatterplot including the overall pattern (i.e.
linear, nonlinear, periodic), strength and direction
of the relationship, and outliers (values which are
far from the bulk of the data).

Fall 2002                Biostat 511                        51
Scatterplot showing nonlinear relationship

O

O
O

Scatterplot showing daily rainfall amount (mm)
at nearby stations in SW Australia. Note
outliers (O). Are they data errors … or
interesting science?!
Fall 2002                 Biostat 511                    52
Presentation matters!

Fall 2002         Biostat 511       53
- Important information can be seen in two
dimensions that isn’t obvious in one dimension

Fall 2002              Biostat 511                    54
Use symbols or colors to add a third
variable

Fall 2002               Biostat 511           55
Plots for Multivariate data

Star plots are used to display multivariate data

• Each ray corresponds to a variable
• Rays scaled from smallest to largest value in
dataset

Fall 2002                Biostat 511                    56
Correlation

How can we summarize the “strength of
association” between two variables in a
scatterplot?

Fall 2002              Biostat 511           57
Correlation

When two variables are measured on a scale in
which order is meaningful, you can calculate a
correlation coefficient that measures the strength
of the association between the two variables.

There are two common correlation measures:

1. Pearson Correlation Coefficient: Based on
the actual data values. Measure of linear
association. Natural when each variable has a
normal distribution.
2. Spearman Rank Correlation: Based on ranks
of each variable (ranks assigned separately).
Useful measure of the monotone association,
which may not be linear.

Fall 2002               Biostat 511                     58
Pearson’s Correlation Coefficient

The correlation between two variables X and Y is:

Properties:
• No distinction between x and y.
• The correlation is constrained: -1 £ R £ +1
• | R | = 1 means “perfect linear relationship”
• The correlation is a scale free measure
(correlation doesn’t change if there is a linear
change in units).
• Pearson’s correlation only measures strength of
linear relationship.
• Pearson’s correlation is sensitive to outliers.

Fall 2002                 Biostat 511                   59
Perfect positive correlation (R = 1)

Perfect negative correlation (R = -1)

Uncorrelated (R = 0) but dependent
Fall 2002               Biostat 511                 60
Fall 2002   Biostat 511   61
Pearson’s Correlation Coefficient

Correlation = .8776

Suppose we restrict the range of X …

Correlation = .5111

• relationship between LSAT and GPA
among law school students
• relationship between height and

Fall 2002                Biostat 511                 62
Spearman Rank Correlation

• A nonparametric analogue to Pearson’s
correlation coefficient is Spearman’s rank
correlation coefficient. Use Spearman’s
correlation when the assumption of
normality of X and Y is not met.
• A measure of monotonic association (not
necessarily linear)
• Based on the ranked data
• Rank each sample separately (1 … N)
• Compute Pearson’s correlation on the
ranks
• -1 < Rs < 1

Fall 2002                Biostat 511                   63
Two-way (Contingency) Tables

Now we turn our attention to relationships between
pairs of qualitative (categorical, discrete) measures.

Types of Categorical Data:
•Nominal
•Ordinal

Often we wish to assess whether two factors are
related. To do so we construct an R x C table that
cross-classifies the observations according to the
two factors. Such a table is called a two-way or
contingency table.

Fall 2002               Biostat 511                       64
Two-way tables

Example. Education versus willingness to
participate in a study of a vaccine to prevent HIV
infection if the study was to start tomorrow. Counts,
percents and row and column totals are given.

The table displays the joint distribution of
education and willingness to participate.

Fall 2002               Biostat 511                   65
Two-way tables

The marginal distributions of a two-way table
are simply the distributions of each measure
summed over the other.
E.g. Willingness to participate

Fall 2002               Biostat 511                66
Two-way tables

A conditional distribution is the distribution of
one measure conditional on (given the) value of
the other measure.
E.g. Willingness to participate among those with a
college education.

Fall 2002               Biostat 511                     67
Two-way tables

What proportion of individuals …
• will definitely participate?
• have less than college education?
• will probably or definitely participate given less
than college education?
• who will probably or definitely participate have
have less than college education?
• have a graduate/prof degree and will definitely
not participate?

Fall 2002                Biostat 511                    68
Three-way tables

There are two phenomena that can confuse our
interpretation of two-way tables. In each case a
third measure is involved.

Simpson’s Paradox - Also known as confounding
in the epidemiology literature. MM refer to this as
the “lurking variable” problem. Aggregating over a
third (lurking) variable results in incorrect
interpretation of the association between the two
primary variables of interest.

Interaction - Also known as effect modification in
the epidemiology literature. The degree of
association between the two primary variables
depends on a third variable.

Fall 2002               Biostat 511                     69

“Condom Use increases the risk of STD”

BUT ...

Explanation: Individuals with more partners are
more likely to use condoms. But individuals with
more partners are also more likely to get STD.

Fall 2002                 Biostat 511                  70
Interaction (aka Effect Modification)

Fall 2002                  Biostat 511              71
Summary

• Quantitative (continuous) variables
Scatterplots - display relationship between two
quantitative measures. Use colors or symbols to
Starplots - display multivariate data.
Correlation coefficient - summarizes the
strength of the linear (Pearson’s) or monotonic
(Spearman’s) relationship between two
quantitative measures.

• Qualitative (categorical) variables
Contingency table – shows the joint distribution
of the two variables, the marginal distributions of
each variable and the conditional distribution of
one variable for a fixed level of the other variable.
Simpson’s paradox and interactions can occur if
a third variable influences the association between
the two variables of interest.

Fall 2002                Biostat 511                     72
Guidelines for Tables and
Graphs

• Tables
• Good for showing exact values, small amounts of data
• Guidelines
• Graphs
•   Good for showing qualitative trends, large amounts of data
•   Guidelines for graphical integrity

Fall 2002                      Biostat 511                               73
Tables and Graphs

• Compact presentation of data
• Visual appeal; readers feel that they are
“seeing the data”
• Tables are better for showing exact
numerical values, small amounts of data
and/or multiple localized comparisons
• Graphs are better for highlighting
qualitative aspects of the data and
displaying large amounts of data.

Fall 2002                 Biostat 511                 74
Guidelines for Tables (Ehrenberg, 1977)

1. Give marginal averages to provide a visual
focus.
2. Order rows/columns by marginal averages or
some other measure of size.
3. Put groups to be compared in rows (i.e.
scanning down columns for comparisons)
4. Round to 2 effective digits
5. Use layout to facilitate comparisons
patterns and exception.
7. Clearly label rows and columns, give units,
source (if appropriate), title.

Fall 2002                Biostat 511                    75
Unemployment in Great Britain(source: Facts in
Focus, CSO, 1974).

Note use of marginal averages and rounding.
Table has been reordered so the reader can
scan down the column for a time trend.

Fall 2002              Biostat 511                 76
Fall 2002   Biostat 511   77
Statistical Graphics

“Modern data graphics can do much more
than simply substitute for small statistical
tables. At their best, graphics are instruments
Often the most effective way to describe,
explore, and summarize a set of numbers -
even a very large set - is to look at pictures of
those numbers.”

Edward R. Tufte
The Visual Display of Quantitative Information
Graphics Press, 1983

Fall 2002                 Biostat 511                     78
Graphical Integrity

1. The representation of numbers, as physically
measured on the surface of the graphic, should
be directly proportional to the numerical
2. Clear, detailed and thorough labeling should be
used to defeat graphical distortion and
ambiguity. Write out explanations of the data on
the graphic itself. Label important events in the
data. (e.g. Minard’s graphic)
3. Focus on the data, not the design and maximize
the data:ink ratio (counter e.g. USA Today)
4. The number of information-carrying (variable)
dimensions depicted should not exceed the
number of dimensions in the data (e.g. OPEC
Oil)
5. Do not quote data out of context (e.g. traffic
deaths)

Fall 2002                Biostat 511                      79
Fall 2002   Biostat 511   80
A less distorted view …

Fall 2002               Biostat 511   81
Fall 2002   Biostat 511   82
Data density - Compare ...

Fall 2002               Biostat 511      83
Fall 2002   Biostat 511   84
Fall 2002   Biostat 511   85
Fall 2002   Biostat 511   86
Summary

• Tables
• Good for showing exact values, small
amounts of data
• Guidelines

• Graphs
•     Good for showing qualitative trends, large
amounts of data
•     Guidelines for graphical integrity

Fall 2002                       Biostat 511                    87
Designing Studies

• Design issues
• Types of studies
a. Experimental studies - Control, randomization, replication
b. Observational
• Controls
• Blinding
• Hawthorne effect
• Longitudinal/cross-sectional
• Dropout
• Population vs Sample
• Bias
• Variability

Fall 2002                       Biostat 511                                 88
Experimental Design

“Obtaining valid results from a test program
calls for commitment to sound statistical
design. In fact, proper experimental design
is more important than sophisticated
statistical analysis. Results of a well-
planned experiment are often evident from
simple graphical analyses. However, the
world’s best statistical analysis cannot
rescue a poorly planned experiment.”

Gerald Hahn, Encyclopedia of Statistical Science,
page 359, entry for Design of Experiments

Fall 2002                   Biostat 511                    89
Types of Studies

Most scientific studies can be classified into one
1) Experimental Studies
The investigator deliberately sets one or
more factors to a specific level.
2) Observational Studies
The investigator collects data from an
existing situation and does not
(intentionally) interfere with the running of
the system.

Fall 2002                  Biostat 511                      90
Experimental Studies
• Sources of (major) variability are controlled
by the researcher
• Randomization is often used to ensure that
uncontrolled factors do not bias results
• The experiment is replicated on many subjects
(units) to reduce the effect of chance variation
• Easier to make the case for causation

Examples
• effect of pesticide exposure on hatching of
eggs
• comparison of two treatments for preventing
perinatal transmission of HIV

Fall 2002                Biostat 511                     91
Example: control of variability by
matching

Hypothesis: Lotions A and B equally effective at
softening skin

Fall 2002                Biostat 511                 92
Design 1: Ignore pairing, randomly assign half of the
hands to each lotion. What is the distribution of the
sample mean difference in softness, if the “true”
difference is 3?

Design 2: Randomly assign lotion to one hand within each
pair. What is the distribution of the sample mean
difference in softness, if the true difference is 3?

Fall 2002              Biostat 511                      93
Observational Studies
• Sources of variability (in the outcome) are not
controlled by the researcher
• Adjustment for imbalances between groups, if
possible, occurs at the analysis phase
• Randomization usually not an option; samples
are assumed to be “representative”
• Can identify association, but usually difficult to
infer causation

Examples
• natural history of HIV infection
• study of partners of individuals with gonorrhea
• condom use and STD prevention
• association between chess playing and reading
skill in elementary school children

Fall 2002                Biostat 511                       94
Other Study Design Issues

•Selection of controls
•Blinding
•Hawthorne effect
•Longitudinal vs Cross-sectional
•Dropouts

Fall 2002                Biostat 511         95
Longitudinal vs Cross-sectional Studies

• Longitudinal studies are more expensive and
• Longitudinal studies allow one to study
changes over time in individuals and
populations (similar to idea of pairing or
matching)

Fall 2002                  Biostat 511                  96

Age              Age               Age

Hypothetical data on the relationship between

Fall 2002                           Biostat 511                   97
Populations vs Samples

So far we haven’t thought very hard about where
our data come from. However, in almost all cases
there is an implicit assumption that the conclusions
we draw from our data analysis apply to some
larger group than just the individuals we measured.

Population                        Sample
•set of all “units”               •a subset of “units”
•real or hypothetical             •estimates/statistics
•parameters

e.g. population - all US households with a
TV(~95 million)
sample - Nielsen sample (~5000)

The objective of statistics is to make valid
inferences about the population from the
sample.

Fall 2002                  Biostat 511                      98
Fall 2002
sam
ple
of
siz
samp                   en
le of
size
n
sample of size n

n
of size
sample

Biostat 511
iz   en
Population of X’s

ofs
(true proportion = p)

le
samp

99
In making such inferences, there are two ways
we can go wrong …
Bias
• Do I expect that, on average, the estimate from
my sample will equal the parameter of the
population of interest? If so, the estimate is
unbiased.
e.g. Ann Landers survey
Pap smear study

• In general, statistical methods do not
correct for bias

(Sampling) Variability
•If I repeat an experiment (draw a new
sample), I don’t expect to get exactly the same
results. The sample estimates are variable.
•The aim of experimental design and statistical
analysis is to quantify/control effects of
variability.

Fall 2002                Biostat 511                   100
Fall 2002   Biostat 511   101
Types of samples in medical
studies - a hierarchy

1) Probability samples (e.g. simple
random sample, stratified samples,
multistage samples)
2) Representative samples (no obvious
bias, but …)
3) Convienence samples (biases likely …)
4) Anecdotal, Case reports

Fall 2002                   Biostat 511                102
Problems in Design/Data Collection

Example:
33% reduction in blood pressure after treatment
with medication in a sample of 60 hypertensive
men.
Problem:
Example:
Daytime telephone interview of voting
preferences
Problem:
Example:
Higher proportion of “abnormal” values on tests
performed in 1990 than a comparable sample
taken in 1980.
Problem:

Fall 2002                 Biostat 511                  103
Summary

1. Statistics plays a role from study
conception to study reporting.
2. Statistics is concerned with making
from samples that are subject to
various sources of variability.
3. Different studies require different
statistical approaches. You must
understand the study design and
sampling procedures before you can
hope to interpret the data!!

Fall 2002                    Biostat 511             104

```
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
 views: 0 posted: 7/20/2013 language: Latin pages: 55