# Graphical display of numerical variables by alicejenny

```
In the name of God, the Most Gracious, the Most Merciful

www.biostat.ir
Descriptive Statistics

What Is Statistics?
   Statistics is the science of describing or making
inferences about the world from a sample of
data.
   Descriptive statistics are numerical estimates
that organize and sum up or present the data.
   Inferential statistics is the process of inferring
from a sample to the population.

Statistics has two major chapters:

   Descriptive Statistics

   Inferential statistics

Two types of Statistics
   Descriptive statistics
   Used to summarize, organize and simplify data
     • What was the average score?
     • What was the highest and lowest score?
     • What is the most common response to a question?

   Inferential statistics
   Techniques that allow us to study samples and then
make generalizations about the populations from
which they were selected
     • Is a treatment suitable?

Population and Samples

The population under study is the set of all
individuals of interest for the research.

The part of the population for which we collect
measurements is called the sample.

The number of individuals in a sample is
denoted by n.

Variables

Definitions

   Variable: a characteristic that changes or
varies over time and/or across the different subjects under
consideration.

   Changing over time
   Blood pressure, height, weight

   Changing across a population
   gender, race

Types of variables

Data

Variables

Quantitative                           Qualitative
(numeric)                            (categorical)

Discrete     Continuous            Nominal            Ordinal

Types of variables :
Definitions
   Quantitative variables (numeric): measure a
numerical quantity or amount on each
experimental unit

   Qualitative variables (categorical): measure a
non-numeric quality or characteristic on each
experimental unit by classifying each subject
into a category

Types of variables :
Quantitative variables
   Discrete variables: can only take values from a
list of possible values
   Number of tooth brushings per day

   Continuous variables: can assume the infinitely
many values corresponding to the points on a
line interval
   weight, height

Types of variables :
Categorical variables

   Nominal: unordered categories
     • Race
     • Gender

   Ordinal: ordered categories
     • Likert scales (disagree, neutral, agree)
     • Income categories

Types of Variables
   A discrete variable has gaps between its values.
For example, the number of tooth brushings per day is a
discrete variable.

   A continuous variable has no gaps between its
values. All values or fractions of values have
meaning. Age is an example of a continuous
variable.

Levels of Measurement
   Reflects type of information measured and helps
determine what descriptive statistics and which
statistical test can be used.

Four Levels of Measurement

Nominal    lowest level: categories, no rank
Ordinal    second lowest: ranked categories
Interval   next to highest: ranked categories with
           known units between rankings
Ratio      highest level: ranked categories with
           known intervals and an absolute zero

Scales of Measurement
   Temperature                                Interval
   Men/Women                                  Nominal
   Good/Better/Best                           Ordinal
   Weight                                     Ratio
   Republicans/Democrats/ Independents        Nominal
   Volume                                     Ratio
   IQ                                         Interval
   Not at all/A little/A lot                  Ordinal

Descriptive Statistics

Qualitative                  Quantitative

Frequency                    Measures of Central Tendency
Percentage                   Five number system

Tables                       Tables
Pie Charts                   Histograms
Bar Graphs                   Box plots
                             Bar charts
                             Line charts

Descriptive Measures

   Central Tendency measures. They are
computed in order to give a “center” around
which the measurements in the data are
distributed.
   Relative Standing measures. They
describe the relative position of a specific
measurement in the data.
   Variation or Variability measures.
They describe “data spread” or how far away
the measurements are from the center.

Measures of Central Tendency

   Mean:
Sum of all measurements in the data divided by the
number of measurements.

   Median:
A number such that at most half of the measurements
are below it and at most half of the measurements are
above it.

   Mode:
The most frequent measurement in the data.

Summary Statistics:
Measures of central tendency (location)

   Mean: The mean of a data set is the sum of the
observations divided by the number of observations
   Population mean:  \mu = \frac{1}{n} \sum_{i=1}^{n} x_i
   Sample mean:      \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

   Median: The median of a data set is the “middle value”
   For an odd number of observations, the median is the
observation exactly in the middle of the ordered list
   For an even number of observations, the median is the mean
of the two middle observations in the ordered list

   Mode: The mode is the single most frequently
occurring data value

Skewness
The  skewness of a distribution is measured by comparing the relative
positions of the mean, median and mode.
       Distribution is symmetrical
   Mean = Median = Mode

      Distribution skewed right
   Median lies between mode and mean, and mode is
less than mean

      Distribution skewed left
   Median lies between mode and mean, and mode is
greater than mean

Relative positions of the mean and median for (a)
right-skewed, (b) symmetric, and
(c) left-skewed distributions

Note: The mean is most representative when the data are roughly symmetric (for example, normally
distributed). If this is not the case it is better to report the median as the measure of location.

Frequency Distributions and Histograms

Histograms for symmetric and skewed distributions.

Normal curves
same mean but different standard deviation

Further Notes

   When the Mean is greater than the Median the data
distribution is skewed to the Right.

   When the Median is greater than the Mean the data
distribution is skewed to the Left.

   When Mean and Median are very close to each other
the data distribution is approximately symmetric.

Summary statistics
   Variance: The average of the squared deviations of each
sample value from the sample mean, except that instead
of dividing the sum of the squared deviations by the
sample size n, the sum is divided by n-1.
   s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

   Standard deviation: The square root of the sample
variance
   s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

   Range: the difference between the maximum and
minimum values in the sample.

Summary statistics: measures of spread (scale)
   We can describe the spread of a distribution by using percentiles.

   The pth percentile of a distribution is the value such that p
percent of the observations fall at or below it.
   Median=50th percentile

   Quartiles divide data into four equal parts.
   First quartile—Q1
   25% of observations are below Q1 and 75% above Q1
   Second quartile—Q2
   50% of observations are below Q2 and 50% above Q2
   Third quartile—Q3
   75% of observations are below Q3 and 25% above Q3

Quartiles

Q1            Q2                  Q3

25%        25%                    25%        25%

Five number system
   Maximum
   Minimum
   Median=50th percentile
   Lower quartile Q1=25th percentile
   Upper quartile Q3=75th percentile

Graphical display of numerical variables
(histogram)

Class Interval    Frequency
20-under 30           6
30-under 40          18
40-under 50          11
50-under 60          11
60-under 70           3
70-under 80           1

[Histogram of the table above: Years on the horizontal axis (0 to 80), Frequency on the vertical axis (0 to 20)]

Frequency Distributions and Histograms

A histogram of the compressive strength data with 17 bins.

Frequency Distributions and Histograms

A histogram of the compressive strength data with nine bins.

Frequency Distributions and Histograms

Histogram of compressive strength data.

Graphical display of numerical variables
(box plot)

Median

Minimum   Q1    Q2                Q3   Maximum

Graphical display of numerical variables
(box plot)
S<0      S=0              S>0

Negatively    Symmetric           Positively
Skewed      (Not Skewed)          Skewed

Univariate statistics
(categorical variables)
   Summary measures
     • Count = frequency
     • Percent = 100 × frequency / total sample size

   The distribution of a categorical variable lists the
categories and gives either a count or a percent
of individuals who fall in each category

Displaying categorical variables

Rank    Cause of Death      Frequency    (%)
1       Heart Disease         710,760    (43%)
2       Cancer                553,091    (33%)
3       Stroke                167,661    (11%)
4       CLRD                  122,009    ( 7%)
5       Accidents              97,900    ( 6%)
Total   All five causes     1,651,421

[Charts of the five causes (heart, cancer, stroke, CLRD, accident); the bar chart's vertical axis runs from 0 to 60]

Response and explanatory variables
   Response variable: the variable which we intend
to model.
   The variable we intend to explain through statistical modeling

   Explanatory variable: the variable or variables
which may be used to model the response
variable
   Its values may be related to the response variable

Bivariate relationships
   An extension of univariate descriptive statistics

   Used to detect evidence of association in the
sample
   Two variables are said to be associated if the
distribution of one variable differs across groups or
values defined by the other variable

Bivariate Relationships
   Two quantitative variables
   Scatter plot
   Side by side stem and leaf plots

   Two qualitative variables
   Tables
   Bar charts

   One quantitative and one qualitative variable
   Side by side box plots
   Bar chart

Two quantitative variables
Correlation
A relationship between two variables.

Explanatory                               Response
(Independent)Variable                     (Dependent)Variable
x                                         y
Hours of Training                         Number of Accidents
Shoe Size                                 Height
Cigarettes smoked per day                 Lung Capacity

Height                                     IQ

What type of relationship exists between the two variables
and is the correlation significant?

Scatter Plots and Types of Correlation

[Scatter plot: x = hours of training (horizontal axis, 0 to 20), y = number of accidents (vertical axis, 0 to 60)]

Negative correlation: as x increases, y decreases

Scatter Plots and Types of Correlation
[Scatter plot: x = Math SAT score (horizontal axis, 300 to 800), y = GPA (vertical axis, 1.50 to 4.00)]

Positive correlation: as x increases, y increases

Scatter Plots and Types of Correlation
[Scatter plot: x = height (horizontal axis, 60 to 80), y = IQ (vertical axis, 80 to 160)]

No linear correlation

Correlation Coefficient
A measure of the strength and direction of a linear relationship
between two variables

r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - (\sum x)^2} \sqrt{n \sum y^2 - (\sum y)^2}}

The range of r is from -1 to 1.

   If r is close to -1, there is a strong negative correlation
   If r is close to 0, there is no linear correlation
   If r is close to 1, there is a strong positive correlation

Positive and negative correlation
1. If two variables x and y are positively correlated, this means that:
     • large values of x are associated with large values of y, and
     • small values of x are associated with small values of y

2. If two variables x and y are negatively correlated, this means that:
     • large values of x are associated with small values of y, and
     • small values of x are associated with large values of y

Positive correlation

Negative correlation

Two qualitative variables
(Contingency Tables)

   Categorical data is usually displayed using a
contingency table, which shows the frequency of
each combination of categories observed in the
data
   The rows correspond to the categories of the
explanatory variable

   The columns correspond to the categories of the
response variable

Example
   Aspirin and Heart Attacks
   Explanatory variable = treatment
     • placebo
     • aspirin

   Response variable = heart attack status
     • yes
     • no

Contingency table:
heart attack example

           Heart Attack    No Heart Attack    Total
Aspirin         104             10,933        11,037
Placebo         189             10,845        11,034
Total           293             21,778        22,071

Two qualitative variables
Marijuana Use in College: x=parental use, y=student use

                        Parental use
Student use    Both    Neither    One    Total
Never            17        141     68      226
Occasional       11         54     44      109
Regular          19         40     51      110
Total            47        235    163      445

[Bar chart: student use (Never, Occasional, Regular) within each parental-use group (Both, Neither, One); vertical axis from 0 to 60]

One quantitative, One qualitative
[Box plot of age by low birth weight: age on the vertical axis (0 to 50), lbw (0, 1) on the horizontal axis]

[Mean age by low birth weight (yes, no): the two group means shown are 23.66 and 22.31]

Trivariate Relationships
   An extension of bivariate descriptive statistics

   We focus on description that helps us decide
about the role variables might play in the
ultimate statistical analyses

   Identify variables that can increase the precision
of the data analysis used to assess associations
between two other variables

Confounding and effect modification
   A factor, Z, is said to confound a relationship between
a risk factor, X, and an outcome, Y, if it is not an effect
modifier and the unadjusted strength of the relationship
between X and Y differs from the common strength of
the relationship between X and Y for each level of Z.

   A factor, Z, is said to be an effect modifier of a
relationship between a risk factor, X, and an outcome
measure, Y, if the strength of the relationship between
the risk factor, X, and the outcome, Y, varies among
the levels of Z.

Example: confounding
   In our low birth weight data suppose we wish to
investigate the association between race and low
birth weight.

   Our ability to detect this association might be
affected by:
     • Smoking status being associated with low
birth weight
     • Smoking status being associated with race

Multiple Models
   Allows one to calculate the association between
an explanatory variable and an outcome of interest, after
controlling for potential confounders.

   Allows one to assess the association between
an outcome and multiple explanatory variables of
interest.

Time Sequence Plots

• A time series or time sequence is a data set in
which the observations are recorded in the order in
which they occur.
• A time series plot is a graph in which the vertical
axis denotes the observed value of the variable (say x)
and the horizontal axis denotes the time (which could
be minutes, days, years, etc.).
• When measurements are plotted as a time series, we
often see
• trends,
• cycles, or
• other broad features of the data

Time Sequence Plots

Company sales by year (a) and by quarter (b).

Tests comparing difference between 2 or more groups

Test                             Dependent variable                     Independent variable
Paired (dependent t-test)        Interval/ratio (pre and post tests)    Nominal
Unpaired (independent t-test)    Interval/ratio                         Nominal (2 groups)
ANOVA F-test                     Interval/ratio                         Nominal (>2 groups)
Chi-Square (nonparametric)       Nominal (dichotomous)                  Nominal

Tests demonstrating association
between two groups

Test                               Dependent var.    Independent var.
Spearman rho                       Ordinal           Ordinal
Mann-Whitney U (non-parametric)    Ordinal           Nominal
Pearson’s r                        Interval/ratio    Interval/ratio

Tests demonstrating association
between two groups, controlling for
third variable
Test                   Dependent         Independent
Logistic regression    Nominal           Nominal
Linear regression      Interval/ratio    Interval/ratio
Pearson partial r      Interval/ratio    Interval/ratio
Kendall’s partial r    Ordinal           Ordinal


```
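The summary measures defined in the slides (mean, median, mode, sample variance with the n-1 divisor, standard deviation, range and quartiles) can be sketched in Python as follows. This is a minimal illustration and not part of the original deck: the data set is hypothetical, and the quartiles use the `inclusive` method of the standard-library `statistics.quantiles` function (Python 3.8+), which is only one of several common quartile conventions.

```python
import statistics

# Hypothetical sample (not from the slides): number of tooth
# brushings per day reported by nine people.
data = [1, 2, 2, 2, 3, 3, 4, 5, 14]

n = len(data)

# Measures of central tendency
mean = sum(data) / n                 # sum of observations / number of observations
median = statistics.median(data)     # middle value of the ordered list
mode = statistics.mode(data)         # most frequent value

# Measures of spread
# Sample variance: squared deviations from the mean, divided by n - 1
variance = sum((x - mean) ** 2 for x in data) / (n - 1)
std_dev = variance ** 0.5            # square root of the sample variance
data_range = max(data) - min(data)   # maximum minus minimum

# Quartiles Q1, Q2 (median) and Q3; the "inclusive" method is an
# assumption of this sketch, other conventions give slightly
# different values on small samples.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print("mean:", mean, "median:", median, "mode:", mode)
print("variance:", variance, "standard deviation:", round(std_dev, 3))
print("range:", data_range)
print("minimum, Q1, median, Q3, maximum:", min(data), q1, q2, q3, max(data))
```

In this made-up sample the single large value (14) pulls the mean above the median, which is the pattern the slides associate with a right-skewed distribution and the reason the median is often the better measure of location for such data.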
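The correlation coefficient from the slides, r = (nΣxy − ΣxΣy) / (√(nΣx² − (Σx)²) · √(nΣy² − (Σy)²)), can be computed directly from the five sums it contains. The sketch below is illustrative only: the function name `pearson_r` and the hours/accidents values are hypothetical and were chosen to mimic the negative relationship in the training example, not taken from the deck.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation via the computational formula with raw sums."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        # One of the variables is constant, so r is undefined.
        raise ValueError("correlation is undefined for a constant variable")
    return numerator / denominator


# Hypothetical data: x = hours of training, y = number of accidents.
hours = [2, 4, 6, 8, 10]
accidents = [50, 42, 35, 20, 12]

r = pearson_r(hours, accidents)
print("r =", round(r, 3))  # close to -1: strong negative linear correlation
```

A value near -1 or 1 only describes the strength of a linear relationship in the sample; whether the correlation is statistically significant is a separate question that needs an inferential test.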