Graphical display of numerical variables by alicejenny

In the name of God, the Most Gracious, the Most Merciful

www.biostat.ir
Biostatistics Academic Preview
     Descriptive Statistics




              What Is Statistics?
   Statistics is the science of describing or making
    inferences about the world from a sample of
    data.
   Descriptive statistics are numerical and graphical summaries
    that organize, summarize and present the data.
   Inferential statistics is the process of inferring
    from a sample to the population.



Statistics has two major chapters:




   Descriptive Statistics

   Inferential statistics



              Two types of Statistics
   Descriptive statistics
       Used to summarize, organize and simplify data
          What was the average height score?
          What was the highest and lowest score?
          What is the most common response to a question?

   Inferential statistics
       Techniques that allow us to study samples and then
        make generalizations about the populations from
        which they were selected
          Are 5th grade boys taller than 5th grade girls?
          Is a treatment suitable?
  Population and Samples

The population under study is the set of all
individuals of interest for the research.


The part of the population for which we collect
measurements is called the sample.

The number of individuals in a sample is
denoted by n.


Variables



                            Definitions

   Variable: a characteristic that changes or
    varies over time and/or across the different subjects
    under consideration.

       Changing over time
            Blood pressure, height, weight



       Changing across a population
            gender, race

           Types of variables

                      Data

                    Variables

     Quantitative                           Qualitative
      (numeric)                            (categorical)

Discrete     Continuous            Nominal            Ordinal




             Types of variables :
                Definitions
   Quantitative variables (numeric): measure a
    numerical quantity or amount on each
    experimental unit

   Qualitative variables (categorical): measure a
    non-numeric quality or characteristic on each
    experimental unit by classifying each subject
    into a category
               Types of variables :
              Quantitative variables
   Discrete variables: can only take values from a
    list of possible values
       Number of tooth brushings per day


   Continuous variables: can assume the infinitely
    many values corresponding to the points on a
    line interval
       weight, height

              Types of variables :
              Categorical variables

   Nominal: unordered categories
     Race
     Gender



   Ordinal: ordered categories
     Likert scales (disagree, neutral, agree)
     Income categories



             Types of Variables
   A discrete variable has gaps between its values.
    For example, the number of tooth brushings per day is a
    discrete variable.

   A continuous variable has no gaps between its
    values. All values or fractions of values have
    meaning. Age is an example of a continuous
    variable.
         Levels of Measurement
   The level of measurement reflects the type of information
    measured and helps determine which descriptive statistics
    and which statistical tests can be used.




   Four Levels of Measurement

Nominal:   lowest level; categories with no rank
Ordinal:   second lowest; ranked categories
Interval:  next to highest; ranked categories with
           known units between rankings
Ratio:     highest level; ranked categories with
           known intervals and an absolute zero
           Scales of Measurement
   Temperature                                Interval
   Men/Women                                  Nominal
   Good/Better/Best                           Ordinal
   Weight                                     Ratio
   Republicans/Democrats/ Independents        Nominal
   Volume                                     Ratio
   IQ                                         Interval
   Not at all/A little/A lot                  Ordinal




                Descriptive Statistics

   Qualitative variables
       Summary measures: frequency, relative frequency, percentage
       Displays: tables, pie charts, bar graphs

   Quantitative variables
       Summary measures: measures of central tendency, measures of spread,
        five-number summary
       Displays: tables, histograms, box plots, bar charts, line charts
          Descriptive Measures

   Central Tendency measures. They are
  computed in order to give a “center” around
  which the measurements in the data are
  distributed.
 Relative Standing measures. They
  describe the relative position of a specific
  measurement in the data.
   Variation or Variability measures.
    They describe “data spread” or how far away
    the measurements are from the center.

       Measures of Central Tendency

   Mean:
    Sum of all measurements in the data divided by the
    number of measurements.

   Median:
    A number such that at most half of the measurements
    are below it and at most half of the measurements are
    above it.

   Mode:
    The most frequent measurement in the data.

                        Summary Statistics:
               Measures of central tendency (location)

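              Mean: The mean of a data set is the sum of the
               observations divided by the number of observations
                  Population mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
                  Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
                  (n is the number of observations in the population or
                   in the sample, respectively)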

              Median: The median of a data set is the “middle value”
                  For an odd number of observations, the median is the
                   observation exactly in the middle of the ordered list
                  For an even number of observations, the median is the mean
                   of the two middle observations in the ordered list

              Mode: The mode is the single most frequently
               occurring data value
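
The three measures above can be sketched with Python's standard `statistics` module. The data values are hypothetical and serve only as an illustration.

```python
from statistics import mean, median, mode

# Hypothetical sample: number of tooth brushings per day for 7 people
brushings = [1, 2, 2, 3, 2, 1, 3]

sample_mean = mean(brushings)      # sum of observations / number of observations
sample_median = median(brushings)  # middle value of the ordered list
sample_mode = mode(brushings)      # most frequently occurring value

# Even number of observations: the median is the mean of the two middle values
even_median = median([1, 2, 3, 4])

print(sample_mean, sample_median, sample_mode, even_median)
```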
                          Skewness
The skewness of a distribution can be judged by comparing the relative
positions of the mean, median and mode.
       Distribution is symmetrical
          Mean = Median = Mode

      Distribution skewed right
          Median lies between mode and mean, and mode is
           less than mean

      Distribution skewed left
          Median lies between mode and mean, and mode is
           greater than mean

    Relative positions of the mean and median for (a)
    right-skewed, (b) symmetric, and
    (c) left-skewed distributions




Note: The mean is sensitive to skewness and outliers. If the distribution is markedly skewed, it is
better to report the median as the measure of location.
Frequency Distributions and Histograms




  Histograms for symmetric and skewed distributions.


           Normal curves
same mean but different standard deviation




                 Further Notes

   When the Mean is greater than the Median the data
    distribution is skewed to the Right.

   When the Median is greater than the Mean the data
    distribution is skewed to the Left.

   When Mean and Median are very close to each other
    the data distribution is approximately symmetric.
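
A rough sketch of these rules of thumb follows. The function name `skew_direction` and the `tolerance` threshold (how close "very close" is, relative to the range) are assumptions made for illustration, not part of the original slides, and the data are hypothetical.

```python
from statistics import mean, median

def skew_direction(values, tolerance=0.05):
    """Compare mean and median; `tolerance` is a hypothetical cutoff relative to the range."""
    m, med = mean(values), median(values)
    spread = max(values) - min(values)
    if spread == 0 or abs(m - med) <= tolerance * spread:
        return "approximately symmetric"
    return "skewed right" if m > med else "skewed left"

# Hypothetical data: one large value pulls the mean above the median
right = skew_direction([20, 22, 25, 27, 30, 120])
# Hypothetical data: one small value pulls the mean below the median
left = skew_direction([1, 90, 95, 97, 98, 100])
symmetric = skew_direction([1, 2, 3, 4, 5])

print(right, left, symmetric, sep="\n")
```

This is only a heuristic: the comparison of mean and median suggests a direction of skew but does not measure it.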




                  Summary statistics
               Measures of spread (scale)
   Variance: the sum of the squared deviations of each
    sample value from the sample mean, divided by n-1
    rather than by the sample size n.

                      $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

   Standard deviation: the square root of the sample
    variance.

                      $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$




   Range: the difference between the maximum and
    minimum values in the sample.
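
A minimal sketch of the three measures of spread, written directly from the definitions (division by n-1) and checked against the standard `statistics` module. The weights are hypothetical values.

```python
from math import sqrt
from statistics import stdev, variance

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

weights = [60.0, 62.0, 65.0, 70.0, 73.0]  # hypothetical weights in kg

s2 = sample_variance(weights)             # sample variance
s = sqrt(s2)                              # sample standard deviation
data_range = max(weights) - min(weights)  # maximum minus minimum

print(s2, s, data_range)
```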
Summary statistics: measures of spread (scale)
   We can describe the spread of a distribution by using percentiles.

   The pth percentile of a distribution is the value such that p
    percent of the observations fall at or below it.
        Median=50th percentile

   Quartiles divide data into four equal parts.
        First quartile—Q1
             25% of observations are below Q1 and 75% above Q1
        Second quartile—Q2
             50% of observations are below Q2 and 50% above Q2
        Third quartile—Q3
             75% of observations are below Q3 and 25% above Q3


           Quartiles


      Q1            Q2                  Q3


25%        25%                    25%        25%




           Five-number summary
   Minimum
   Lower quartile Q1 = 25th percentile
   Median = 50th percentile
   Upper quartile Q3 = 75th percentile
   Maximum
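
The five values can be computed as in the sketch below, which uses `statistics.quantiles` from the Python standard library (Python 3.8 or later). Several conventions exist for interpolating quartiles; the `"inclusive"` method is chosen here as an assumption, and other software may give slightly different Q1 and Q3. The ages are hypothetical.

```python
from statistics import quantiles

def five_number_summary(xs):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, q2, q3, max(xs)

ages = [21, 23, 25, 27, 29, 31, 33, 35, 37]  # hypothetical ages in years
summary = five_number_summary(ages)
print(summary)
```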




Graphical display of numerical variables
              (histogram)

Class Interval     Frequency
20-under 30             6
30-under 40            18
40-under 50            11
50-under 60            11
60-under 70             3
70-under 80             1

Histogram: frequency (0 to 20) on the vertical axis, years (0 to 80) on the horizontal axis.
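
A frequency table of this kind can be built from raw data as sketched below. The raw ages are hypothetical (the original observations are not given), and the helper name `class_frequencies` is an assumption. Each class includes its lower limit and excludes its upper limit, matching the "20-under 30" convention.

```python
from collections import Counter

def class_frequencies(values, start, width, n_classes):
    """Count values in the intervals [start + k*width, start + (k+1)*width)."""
    counts = Counter()
    for v in values:
        k = int((v - start) // width)  # index of the class containing v
        if 0 <= k < n_classes:
            counts[k] += 1
    return [(start + k * width, start + (k + 1) * width, counts[k])
            for k in range(n_classes)]

ages = [22, 25, 31, 34, 38, 39, 41, 47, 52, 58, 63, 71]  # hypothetical ages
table = class_frequencies(ages, start=20, width=10, n_classes=6)

# Print the table with a simple text histogram
for lo, hi, freq in table:
    print(f"{lo}-under {hi}: {freq:2d} {'*' * freq}")
```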


Frequency Distributions and Histograms




A histogram of the compressive strength data with 17 bins.
   Frequency Distributions and Histograms




A histogram of the compressive strength data with nine bins.
Frequency Distributions and Histograms




      Histogram of compressive strength data.
Graphical display of numerical variables
               (box plot)


                Median




 Minimum   Q1    Q2                Q3   Maximum




Graphical display of numerical variables
               (box plot)
          S<0      S=0              S>0




   Negatively    Symmetric           Positively
    Skewed      (Not Skewed)          Skewed

            Univariate statistics
           (categorical variables)
   Summary measures
     Count = frequency
     Percent = 100 × frequency / total sample size



   The distribution of a categorical variable lists the
    categories and gives either a count or a percent
    of individuals who fall in each category

    Displaying categorical variables
Rank    Cause of death                              Frequency    Percent
1       Heart disease                                 710,760      43%
2       Cancer                                        553,091      33%
3       Stroke                                        167,661      10%
4       CLRD (chronic lower respiratory disease)      122,009       7%
5       Accidents                                      97,900       6%
Total   All five causes                             1,651,421

Pie chart and bar chart (percent, 0 to 60) of the five causes: heart, cancer, stroke, CLRD, accident.
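
The percent for each category is its count divided by the total. A short sketch using the counts from this slide (the variable names are illustrative):

```python
# Counts for the five leading causes of death, taken from the table
deaths = {
    "Heart disease": 710_760,
    "Cancer": 553_091,
    "Stroke": 167_661,
    "CLRD": 122_009,
    "Accidents": 97_900,
}

total = sum(deaths.values())
# Percent = 100 * frequency / total
percents = {cause: 100 * count / total for cause, count in deaths.items()}

for cause, pct in percents.items():
    print(f"{cause:15s} {deaths[cause]:>9,d} {pct:5.1f}%")
```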

Response and explanatory variables
   Response variable: the variable which we intend
    to model.
       we intend to explain through statistical modeling


   Explanatory variable: the variable or variables
    which may be used to model the response
    variable
       values may be related to the response variable

             Bivariate relationships
   An extension of univariate descriptive statistics

   Used to detect evidence of association in the
    sample
       Two variables are said to be associated if the
        distribution of one variable differs across groups or
        values defined by the other variable



            Bivariate Relationships
   Two quantitative variables
       Scatter plot
       Side by side stem and leaf plots

   Two qualitative variables
       Tables
       Bar charts

   One quantitative and one qualitative variable
       Side by side box plots
       Bar chart
         Two quantitative variables
               Correlation
          A relationship between two variables.

     Explanatory                               Response
(Independent)Variable                     (Dependent)Variable
         x                                         y
Hours of Training                         Number of Accidents
Shoe Size                                 Height
Cigarettes smoked per day                 Lung Capacity

Height                                     IQ

 What type of relationship exists between the two variables
            and is the correlation significant?
  Scatter Plots and Types of Correlation

  Scatter plot: x = hours of training (0 to 20), y = number of accidents (0 to 60)

Negative correlation: as x increases, y decreases
 Scatter Plots and Types of Correlation
  Scatter plot: x = Math SAT score (300 to 800), y = GPA (1.50 to 4.00)

Positive correlation: as x increases, y increases
     Scatter Plots and Types of Correlation
  Scatter plot: x = height (60 to 80), y = IQ (80 to 160)

                No linear correlation
            Correlation Coefficient
A measure of the strength and direction of a linear relationship
                    between two variables

   $r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$

                      The range of r is from -1 to 1.

If r is close to -1, there is a strong negative correlation.
If r is close to 0, there is no linear correlation.
If r is close to 1, there is a strong positive correlation.
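
The computational formula for r can be sketched as follows. The function name `pearson_r` and the data (hours of training and number of accidents) are hypothetical, chosen only to illustrate a strong negative correlation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient from the sums of x, y, xy, x^2 and y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

hours = [2, 4, 6, 8, 10]          # hypothetical hours of training
accidents = [50, 42, 30, 21, 10]  # hypothetical number of accidents

r = pearson_r(hours, accidents)
print(round(r, 3))
```

The function is undefined (division by zero) when either variable is constant, which this sketch does not guard against.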

 Positive and negative correlation
1 If two variables x and y are positively correlated this means that:
    large values of x are associated with large values of y, and

    small values of x are associated with small values of y


2 If two variables x and y are negatively correlated this means
  that:
    large values of x are associated with small values of y, and

    small values of x are associated with large values of y




Positive correlation




Negative correlation




            Two qualitative variables
                (Contingency Tables)

   Categorical data are usually displayed using a
    contingency table, which shows the frequency of
    each combination of categories observed in the
    data
       The rows correspond to the categories of the
        explanatory variable

       The columns correspond to the categories of the
        response variable
                       Example
   Aspirin and Heart Attacks
       Explanatory variable = drug received
          Placebo
          Aspirin

       Response variable = heart attack status
          Yes
          No
           Contingency table:
          heart attack example
           Heart Attack   No Heart Attack   Total
Aspirin        104            10,933        11,037
Placebo        189            10,845        11,034
Total          293            21,778        22,071
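
One way to look for association in this table is to compare the proportion of heart attacks in each row. The sketch below uses the counts from the table; the dictionary layout and variable names are illustrative assumptions.

```python
# Rows: explanatory variable (drug); columns: response (heart attack status)
table = {
    "Aspirin": {"Heart attack": 104, "No heart attack": 10_933},
    "Placebo": {"Heart attack": 189, "No heart attack": 10_845},
}

row_totals = {drug: sum(cells.values()) for drug, cells in table.items()}
# Proportion of subjects in each group who had a heart attack
risk = {drug: cells["Heart attack"] / row_totals[drug]
        for drug, cells in table.items()}

for drug, p in risk.items():
    print(f"{drug}: {100 * p:.2f}% had a heart attack")
```

A difference between the row proportions is descriptive evidence of association in the sample; whether it is statistically significant is a question for inferential statistics.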


                    Two qualitative variables
  Marijuana Use in College: x=parental use, y=student use

Student use (rows) by parental use (columns)

              Both    Neither   One     Total
Never          17       141      68      226
Occasional     11        54      44      109
Regular        19        40      51      110
Total          47       235     163      445

Bar chart (0 to 60): student use (never, occasional, regular) grouped by parental use (both, neither, one).
          One quantitative, One qualitative
         Box plot of age (vertical axis, 10 to 50) by low birth weight (lbw = 0, 1)

         Mean age by low birth weight: yes = 22.31, no = 23.66
         Trivariate Relationships
   An extension of bivariate descriptive statistics

   We focus on description that helps us decide
    about the role variables might play in the
    ultimate statistical analyses

   Identify variables that can increase the precision
    of the data analysis used to assess associations
    between two other variables
Confounding and effect modification
   A factor, Z, is said to confound a relationship between
    a risk factor, X, and an outcome, Y, if it is not an effect
    modifier and the unadjusted strength of the relationship
    between X and Y differs from the common strength of
    the relationship between X and Y for each level of Z.

   A factor, Z, is said to be an effect modifier of a
    relationship between a risk factor, X, and an outcome
    measure, Y, if the strength of the relationship between
    the risk factor, X, and the outcome, Y, varies among
    the levels of Z.
          Example: confounding
   In our low birth weight data suppose we wish to
    investigate the association between race and low
    birth weight.

   Our ability to detect this association might be
    affected by:
     Smoking status being associated with low birth
      weight
     Smoking status being associated with race



              Multiple Models
   Allow one to calculate the association between
    an explanatory variable and an outcome of interest, after
    controlling for potential confounders.

   Allow one to assess the association between
    an outcome and multiple explanatory variables of
    interest.
              Time Sequence Plots

• A time series or time sequence is a data set in
which the observations are recorded in the order in
which they occur.
• A time series plot is a graph in which the vertical
axis denotes the observed value of the variable (say x)
and the horizontal axis denotes the time (which could
be minutes, days, years, etc.).
• When measurements are plotted as a time series, we
often see
   •trends,
   •cycles, or
   •other broad features of the data
      Time Sequence Plots




Company sales by year (a) and by quarter (b).




  Tests comparing difference between 2 or more groups

Test                            Dependent variable                     Independent variable
Paired (dependent) t-test       Interval/ratio (pre and post tests)    Nominal
Unpaired (independent) t-test   Interval/ratio                         Nominal (2 groups)
ANOVA F-test                    Interval/ratio                         Nominal (>2 groups)
Chi-square (nonparametric)      Nominal (dichotomous)                  Nominal
Tests demonstrating association
      between two groups

Test                             Dependent var.      Independent var.
Spearman rho                     Ordinal             Ordinal
Mann-Whitney U (nonparametric)   Ordinal             Nominal
Pearson's r                      Interval/ratio      Interval/ratio
 Tests demonstrating association
between two groups, controlling for
          third variable
Test                  Dependent            Independent
Logistic regression   Nominal              Nominal
Linear regression     Interval/ratio       Interval/ratio
Pearson partial r     Interval/ratio       Interval/ratio
Kendall's partial r   Ordinal              Ordinal

