Introduction to Biostatistics _ZJU_ 2008_ by chenmeixiu


          Introduction to Biostatistics (ZJU 2008)
             Wenjiang Fu, Ph.D
             Associate Professor
   Division of Biostatistics, Department of
                Epidemiology
          Michigan State University
     East Lansing, Michigan 48824, USA
            Email: fuw@msu.edu
     www: http://www.msu.edu/~fuw
                          Introduction
   Biostatistics? Why do we need to study biostatistics? A test for myself!

   Statistics: the data science that helps decipher data collected on many kinds of
    events, using probability theory and statistical principles with the help of
    computers.

   Statistics:    Theoretical
                  Applied: biostatistics, economics, finance, engineering, sports, ……
   Data:          Events: party, disease, accident, award, game …
                   Subjects: human, animal …
                   Characteristics: sex, race, age, weight, height …
                      Statistics
Most commonly, statistics refers to numerical or other data.
Statistics may also refer to the process of collecting, organizing,
presenting, analyzing and interpreting data for the purpose of making
inferences, decisions, and policy, and assisting scientific discovery.




[Diagram relating: population and parameter; sampling; sample and statistic; descriptive statistics; frequency; probability; inferential statistics (estimation, prediction, hypothesis testing)]
Grand challenges we are facing        …
[Diagram: “Data” → knowledge & information → decision, with statistics underlying each step]

    The 21st century will be the golden age
    of statistics!
     Grand challenges we are facing             …
1.   Data collection technology has advanced
     dramatically, but without sufficient statistical
     sampling design and experimental design.
2.   Advancement of technology for discovering and
     retrieving useful information has been lagging
     and has become the bottleneck.
3.   More sophisticated approaches are needed for
     decision making and risk management.
Statistical Challenges
  - Massive Amount of Data
Statistical Challenges – Image Data
        Statistics in Science
[Images: cosmic microwave background radiation; high energy physics; tick-by-tick stock data; genomic/proteomic data; fingerprints; microarray]
              What do we do?
   New ways of thinking and attacking problems
     Finding sub-optimal but computationally
      feasible solutions.
     New paradigm for new types of data
     Be satisfied with ‘very rough’
      approximations
     Turn research results into easy-to-use, publicly
      available software and programs
   Join forces with computer scientists.
Some ‘hot’ research directions

 Dimension reduction
 Visualization
 Dynamic systems
 Simulation and real time computation
 Uncertainty and risk management
 Interdisciplinary research
    Reasons to Study Biostatistics I
   Biostatistics is everywhere around us:
     Our life: entertainment, sports, shopping, parties,
      communication (cell phone), travel …
     Our work: career, business, school …
     Our health: food, weather, disease …
     Our environment: safety, security, chemicals, animals …
     Our well-being: physical examination, hospital, being
      happy, longevity.
    Reasons to Study Biostatistics I
   Entertainment (parties): music / dance / food
       Alcohol, cigarettes, drugs, etc.
   Sports
       Car racing, skiing (time to event: survival analysis).
   Shopping: different tastes / preferences
       Allergy to certain foods / smells: peanuts, flowers …
   Communication: cell phone use
       Potential hazard that may lead to health problems (CA …)
   Travel: infectious diseases, safety, accidents …
Reasons to Study Biostatistics II
   We care about our society, our families, our environment, our
    schools, scientific research …
   Major impact on society and communities.
       Disease transmission
       Healthcare benefit, health economics
       Quality of life (research, health improvement)
       Safety issue (outbreaks of diseases, etc.)
   Job market is very promising.
   Applications in a wide range of areas.
       Healthcare, quality of life,
       Career – job market: scientific, public or private, industrial …
Reasons to Study Biostatistics III
   Biostatistics research and applications
   Major employers in the US
        Research universities, Hospitals, Institutes (NIH),
        CDC, DoD, NASA, pharmaceutical industry,
         biotech industry, banks and other data warehouses …
   Major US universities with biostatistics
    departments
       Harvard U, U. Michigan, U. Washington (Seattle),
        UC (Berkeley, LA, SF), JHU, Yale U, Stanford U …
Reasons to Study Biostatistics IV
   New Biostatistics research areas (still growing)
   Medical research.
   Recent trend in employment
       Private industry: Google, Microsoft …
       Affymetrix, Illumina, Agilent, Golden Helix,
        23andMe …
       Investment: stock market, Capital One, Bank of America,
        Goldman Sachs, etc.


   Nano tech, green energy (alternative energy) …
Example 1. Medical study data:
          Ob/Gyn
   Modeling of PlGF: Placental Growth Factor
      Example 2. Genomics study
Single Nucleotide Polymorphism (SNP)
   Homologous pairs of chromosomes

   Paternal allele   ACGAACAGCT
                     TGCTTGTCGA
                                        SNP A/G
   Maternal allele   ACGAGCAGCT
                     TGCTCGTCGA
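
As a rough illustration of what locating a SNP means, the short Python sketch below compares the paternal and maternal strands from the slide base by base and reports where they differ. The function name `differing_sites` is made up for this example; real genotyping works from noisy measurements rather than clean strings.

```python
# Top strands of the two alleles, copied from the slide.
paternal = "ACGAACAGCT"
maternal = "ACGAGCAGCT"


def differing_sites(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b) for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


snps = differing_sites(paternal, maternal)
print(snps)  # [(5, 'A', 'G')]
```

The single mismatch at position 5 is the A/G SNP marked on the slide.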
Computational Genomics: SNP Genotype

Error rate: around 5%
Genome-wide association studies – millions of SNPs
                       Applications
   Genetic counseling:
       gene expression + family medical history → disease
       Breast cancer (BRCA) …
   Achieve accurate estimation and prediction
       Early detection / early treatment (cancer, …)
       Accurate diagnosis (HIV +)


   Help development of new drugs for treatment.
   Help protect the environment, live longer and happier,
    improve quality of life.
            Did I pass my test?
   I hope I have convinced you to study
    biostatistics.
    Chapter 2. Descriptive Statistics
   The first important thing to do is to visualize the data.
   Plot of data
       Scatter plot – pair-wise (var 1 vs. var 2)
Scatter plot
            Descriptive Statistics
   Summarize data using statistics
     Central location (mean, median)
     Range (min, max)
     Variability (variance, standard deviation)
     Mode
     Quantiles (percentiles)


   Rank the data, but avoid long listings (use grouping
    instead)
   Measure of Location

Mean

The mean is the sum of all the observations divided by the number
of observations.

     Population mean:
       $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$,   where N = the number of observations in the population.

     Sample mean:
       $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$,   where n = the number of observations in the sample.
   Properties of the mean

The mean is the most widely used measure of location
and has the following properties :
     $\sum_{i=1}^{N} (x_i - \mu) = \sum_{i=1}^{n} (x_i - \bar{x}) = 0$

     $y_i = a x_i + b,\ i = 1, \dots, n \ \Rightarrow\ \bar{y} = a\bar{x} + b$

 The mean is oversensitive to extreme values in the
 sample.
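
These properties can be checked numerically. The Python sketch below uses a small set of illustrative values; the constants `a` and `b` and the extreme value 200 are arbitrary choices for the demonstration, not part of the lecture.

```python
# Illustrative sample; any numbers would do.
x = [18, 20, 22, 25, 24]
n = len(x)
x_bar = sum(x) / n

# Property 1: the deviations from the mean sum to zero.
deviation_sum = sum(xi - x_bar for xi in x)

# Property 2: if y = a*x + b, then the mean of y is a*mean(x) + b.
a, b = 1.8, 32  # arbitrary constants for the demonstration
y = [a * xi + b for xi in x]
y_bar = sum(y) / n

# Sensitivity: a single extreme value (200, chosen arbitrarily) shifts the mean a lot.
x_extreme = x + [200]
x_extreme_bar = sum(x_extreme) / len(x_extreme)

print(x_bar, deviation_sum, y_bar, a * x_bar + b, x_extreme_bar)
```

Up to floating-point rounding, `deviation_sum` is zero and `y_bar` equals `a * x_bar + b`, while adding one extreme value moves the mean from 21.8 to 51.5.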
Translation of data
          Measure of Location
Median and Mode

 The median is the “middle” value of a sample
 when the observations are arranged in ascending order.
  Median = The [(n+1)/2]th largest observation if n is odd.
         = The average of the (n/2)th and (n/2+1)th
           largest observations if n is even.


 The mode is the most frequently occurring value among all
 the observations in a sample. It is the most probable value
 that would be obtained if one data point is selected at
 random from a population.
Example: Median and Mode

Calculate the median and mode of the following data:

            12, 24, 36, 25, 17, 19, 24, 11

    Sorted data : 11, 12, 17, 19, 24, 24, 25, 36


     Median = (19 + 24) / 2 = 21.5,      Mode = 24
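
The same example can be reproduced with Python's standard `statistics` module. This is only a quick check of the hand calculation; `multimode` is used because it returns every most frequent value, which matters if a sample has ties.

```python
from statistics import median, multimode

data = [12, 24, 36, 25, 17, 19, 24, 11]

med = median(data)       # average of the two middle values, since n = 8 is even
modes = multimode(data)  # list of all most frequent values

print(sorted(data))  # [11, 12, 17, 19, 24, 24, 25, 36]
print(med)           # 21.5
print(modes)         # [24]
```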
The mean is influenced by outliers
while the median is not.

 The mode is very unstable.
 Minor fluctuations in the data
 can change it substantially;
 for this reason it is seldom
 calculated.

 [Figure: a bimodal distribution, with its two modes marked]
 [Figure: relative order of the mean, median, and mode in skewed (≤ ≤) and symmetric (= =) distributions]
Symmetry and Skewness in Distribution

  When the left and right halves of a distribution are mirror
  images of each other, the distribution is symmetrical. Examples of
  symmetrical distributions are shown below:




  A skewed distribution is a distribution that is not symmetrical.
  Examples of skewed distributions are shown below:




             Positively skewed          Negatively skewed
       Measure of Dispersion

Range and Mean Absolute Deviation (MAD)


 The Range is the simplest measure of dispersion. It is
 simply the difference between the largest and smallest
 observations in a sample.
        $\text{Range} = x_{\max} - x_{\min}$
 The mean absolute deviation is the average of the
 absolute values of the deviations of individual
 observations from the mean.
                $\text{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$
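
A minimal Python sketch of both measures, applied to the eight values from the median and mode example earlier in these notes (the choice of data set is ours, for illustration only):

```python
data = [12, 24, 36, 25, 17, 19, 24, 11]
n = len(data)

# Range: largest observation minus smallest observation.
data_range = max(data) - min(data)

# Mean absolute deviation: average distance of the observations from the mean.
x_bar = sum(data) / n
mad = sum(abs(x - x_bar) for x in data) / n

print(data_range)  # 25
print(x_bar)       # 21.0
print(mad)         # 6.25
```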
                Measure of Dispersion

 Quantiles or Percentiles



A quantile (percentile) is the general term for a value at or
below which a stated proportion (p/100) of the data in a
distribution lies.

 Quartiles: p = 25, 50, 75
 Quantile / percentile: p is any value between 0 and 100
Calculating Quantiles or Percentiles

Let [k] denote the largest integer less than or equal to k.
For example, [3]=3, [4.7]=4.

The p-th percentile is defined as follows:

    • Find k = np/100.

    • If k is an integer, the p-th percentile is the mean of
      the k-th and (k+1)-th observations (in the ascending
      sorted order).

    • If k is NOT an integer, the p-th percentile is the
      ([k]+1)-th observation.
Example
Calculate the 10th percentile and the 75th percentile
of the following data:
         7, 12, 16, 2, 8, 4, 20, 14, 19, 17


 Sorted data : 2, 4, 7, 8, 12, 14, 16, 17, 19, 20
                     (n = 10)
10th percentile: k = np/100 = 10×10/100 = 1
     Average of 1st and 2nd observations = (2+4)/2 = 3

75th percentile: k = np/100 = 10×75/100 = 7.5
     [7.5]+1 = 7+1 = 8th observation = 17
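
The rule above translates directly into a few lines of Python. The function name `percentile` and the restriction to 0 < p < 100 are choices made for this sketch; statistical libraries often use other interpolation conventions, so their results may differ slightly from this rule.

```python
import math


def percentile(data, p):
    """p-th percentile by the rule in these notes (assumes 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    xs = sorted(data)
    k = len(xs) * p / 100
    if k.is_integer():
        k = int(k)
        # Mean of the k-th and (k+1)-th observations (1-based positions).
        return (xs[k - 1] + xs[k]) / 2
    # ([k]+1)-th observation; in 0-based indexing that is index [k].
    return xs[math.floor(k)]


data = [7, 12, 16, 2, 8, 4, 20, 14, 19, 17]
p10 = percentile(data, 10)
p75 = percentile(data, 75)
print(p10)  # 3.0
print(p75)  # 17
```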
        Measure of Dispersion
    Variance and Standard Deviation

The variance is a measure of how spread out a distribution
is. It is computed as the average squared deviation of each
observation from the mean; for a sample, the sum of squared
deviations is divided by n − 1 rather than n. The standard
deviation is the square root of the variance. It is the most
commonly used measure of spread.

       sample variance              $s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

   sample standard deviation        $s_x = \sqrt{s_x^2}$

        $y_i = a x_i + b,\ i = 1, \dots, n \ \Rightarrow\ s_y^2 = a^2 s_x^2,\quad s_y = |a|\, s_x$
Example
 Five people have their body mass index (BMI) calculated as

          [body weight (kg)] / [height (m)]^2
          18, 20, 22, 25, 24

    $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{109}{5} = 21.8$

    $s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{32.8}{5 - 1} = 8.2$

    $s_x = \sqrt{8.2} = 2.86$
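
The BMI calculation can be checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n − 1 divisor of the sample formulas:

```python
from statistics import mean, stdev, variance

bmi = [18, 20, 22, 25, 24]

x_bar = mean(bmi)    # sample mean
s2 = variance(bmi)   # sample variance (divides by n - 1)
s = stdev(bmi)       # sample standard deviation

print(round(x_bar, 2), round(s2, 2), round(s, 2))  # 21.8 8.2 2.86
```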
   Relative Dispersion – Coefficient of Variation


A direct comparison of two or more measures of dispersion
may be difficult because of difference in their means.
A relative dispersion is the amount of variability in a
distribution relative to a reference point or benchmark.
A common measure of relative dispersion is the coefficient of
variation (CV).
                      $CV = 100\% \times \frac{s_x}{\bar{x}}$
This measure stays the same regardless of the units used,
as long as the change of units is a pure rescaling. Very useful!
Good example: weight, kg versus lb.
Bad example: temperature, °C versus °F (the conversion also adds a constant).
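
The two examples can be illustrated with a short Python sketch. The weights and temperatures below are made-up values, and the helper name `cv` is ours; the point is only that multiplying by a constant leaves the CV unchanged, while a conversion that also adds a constant does not.

```python
from statistics import mean, stdev


def cv(values):
    """Coefficient of variation, in percent."""
    return 100 * stdev(values) / mean(values)


# Hypothetical weights: kg -> lb is a pure rescaling (multiply by about 2.20462).
weights_kg = [55.0, 62.5, 70.0, 81.0, 90.5]
weights_lb = [w * 2.20462 for w in weights_kg]

# Hypothetical temperatures: C -> F is a rescaling plus a shift (1.8 * C + 32).
temps_c = [35.5, 36.6, 37.0, 38.2, 39.1]
temps_f = [1.8 * t + 32 for t in temps_c]

print(cv(weights_kg), cv(weights_lb))  # equal up to rounding
print(cv(temps_c), cv(temps_f))        # clearly different
```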
          Frequency Distribution

A long list of raw data can be confusing; the data should be
grouped into intervals of moderate width rather than
listed as raw data points.




                              Hospital Length of Stay (LOS)
__________________________________________________________________________________________
81   44   29   23   16   13   12   11   11   64   43   28   22   16   13   12   11   11   12   12
63   43   28   21   16   13   12   11   11   58   42   28   21   15   13   12   11   10   11   10
98   58   42   28   20   15   13   12   11   10   93   56   36   28   20   15   12   12   11   10
86   55   36   27   19   15   12   12   11   10   83   50   32   27   18   14   12   12   11   10
83   50   32   26   27   14   12   12   11   10   81   48   30   23   17   14
A summary table works better
  than raw data.
   Interval   Frequency   Relative Frequency
   [Table: ten LOS intervals with their frequencies and relative frequencies; values not recoverable]
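
One way such a summary table could be built from the raw LOS data is sketched below in Python. The interval width of 10 (0 < LOS ≤ 10, 10 < LOS ≤ 20, and so on) is an assumption made for this sketch, not necessarily the grouping used in the lecture.

```python
from collections import Counter

# Hospital length-of-stay values, copied from the raw data listing.
los = [
    81, 44, 29, 23, 16, 13, 12, 11, 11, 64, 43, 28, 22, 16, 13, 12, 11, 11, 12, 12,
    63, 43, 28, 21, 16, 13, 12, 11, 11, 58, 42, 28, 21, 15, 13, 12, 11, 10, 11, 10,
    98, 58, 42, 28, 20, 15, 13, 12, 11, 10, 93, 56, 36, 28, 20, 15, 12, 12, 11, 10,
    86, 55, 36, 27, 19, 15, 12, 12, 11, 10, 83, 50, 32, 27, 18, 14, 12, 12, 11, 10,
    83, 50, 32, 26, 27, 14, 12, 12, 11, 10, 81, 48, 30, 23, 17, 14,
]

WIDTH = 10  # assumed interval width


def interval_index(x, width=WIDTH):
    """Index of the interval (low, high] containing the positive integer x."""
    return (x - 1) // width


counts = Counter(interval_index(x) for x in los)
n = len(los)

# One row per interval: (low, high, frequency, relative frequency).
table = []
for idx in range(max(counts) + 1):
    freq = counts.get(idx, 0)
    table.append((idx * WIDTH, (idx + 1) * WIDTH, freq, freq / n))

for low, high, freq, rel in table:
    print(f"{low:>3} < LOS <= {high:<3}  {freq:>3}  {rel:.3f}")
```

The frequencies add up to the number of observations and the relative frequencies add up to 1, which is a useful check on any grouping.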
                     Graphic Methods
Bar Graph
A bar graph is simply a bar chart of data that has been
classified into a frequency distribution. The attractive feature
of a bar graph is that it allows us to quickly see where
most of the observations are concentrated.



       [Table and bar graph: frequency for each of the ten LOS intervals; values not recoverable]
          Graphic Methods


Histogram
A histogram provides a distribution plot in which the bars are not
necessarily of the same width. The area of each bar is
proportional to the percentage of data points that fall within it,
so the height of a bar represents the density of the data.
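
To make the area rule concrete, the Python sketch below computes bar heights for bins of unequal width, using the ten values from the percentile example. The bin edges are arbitrary choices for the illustration; each height is the relative frequency divided by the bin width, so the bar areas add up to 1.

```python
data = [2, 4, 7, 8, 12, 14, 16, 17, 19, 20]
edges = [0, 5, 10, 20]  # assumed bin edges: [0, 5), [5, 10), [10, 20]

n = len(data)
last_bin = len(edges) - 2

# One entry per bar: (low, high, relative frequency, height).
bars = []
for i, (low, high) in enumerate(zip(edges, edges[1:])):
    # Bins are closed on the left; only the last bin includes its right edge.
    count = sum(
        1 for x in data
        if low <= x < high or (i == last_bin and x == high)
    )
    rel_freq = count / n
    height = rel_freq / (high - low)  # density = relative frequency / width
    bars.append((low, high, rel_freq, height))

total_area = sum(height * (high - low) for low, high, _, height in bars)

for low, high, rel_freq, height in bars:
    print(f"[{low}, {high}): relative frequency {rel_freq:.2f}, height {height:.3f}")
```

The widest bin holds 60% of the data but its bar is only 0.06 high, because that 60% is spread over a width of 10.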
             Graphic Methods
Box Plot
 The box plot is a summary plot based on the median and
 the interquartile range (IQR); the box spans the IQR and so contains 50% of the
 values. Whiskers extend from the box to the highest and
 lowest values, excluding outliers. A line across the box
 indicates the median.

             $\text{IQR} = Q_3 - Q_1$
             $\text{MIN} = Q_1 - 1.5 \times \text{IQR}, \quad \text{MAX} = Q_3 + 1.5 \times \text{IQR}$

 [Figure: box plot with the MIN and MAX limits marked]
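
The quantities behind a box plot can be computed by hand, as in the Python sketch below. It reuses the percentile rule from earlier in these notes for Q1, the median, and Q3; the data are the ten values from the percentile example plus one made-up extreme value (60), added only so that an outlier appears.

```python
import math


def percentile(data, p):
    """p-th percentile by the rule in these notes (assumes 0 < p < 100)."""
    xs = sorted(data)
    k = len(xs) * p / 100
    if k.is_integer():
        k = int(k)
        return (xs[k - 1] + xs[k]) / 2
    return xs[math.floor(k)]


# Percentile-example data plus a hypothetical extreme value of 60.
data = [7, 12, 16, 2, 8, 4, 20, 14, 19, 17, 60]

q1 = percentile(data, 25)
med = percentile(data, 50)
q3 = percentile(data, 75)
iqr = q3 - q1

# Limits beyond which observations are treated as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

inside = [x for x in data if lower_limit <= x <= upper_limit]
outliers = sorted(x for x in data if x < lower_limit or x > upper_limit)

# Whiskers reach the most extreme observations that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)

print(q1, med, q3, iqr)           # 7 14 19 12
print(lower_limit, upper_limit)   # -11.0 37.0
print(whisker_low, whisker_high)  # 2 20
print(outliers)                   # [60]
```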

								