									        Biostat 200

Introduction to Biostatistics
Lecture 1
           Course instructors
 Judy Hahn, M.A., Ph.D.
     Judy.hahn@ucsf.edu
     (415) 206-4435
 TAs
  Michelle Odden, Ph.D., M.S.
  Megumi Okumura, M.D.
  Maya Vijayaraghavan, M.D.
  Robin Wallace, M.D.
                 The details

 Lectures: Tuesdays 10:30-12:30

 Labs: Thursdays 10:30-12
     Lab 1: Room CB 6702

     Lab 2: Room CB 6704

 Office hours: Thursdays 12-1, Room CB 5715

 Course credits: 3
                  The details

 Readings

     Required readings will be from Principles of
      Biostatistics by M. Pagano and K.
      Gauvreau. Duxbury. 2nd edition.

     Please read the assigned chapters before
      lecture, and review them after lecture
                  The details
 Assignments will be posted on Thursdays
 and are due 1.5 weeks later, on Sunday
 at 5 p.m.
     Data collection (Assignment 1 only)
     Data analysis and interpretation
     Exercises in the book
     Reading and interpretation of scientific
      publications
 You must attend Lab 1 to receive
 Assignment 1
                  The details
 Grading:
     Homework (75%)
      • 5 Assignments
      • Varying in length; each homework problem is
        worth a set number of points (usually 10)
        toward the final homework score
     Final exam (25%)
     LATE ASSIGNMENTS WILL NOT BE
      ACCEPTED!!!
               Assignments
 Send to your TAs

    Lab 1: Megan Okumura, Robin Wallace
  ticr.biostat200.1@gmail.com

    Lab 2: Michelle Odden, Maya Vijayaraghavan
  ticr.biostat200.2@gmail.com
What I do and why
                 Course goals
   Familiarity with basic biostatistics terms and
    nomenclature
   Ability to summarize data and do basic statistical
    analyses using STATA
   Ability to understand basic statistical analyses in
    published journals
   Understanding of key concepts including
    statistical hypothesis testing – critical
    quantitative thinking
   Foundation for more advanced analyses
               Today’s topics
 Variables: numerical versus categorical
 Tables (frequencies)
 Graphs (histograms, box plots, scatter
  plots, line graphs)
 Required reading: Pagano Chapter 2
                                 Types of data
     Data are made up of a set of
      variables
     Categorical variables: any variable
      that is not numerical (values have no
      numerical meaning) (e.g. gender,
      race, drug, disease status)
             Nominal variables
             Ordinal variables

Pagano and Gauvreau, Chapter 2
                                 Types of data
     Categorical variables
             Nominal variables:
                • The data are unordered (e.g. RACE: 1=Caucasian,
                  2=Asian American, 3=African American)
                • A subset of these are binary (dichotomous)
                  variables, which have only two categories
                  (e.g. GENDER: 1=male, 2=female)
             Ordinal variables:
                • The data are ordered (e.g. AGE: 1=10-19 years,
                  2=20-29 years, 3=30-39 years; likelihood of
                  participating in a vaccine trial)

                                 Types of data
       Numerical (quantitative) variables: naturally
        measured as numbers for which meaningful
        arithmetic operations make sense (e.g. height,
        weight, age, salary, viral load, CD4 cell counts)

             Discrete variables: can be counted (e.g. number of
              children in household: 0, 1, 2, 3, etc.)

             Continuous variables: can take any value within a
              given range (e.g. weight: 2974.5 g, 3012.6 g)

                                 Types of data
    Manipulation of variables
      Continuous variables can be discretized

             • E.g., age can be rounded to whole numbers
          Continuous or discrete variables can be
           categorized
             • E.g., age categories
          Categorical variables can be re-categorized
             • E.g., lumping from 5 categories down to 2




                            Frequency tables
      Categorical variables are summarized by
            Frequency counts – how many are in each category
            Relative frequency or percent (a number from 0 to 100)
            Or proportion (a number from 0 to 1)

                           Gender of new HIV clinic patients, 2006-2007,
                           Mbarara, Uganda.
                                                        n (%)
                           Male                         415 (39)
                           Female                       645 (61)
                           Total                        1060 (100)
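
The three summaries above (count, percent, proportion) are simple to compute from the category counts. The following is a minimal Python sketch rather than course material (the course itself uses Stata); the function name `frequency_table` is an illustrative choice, and the counts are the ones from the gender table above.

```python
def frequency_table(counts):
    """Return {category: (count, proportion, percent)} for a dict of counts."""
    total = sum(counts.values())
    table = {}
    for category, n in counts.items():
        proportion = n / total        # a number from 0 to 1
        percent = 100 * proportion    # a number from 0 to 100
        table[category] = (n, proportion, percent)
    return table


# Counts taken from the gender table above.
gender = frequency_table({"Male": 415, "Female": 645})
for category, (n, proportion, percent) in gender.items():
    print(f"{category:<8}{n:>6} ({percent:.0f}%)  proportion={proportion:.2f}")
```

The proportions always sum to 1 and the percents to 100, which is a quick check that no category was dropped.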

                            Frequency tables
     Continuous variables can be categorized in
      meaningful ways
     Choice of cutpoints
             Even intervals
             Meaningful cutpoints related to a health
              outcome or decision
             Equal percentage of the data falling into each
              category


                            Frequency tables
                 CD4 cell counts (cells/mm3) of newly diagnosed HIV
                 positives at Mulago Hospital, Kampala (N=268)
                                                    n (%)
                 ≤50                                40 (14.9)
                 51-200                             72 (26.9)
                 201-350                            58 (21.6)
                 >350                               98 (36.6)
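
Categorizing a continuous variable comes down to comparing each value against a list of cutpoints. The sketch below is in Python rather than the course's Stata and assumes non-overlapping categories (50 or less, 51 to 200, 201 to 350, above 350). The names `CUTPOINTS`, `LABELS` and `cd4_category` are illustrative, and the example CD4 values are made up, not taken from the Mulago data.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the first three categories; values above 350 fall in the last.
CUTPOINTS = [50, 200, 350]
LABELS = ["<=50", "51-200", "201-350", ">350"]


def cd4_category(cd4):
    """Map a CD4 count to its category label."""
    # bisect_left counts the cutpoints strictly below cd4, so a value
    # equal to a cutpoint stays in the lower category.
    return LABELS[bisect_left(CUTPOINTS, cd4)]


# Hypothetical CD4 counts, made up for illustration only.
example_counts = [12, 50, 51, 180, 200, 201, 350, 351, 900]
frequencies = Counter(cd4_category(v) for v in example_counts)
for label in LABELS:
    n = frequencies[label]
    print(f"{label:<8}{n:>3} ({100 * n / len(example_counts):.1f}%)")
```

Writing the boundaries down explicitly makes it clear which category a value such as exactly 50 or exactly 350 belongs to.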




                                              Bar charts
     General graph for categorical variables
     Graphical equivalent of a frequency table
     The x-axis does not have to be numerical
     [Figure: bar chart, "Alcohol consumption in Mulago Hospital patients enrolling in VCT study, n=929"; y-axis: Proportion (0 to 0.5); x-axis categories: Never, >1 year ago, Within the past year]
                                         Histograms
         Bar chart for numerical data – The number of
          bins and the bin width will make a difference in
          the appearance of this plot and may affect
          interpretation
         [Figure: histogram, "CD4 among new HIV positives at Mulago"; x-axis: CD4 cell count (0 to 1500); y-axis: percent (0 to 15)]

         Stata command:
            histogram cd4count, fcolor(blue) lcolor(black) ///
                width(50) name(cd4_by50) ///
                title("CD4 among new HIV positives at Mulago") ///
                xtitle("CD4 cell count") percent
                                       Histograms
       This histogram has less detail but gives us
        the % of persons with CD4 <350 cells/mm3
       [Figure: histogram, "CD4 among new HIV positives at Mulago"; x-axis: CD4 cell count (0 to 1500); y-axis: percent (0 to 60)]

       Stata command:
          histogram cd4count, fcolor(blue) lcolor(black) ///
              width(350) name(cd4_by350) ///
              title("CD4 among new HIV positives at Mulago") ///
              xtitle("CD4 cell count") percent
      What does this graph tell us?

      [Figure: histogram, "Days drank alcohol among current drinkers"; x-axis: Days (0 to 30); y-axis: Relative freq (0 to .25)]
                                             Box plots
   Middle line = median (50th percentile)
   Middle box = 25th to 75th percentiles (interquartile range, IQR)
   Bottom whisker: lowest data point at or above the 25th percentile - 1.5*IQR
   Top whisker: highest data point at or below the 75th percentile + 1.5*IQR

   [Figure: box plot; y-axis: Days drank alcohol (0 to 30)]
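
The box plot elements described above can be computed directly. This is a minimal Python sketch rather than the course's Stata, under stated assumptions: quartiles are computed with the "inclusive" method of Python's `statistics.quantiles` (statistical packages, Stata included, may use a slightly different percentile rule), the function name `box_plot_summary` is illustrative, and the `days` values are made up.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Compute the quantities a box plot displays."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points that are still inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "bottom_whisker": min(inside),
        "top_whisker": max(inside),
        "outside": [x for x in data if x < lower_fence or x > upper_fence],
    }


# Hypothetical "days drank alcohol" values, made up for illustration only.
days = [0, 1, 2, 2, 3, 4, 5, 6, 30]
summary = box_plot_summary(days)
print(summary)
```

In this made-up example the value 30 lies beyond the upper fence, so it would be drawn as a separate point rather than being reached by the top whisker.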
                                               Box plots
       [Figure: box plot, "CD4 count among new HIV positives at Mulago"; y-axis: cd4count (0 to 1,500)]

       Stata command:
          graph box cd4count, box(1, fcolor(blue) lcolor(black) fintensity(inten100)) ///
              title("CD4 count among new HIV positives at Mulago")
                         Box plots by another variable
                      We can divide up our graphs by another variable
                      What type of variable is gender?

                      [Figure: side-by-side box plots for male and female; y-axis: Days drank alcohol (0 to 30); Graphs by a1. sex]
Histograms by another variable
  [Figure: histograms in two panels, male and female; x-axis: Days consumed alcohol of prior 30 (0 to 30); y-axis: 0 to .3; Graphs by a1. sex]
        Numerical variable summaries
     Mode – the value (or range of values) that
      occurs most frequently
     Sometimes there is more than one mode,
      e.g. a bi-modal distribution (both modes do
      not have to be the same height)
     The mode only makes sense when the
      values are discrete, rounded off, or binned
     [Figure: histogram of Grades; x-axis: Grades (62 to 97); y-axis: f (0 to 30)]
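
To illustrate why the mode needs discrete, rounded or binned values, here is a minimal Python sketch (the course itself uses Stata). It relies on `statistics.multimode`, which returns every value tied for most frequent. The grades are made up, and the weights after the first two are also hypothetical.

```python
from statistics import multimode

# Hypothetical grades, made up for illustration only.
grades = [62, 67, 72, 72, 72, 77, 82, 87, 87, 87, 92, 97]
grade_modes = multimode(grades)  # two values tie, so the data are bimodal
print(grade_modes)

# Continuous measurements rarely repeat, so every raw value is "a mode".
weights = [2974.5, 3012.6, 3049.0, 3251.3]  # grams; last two are hypothetical
raw_modes = multimode(weights)

# Rounding to the nearest 100 g (a simple form of binning) gives a usable mode.
binned_modes = multimode([round(w, -2) for w in weights])
print(raw_modes, binned_modes)
```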
Pagano and Gauvreau, Chapter 3
                                                   Scatter plots
                 [Figure: scatter plot, "CD4 cell count versus age"; x-axis: a4. how old are you? (10 to 60); y-axis: CD4 cell count (0 to 1500)]
The importance of good graphs

http://niemann.blogs.nytimes.com/2009/09/14/good-night-and-tough-luck/
       Numerical variable summaries
     Measures of central tendency – where is
      the center of the data?
           Median: the 50th percentile, i.e. the middle value of the sorted data
              • If n is odd: the median is the ((n+1)/2)th observation
                (e.g. if n=31 then the median is the 16th
                observation)
              • If n is even: the median is the average of the two
                middle observations (e.g. if n=30 then the median is
                the average of the 15th and 16th observations)
           Median CD4 cell count in previous data
            set = 234.5
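
The odd and even cases above translate directly into code. This is a minimal Python sketch rather than the course's Stata; the function name `median` is illustrative, and the example inputs are simply the integers 1 to 31 and 1 to 30, chosen to mirror the n=31 and n=30 examples.

```python
def median(values):
    """Return the median of a non-empty collection of numbers."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    middle = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)th observation, which is index n // 2 (0-based).
        return data[middle]
    # n even: average of the (n / 2)th and (n / 2 + 1)th observations.
    return (data[middle - 1] + data[middle]) / 2


odd_example = median(range(1, 32))   # n = 31, so the 16th observation
even_example = median(range(1, 31))  # n = 30, average of the 15th and 16th
print(odd_example, even_example)
```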

        Numerical variable summaries
       Range
             Minimum to maximum or difference (e.g. age
              range 15-58 or range=43)
                • CD4 cell count range: (0-1368)
       Interquartile range (IQR)
             25th and 75th percentiles (e.g. IQR for age: 23-
              36) or difference (e.g. 13)
             Less sensitive to extreme values
                • CD4 cell count IQR: (92-422)
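
The following minimal Python sketch (the course itself uses Stata) computes the range and the IQR and shows how each reacts to one extreme value. The ages are hypothetical, chosen only so that they reproduce the age range (15-58) and IQR (23-36) quoted above. The quartile rule is the "inclusive" method of `statistics.quantiles`, which is an assumption; other software may use a slightly different rule.

```python
from statistics import quantiles


def range_and_iqr(values):
    """Return the minimum, maximum, range, quartiles and IQR of the data."""
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "max": data[-1],
        "range": data[-1] - data[0],
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


# Hypothetical ages, made up to match the summary values on the slide.
ages = [15, 20, 23, 27, 31, 34, 36, 45, 58]
summary = range_and_iqr(ages)

# Add one extreme (hypothetical) age and compare how much each measure moves.
with_extreme = range_and_iqr(ages + [95])
print(summary)
print(with_extreme)
```

In this example the extra value widens the range from 43 to 80 but moves the IQR only from 13 to 18.75, which is what "less sensitive to extreme values" means in practice.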




       Numerical variable summaries
     Measures of central tendency – where is
      the center of the data?
           Mean – arithmetic average
              • Means are sensitive to very large or small values
              • Mean CD4 cell count: 296.9
              • Mean age: 32.5



                  Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
                      Interpreting the formula
     ∑ is the symbol for the sum of the elements immediately to
      the right of the symbol

     These elements are indexed (i.e. subscripted) with the letter i
               (The index letter could be any letter, though i is commonly used.)

     The elements are lined up in a list: the first one in the list
      is denoted x_1, the second one x_2, the third one x_3,
      and the last one x_n.

     n is the number of elements in the list.

            $\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$

            Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
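
The summation in the mean formula is just a loop over the list. This is a minimal Python sketch rather than the course's Stata; the function name `mean` is illustrative and the ages are made up.

```python
def mean(values):
    """Arithmetic mean: add up x_1 ... x_n, then divide by n."""
    n = len(values)
    if n == 0:
        raise ValueError("mean of an empty list is undefined")
    total = 0.0
    for x_i in values:  # i runs over every element of the list
        total += x_i
    return total / n


# Hypothetical ages, made up for illustration only.
ages = [20, 30, 40, 50]
average_age = mean(ages)

# One very large value pulls the mean up noticeably.
average_with_extreme = mean(ages + [110])
print(average_age, average_with_extreme)
```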
        Numerical variable summaries
       Sample variance
             Amount of spread around the mean,
              calculated in a sample by

              $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

       Sample standard deviation (SD) is
        the square root of the variance

              $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

             The standard deviation has the same
              units as the mean

       SD of CD4 cell count = 255.4
       SD of Age = 11.2
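
The variance and SD formulas can be written out step by step. This is a minimal Python sketch rather than the course's Stata; the function names are illustrative and the data values are made up.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(values) / n
    return sum((x_i - x_bar) ** 2 for x_i in values) / (n - 1)


def sample_sd(values):
    """Square root of the sample variance; same units as the data."""
    return sqrt(sample_variance(values))


# Hypothetical measurements, made up for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
variance = sample_variance(data)
sd = sample_sd(data)
print(f"variance={variance:.3f}  SD={sd:.3f}")
```

The mean of these values is 5 and the squared deviations sum to 32, so the sample variance is 32/7.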
        Numerical variable summaries
       Coefficient of variation

             $CV = \frac{s}{\bar{x}} \times 100\%$

             For the same relative spread around a
              mean, the variance will be larger for a
              larger mean
             Can be used to compare variability across
              measurements that are on different
              scales (e.g. IQ and head circumference)

             CV for CD4 cell count: 86.0%
             CV for age: 34.5%
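
The two CV values above can be reproduced from the means and SDs reported earlier in the lecture (CD4: mean 296.9, SD 255.4; age: mean 32.5, SD 11.2). This is a minimal Python sketch rather than the course's Stata; the function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """SD expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100


# Means and SDs as reported in the lecture.
cv_cd4 = coefficient_of_variation(255.4, 296.9)
cv_age = coefficient_of_variation(11.2, 32.5)
print(f"CV for CD4 cell count: {cv_cd4:.1f}%")
print(f"CV for age: {cv_age:.1f}%")
```

Because the CV is unitless, the two results can be compared directly even though CD4 counts and ages are measured on different scales.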
             Pocket/wallet change
   Histogram, box plot
   Mode, Median, 25th percentile, 75th percentile
   Mean, SD
   Differ by gender?
                For next time
 Read Pagano and Gauvreau
     Chapters 1-3 (Review of today’s material)
     Chapter 6

								