Posted on: 5/25/2012. Public Domain.
Biostat 200
Introduction to Biostatistics
Lecture 1

Course instructors
Judy Hahn, M.A., Ph.D.
Judy.hahn@ucsf.edu
(415) 206-4435
TAs
• Michelle Odden, Ph.D., M.S.
• Megumi Okumura, M.D.
• Maya Vijayaraghavan, M.D.
• Robin Wallace, M.D.

The details
Lectures: Tuesdays 10:30-12:30
Labs: Thursday 10:30-12
• Lab 1: Room CB 6702
• Lab 2: Room CB 6704
Office hrs: Thursday 12-1, Room CB 5715
Course credits: 3

The details: Readings
Required readings will be from Principles of Biostatistics by M. Pagano and K. Gauvreau. Duxbury. 2nd edition.
Please read the assigned chapters before lecture, and review them after lecture.

The details: Assignments
Assignments will be posted on Thursdays and are due on Sunday at 5 p.m., 1.5 weeks later.
• Data collection (Assignment 1 only)
• Data analysis and interpretation
• Exercises in the book
• Reading and interpretation of scientific publications
You must attend Lab 1 to receive Assignment 1.

The details: Grading
Homework (75%)
• 5 assignments
• Varying in length; each homework problem is worth (usually 10) points toward the final homework score
Final exam (25%)
LATE ASSIGNMENTS WILL NOT BE ACCEPTED!!!

Assignments
Send to your TAs
• Lab 1: Megan Okumura, Robin Wallace: ticr.biostat200.1@gmail.com
• Lab 2: Michelle Odden, Maya Vijayaraghavan: ticr.biostat200.2@gmail.com

What I do and why

Course goals
• Familiarity with basic biostatistics terms and nomenclature
• Ability to summarize data and do basic statistical analyses using STATA
• Ability to understand basic statistical analyses in published journals
• Understanding of key concepts including statistical hypothesis testing: critical quantitative thinking
• Foundation for more advanced analyses

Today’s topics
• Variables: numerical versus categorical
• Tables (frequencies)
• Graphs (histograms, box plots, scatter plots, line graphs)
Required reading: Pagano Chapter 2

Types of data
Data are made up of a set of variables
Categorical variables: any variable that is not numerical (values have no numerical meaning) (e.g.
gender, race, drug, disease status)
• Nominal variables
• Ordinal variables
Pagano and Gauvreau, Chapter 2

Types of data
Categorical variables
Nominal variables:
• The data are unordered (e.g. RACE: 1=Caucasian, 2=Asian American, 3=African American)
• A subset of these are binary or dichotomous variables, which have only two categories (e.g. GENDER: 1=male, 2=female)
Ordinal variables:
• The data are ordered (e.g. AGE: 1=10-19 years, 2=20-29 years, 3=30-39 years; likelihood of participating in a vaccine trial)

Types of data
Numerical (quantitative) variables: naturally measured as numbers for which meaningful arithmetic operations make sense (e.g. height, weight, age, salary, viral load, CD4 cell counts)
• Discrete variables: can be counted (e.g. number of children in household: 0, 1, 2, 3, etc.)
• Continuous variables: can take any value within a given range (e.g. weight: 2974.5 g, 3012.6 g)

Types of data
Manipulation of variables
Continuous variables can be discretized
• E.g., age can be rounded to whole numbers
Continuous or discrete variables can be categorized
• E.g., age categories
Categorical variables can be re-categorized
• E.g., lumping from 5 categories down to 2

Frequency tables
Categorical variables are summarized by
• Frequency counts: how many are in each category
• Relative frequency or percent (a number from 0 to 100)
• Or proportion (a number from 0 to 1)

Gender of new HIV clinic patients, 2006-2007, Mbarara, Uganda.
          n (%)
Male      415 (39)
Female    645 (61)
Total     1060 (100)

Frequency tables
Continuous variables can be categorized in meaningful ways
Choice of cutpoints
• Even intervals
• Meaningful cutpoints related to a health outcome or decision
• Equal percentage of the data falling into each category

Frequency tables
CD4 cell counts (cells/mm³) of newly diagnosed HIV positives at Mulago Hospital, Kampala (N=268)
            n (%)
≤50         40 (14.9)
51-200      72 (26.9)
201-350     58 (21.6)
>350        98 (36.6)

Bar charts
General graph for categorical variables
Graphical equivalent of a frequency table
The x-axis does not have to be numerical
[Figure: bar chart, "Alcohol consumption in Mulago Hospital patients enrolling in VCT study, n=929"; y-axis: Proportion; x-axis categories: Never, >1 year ago, Within the past year]

Histograms
Bar chart for numerical data
The number of bins and the bin width will make a difference in the appearance of this plot and may affect interpretation
[Figure: histogram, "CD4 among new HIV positives at Mulago"; x-axis: CD4 cell count; y-axis: percent]
    histogram cd4count, fcolor(blue) lcolor(black) width(50) name(cd4_by50) title(CD4 among new HIV positives at Mulago) xtitle(CD4 cell count) percent

Histograms
This histogram has less detail but gives us the % of persons with CD4 <350 cells/mm³
[Figure: histogram, "CD4 among new HIV positives at Mulago"; x-axis: CD4 cell count; y-axis: percent]
    histogram cd4count, fcolor(blue) lcolor(black) width(350) name(cd4_by350) title(CD4 among new HIV positives at Mulago) xtitle(CD4 cell count) percent

What does this graph tell us?
[Figure: histogram, "Days drank alcohol among current drinkers"; x-axis: Days (0 to 30); y-axis: Relative freq]

Box plots
• Middle line = median (50th percentile)
• Middle box = 25th to 75th percentiles (interquartile range)
• Bottom whisker: smallest data point at or above (25th percentile - 1.5*IQR)
• Top whisker: largest data point at or below (75th percentile + 1.5*IQR)
[Figure: box plot; y-axis: Days drank alcohol (0 to 30)]

Box plots
[Figure: box plot, "CD4 count among new HIV positives at Mulago"; y-axis: cd4count]
    graph box cd4count, box(1, fcolor(blue) lcolor(black) fintensity(inten100)) title(CD4 count among new HIV positives at Mulago)

Box plots by another variable
We can divide up our graphs by another variable
What type of variable is gender?
[Figure: box plots of days drank alcohol in separate panels for male and female; panel note: "Graphs by a1. sex"]

Histograms by another variable
[Figure: histograms of "Days consumed alcohol of prior 30" in separate panels for male and female; panel note: "Graphs by a1. sex"]

Numerical variable summaries
Mode: the value (or range of values) that occurs most frequently
Sometimes there is more than one mode, e.g. a bi-modal distribution (the two modes do not have to be the same height)
The mode only makes sense when the values are discrete, rounded off, or binned
[Figure: histogram of Grades (x-axis 62 to 97) with frequency f on the y-axis]
Pagano and Gauvreau, Chapter 3

Scatter plots
[Figure: scatter plot, "CD4 cell count versus age"; y-axis: CD4 cell count; x-axis: "a4. how old are you?"]

The importance of good graphs
http://niemann.blogs.nytimes.com/2009/09/14/good-night-and-tough-luck/

Numerical variable summaries
Measures of central tendency: where is the center of the data?
Median: the 50th percentile, i.e. the middle value of the sorted observations
• If n is odd: the median is the ((n+1)/2)th observation (e.g. if n=31 then the median is the 16th highest observation)
• If n is even: the median is the average of the two middle observations (e.g.
if n=30 then the median is the average of the 15th and 16th observations)
Median CD4 cell count in previous data set = 234.5

Numerical variable summaries
Range
• Minimum to maximum, or their difference (e.g. age range 15-58 or range=43)
• CD4 cell count range: (0-1368)
Interquartile range (IQR)
• 25th and 75th percentiles (e.g. IQR for age: 23-36), or their difference (e.g. 13)
• Less sensitive to extreme values
• CD4 cell count IQR: (92-422)

Numerical variable summaries
Measures of central tendency: where is the center of the data?
Mean: arithmetic average
• Means are sensitive to very large or small values
• Mean CD4 cell count: 296.9
• Mean age: 32.5
Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Interpreting the formula
• ∑ is the symbol for the sum of the elements immediately to the right of the symbol
• These elements are indexed (i.e. subscripted) with the letter i (the index letter could be any letter, though i is commonly used)
• The elements are lined up in a list: the first one in the list is denoted $x_1$, the second one $x_2$, the third one $x_3$, and the last one $x_n$
• n is the number of elements in the list
$\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$
Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Numerical variable summaries
Sample variance
Amount of spread around the mean, calculated in a sample by
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Sample standard deviation (SD) is the square root of the variance
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
The standard deviation has the same units as the mean
• SD of CD4 cell count = 255.4
• SD of age = 11.2

Numerical variable summaries
Coefficient of variation
For the same relative spread around a mean, the variance will be larger for a larger mean
$CV = \frac{s}{\bar{x}} \times 100\%$
Can be used to compare variability across measurements that are on different scales (e.g.
IQ and head circumference)
• CV for CD4 cell count: 86.0%
• CV for age: 34.5%

Pocket/wallet change
• Histogram, boxplot
• Mode, Median, 25th percentile, 75th percentile
• Mean, SD
• Differ by gender?

For next time
Read Pagano and Gauvreau
• Chapters 1-3 (Review of today’s material)
• Chapter 6
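
Worked sketch for the pocket/wallet change exercise
The summaries listed for the pocket/wallet change exercise could be produced in Stata roughly as follows. This is only a sketch under stated assumptions: the ten data values are invented for illustration (they are not class data), and the variable names change and female, the label name femalelbl, and the graph names are hypothetical. Note that tabstat reports the coefficient of variation as the ratio s / mean, so multiply by 100 to get the percentage used in this lecture.

```stata
* Illustrative sketch only: the values below are invented for demonstration.
* change = pocket change in cents, female = 0 (male) or 1 (female); both names are hypothetical.
clear
input change female
  0 0
 25 0
 35 1
 50 1
 50 0
 85 1
110 0
135 1
160 0
425 1
end
label define femalelbl 0 "male" 1 "female"
label values female femalelbl

* Histogram (25-cent bins) and box plot for the whole sample
histogram change, width(25) percent xtitle("Pocket change (cents)") name(change_hist, replace)
graph box change, name(change_box, replace)

* Frequency table: the most frequent value is the mode
tabulate change

* Mean, SD, and percentiles (25th, 50th = median, 75th)
summarize change, detail

* The same summaries by gender, plus the coefficient of variation (s / mean)
tabstat change, by(female) statistics(n mean sd cv p25 p50 p75)
graph box change, over(female) name(change_box_by_sex, replace)
```

With so few observations the histogram will look sparse. Changing width() shows how the choice of bin width affects the appearance of the plot, as discussed on the histogram slides.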