Multiple Regression

Control of Confounding Variables
• Randomization
• Matching
• Adjustment
  – Direct
  – Indirect
• Stratified methods
  – Mantel-Haenszel
• Multiple Regression
  – Linear
  – Logistic
  – Poisson
  – Cox

Limitations of the Stratified Methods
• Can study only one independent variable at a time
• Problematic when there are too many variables to adjust for (too many strata)
• Limited to categorical variables (a continuous variable can be categorized, but this may result in residual confounding)

How to Investigate Associations Between Variables?
• Between two categorical variables:
  – Contingency table, odds ratio, χ²
• Between a categorical and a continuous variable:
  – Compare means, t test, ANOVA
• Between two continuous variables:
  – Example: relationship between air pollution and health status

Measure of pollution | Measure of health status
--- | ---
73 | 90
52 | 74
68 | 91
47 | 62
60 | 63
71 | 78
67 | 60
80 | 89
86 | 82
91 | 105
67 | 76
73 | 82
71 | 93
57 | 73
86 | 82
76 | 88
91 | 97
69 | 80
87 | 87
77 | 95

[Figure: scatter plot of health status (y-axis) by pollution level (x-axis, 0 to 120) in 20 geographic areas]

Suppose we now wish to know whether our two variables are linearly related
• The question becomes:
  – Are the data we observed compatible with the two variables being linearly related? That is,
  – Is the true association between the two variables defined by a straight line, and the scatter we see just random error around the truth?

[Figure: the same scatter plot of health status by pollution level, annotated first "r = ?" and then "r ≈ 0.7"]

• Then, the next practical question in our evaluation of whether the relationship is linear:
  – How can the fit of the data to a straight line be measured?
• Correlation Coefficient (Pearson): the extent to which the two variables vary together
• Linear Regression Coefficient: most useful when we wish to know the strength of the association

Correlation Coefficient (Pearson)
r: ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation)
[Figure: three scatter plots illustrating r = 1.0, r = -0.8 and r = 0]

Linear Regression Coefficient of a Straight Line
$$y = \beta_0 + \beta_1 x$$
• $\beta_0$: intercept (at x = 0, y = $\beta_0$)
• $\beta_1$: linear regression coefficient
  – increase in y per unit increase in x
  – expresses the strength of the association
  – allows prediction of the value of y, given x
The trick is to find the “line” ($\beta_0$, $\beta_1$) that best fits the observed data.

In linear regression, the least-squares approach estimates the line that minimizes the sum of the squared vertical distances between each point and the line.
[Figure: scatter plot of Y = health status by X = pollution level (0 to 120) with the fitted line]
$$\text{Health status} = \beta_0 + \beta_1 (\text{pollution})$$
$$\text{Health status} = 30.8 + 0.71 (\text{pollution})$$

Simple Linear Regression
• The “points” (observations) can be individuals, or conglomerates of individuals (e.g., regions, countries, families) in ecologic studies.
• When X is inversely related to Y, b (β) is negative. Note: when estimating from samples, the notation “b” is used instead of β.
• In epidemiologic studies, the value of the intercept (b0 or β0) is frequently irrelevant (X = 0 is meaningless for many variables)
  – E.g., relationship of weight (X) to systolic blood pressure (Y):
[Figure: scatter plot of SBP (mmHg, 100 to 200) by weight (lb, 0 to 200); the fitted line extended back to weight = 0 is marked "?"]

FUNDAMENTAL ASSUMPTION IN THE LINEAR MODEL: x and y are linearly related, i.e., the increase in y per unit increase of x (β) is constant across the entire range of x.
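As a rough check of the fitted line quoted above, the closed-form least-squares formulas can be applied to the 20 pollution and health-status pairs from the data table. This is only a minimal sketch: the function name `simple_linear_fit` and the variable names are illustrative and do not come from the slides.

```python
# Pollution (x) and health-status (y) scores for the 20 geographic areas
# listed in the data table of these slides.
pollution = [73, 52, 68, 47, 60, 71, 67, 80, 86, 91,
             67, 73, 71, 57, 86, 76, 91, 69, 87, 77]
health = [90, 74, 91, 62, 63, 78, 60, 89, 82, 105,
          76, 82, 93, 73, 82, 88, 97, 80, 87, 95]


def simple_linear_fit(x, y):
    """Return (b0, b1, r) using the closed-form least-squares formulas."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope: increase in y per unit of x
    b0 = mean_y - b1 * mean_x      # intercept: value of y when x = 0
    r = sxy / (sxx * syy) ** 0.5   # Pearson correlation coefficient
    return b0, b1, r


b0, b1, r = simple_linear_fit(pollution, health)
print(f"Health status = {b0:.1f} + {b1:.2f} (pollution), r = {r:.2f}")
```

The slope and the correlation coefficient share the same numerator (the sum of cross-products), which is why they always have the same sign even though they measure different things.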
For example, under the linearity assumption the increase in the health status index between pollution levels 40 and 50 is the same as that between pollution levels 90 and 100.
[Figure: scatter plot of Y = health status by X = pollution level (0 to 120) with the fitted straight line]

FUNDAMENTAL ASSUMPTION IN THE LINEAR MODEL: x and y are linearly related
However, if the data look like this:
[Figure: scatter plot of y by x following a “u-shaped” function, with a straight line through it labeled “Wrong model!”]

BOTTOM LINE: LOOK AT THE DATA BEFORE YOU DECIDE ON THE BEST MODEL!
– Plot $y_i$ vs. $x_i$
If non-linear patterns are present:
– Use quadratic terms (e.g., age²), logarithmic terms (e.g., log(x)), etc.
– Categorize and use dummy variables

Other important points to keep in mind
• Like any other “sample statistic”, b is subject to error. Formulas to calculate the standard error of b are available in most statistics textbooks.
• “Statistical significance” of b (hypothesis testing):
  – H0: b = 0 (no linear association between x and y)
  – H1: b ≠ 0 (x and y are linearly related)
  – Test statistic: Wald statistic (z-value) = b/SE(b)
• WARNING: THIS TEST IS ONLY SENSITIVE FOR LINEAR ASSOCIATIONS. A NON-SIGNIFICANT RESULT DOES NOT IMPLY THAT x AND y ARE NOT ASSOCIATED, BUT MERELY THAT THEY ARE NOT LINEARLY ASSOCIATED.
• Confidence interval (precision) for b:
  – 95% CI = b ± 1.96 × SE(b)
• The regression coefficient (b) is related to the correlation coefficient (r), but the former is generally preferable because:
  – It gives some sense of the strength of the association, not only the extent to which two variables vary concurrently in a linear fashion.
  – It allows prediction of Y as a function of X.

Correlation Coefficient (Pearson)
r: ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation)
[Figure: four scatter plots illustrating r = 1.0 (two panels with different slopes), r = -0.8 and r = 0]
Note: functions having different slopes may have the same correlation coefficient

The equation:
$$y_i = b_0 + b_1 x_1$$
naturally extends to multiple variables (multidimensional space):
$$y_i = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$$
[Figure: the plane $y_i = b_0 + b_1 x_1 + b_2 x_2$ drawn on axes y, x1 and x2]

Multiple regression coefficients:
• b1: increment in y per unit increment in x1, after the effect of x2 on y and on x1 has been removed; or, the effect of x1 on y, adjusted for x2
• b2: increment in y per unit increment in x2, after the effect of x1 on y and on x2 has been removed; or, the effect of x2 on y, adjusted for x1
• (b0: value of y when both x1 and x2 are equal to 0)

Exposed: x1 = 1 (smoker); unexposed: x1 = 0 (non-smoker); y: lung cancer

Confounder present, x2 = 1 (drinker):
$$y_{\text{exposed}} = b_0 + b_1 x_1 + b_2 x_2 + e \quad (x_1 = 1)$$
$$y_{\text{unexposed}} = b_0 + b_1 x_1 + b_2 x_2 + e \quad (x_1 = 0, \text{ so } b_1 x_1 = 0)$$
$$AR_{\text{exp}} = y_{\text{exposed}} - y_{\text{unexposed}} = b_1$$

Confounder absent, x2 = 0 (non-drinker):
$$y_{\text{exposed}} = b_0 + b_1 x_1 + b_2 x_2 + e \quad (x_1 = 1)$$
$$y_{\text{unexposed}} = b_0 + b_1 x_1 + b_2 x_2 + e \quad (x_1 = 0, \text{ so } b_1 x_1 = 0)$$
$$AR_{\text{exp}} = y_{\text{exposed}} - y_{\text{unexposed}} = b_1$$
Same! (no interaction)

With an interaction term ($b_3 x_1 x_2$):
$$y_i = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2$$
Multiple regression coefficients:
• b1: increment in y per unit increment in x1 in individuals not exposed to x2
• b2: increment in y per unit increment in x2 in individuals not exposed to x1
• b3: additional increment in y in the joint presence of x1 and x2, beyond the sum of their separate effects (compared to individuals exposed to neither x1 nor x2, the joint effect is b1 + b2 + b3)

Multiple Linear Regression Notes
• To obtain least-squares estimates of the b's, need to use matrix algebra… or computers!
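The “matrix algebra” mentioned in the note above usually means solving the normal equations, (X'X) b = X'y. The sketch below does this in plain Python on made-up data that loosely follows the smoker/drinker example; the data values, the true coefficients (10, 5, 2) and the helper names are all hypothetical, and in practice a statistics package would be used instead.

```python
def solve_linear_system(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are not modified
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(rows, y):
    """rows: one list of predictors [x1, x2, ...] per observation.

    Returns [b0, b1, b2, ...] solving the normal equations (X'X) b = X'y.
    """
    design = [[1.0] + list(r) for r in rows]  # leading 1 carries b0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: x1 = smoker (0/1), x2 = drinker (0/1).
# y is generated without error from y = 10 + 5*x1 + 2*x2 (made-up values).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (0, 0), (1, 1)]
y = [10 + 5 * x1 + 2 * x2 for x1, x2 in rows]

b0, b1, b2 = least_squares(rows, y)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```

Because the made-up outcome has no error term and no interaction, the recovered b1 is the same difference between exposed and unexposed in both strata of x2, which is the “no interaction” situation described above.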
• Important assumptions:
  – Linearity
  – No (additive) interaction, i.e.,
    • The absolute effect of x1 is independent of x2, or
    • The effects of x1 and x2 are merely additive (i.e., not “less, or more than additive”)
    • NOTE: if there is interaction, product terms can be introduced in the model to account for it (it is, however, better to do stratified analysis)
• Other assumptions:
  – Observations (i's) are independent
  – Homoscedasticity: variance of y is constant across x-values
  – Normality: for a given value of x, values of y are normally distributed

In Linear Regression (simple or multiple), Independent Variables (x's) can be:
• Continuous
  – Pollution level (score)
  – BMI (kg/m²)
  – Blood pressure (mmHg)
  – Age (years)
• Categorical
  – Dichotomous (conventionally, one of the values is coded as “1” and the other as “0”)
    • Gender (male/female)
    • Treatment (yes/no)
    • Smoking (yes/no)
  – Ordinal
    • Any continuous variable categorized in percentiles (tertiles, quartiles, etc.)

In Linear Regression (simple or multiple), the Dependent Variable (y) can be:
• Discrete (yes/no)
  – Incident cancer
  – Recurrent cancer
• Continuous
  – Systolic blood pressure (mmHg)
  – Serum cholesterol (mg/dL)
  – BMI (kg/m²)

Example of x as a discrete variable (obesity) and y as a continuous variable (systolic blood pressure, mmHg)
[Figure: SBP (110 to 160 mmHg) by obesity (x = 0 if “no”; x = 1 if “yes”); the average difference over the one-unit step from 0 to 1 is the regression coefficient, or slope, b1]
When x = 1: $SBP = b_0 + b_1 \times 1 = b_0 + b_1$
When x = 0: $SBP = b_0 + b_1 \times 0 = b_0$
Difference: $\Delta SBP = b_1$
Thus, b1 = increase in SBP per unit increase in obesity = average difference in SBP between “obese” and “non-obese” individuals

Example of x as a discrete variable with more than 2 categories (e.g., educational level) and y as a continuous variable (systolic blood pressure, mmHg)
• Ordinal variables (x's) can be entered into the regression equation as single x's. Example: $SBP = b_0 + b_1(\text{educ}) + b_2(\text{age})$
• Where education is categorized into “low”, “medium” and “high”.
• Thus, x1 = 1 when “low”, x1 = 2 when “medium” and x1 = 3 when “high”
[Figure: SBP (110 to 160) by educational level (low, medium, high) under $SBP = b_0 + b_1(\text{educ}) + b_2(\text{age})$; the step b1 is the same between adjacent levels]
HOWEVER, the model assumes that the difference in SBP (decrease) between “low” (x1 = 1) and “medium” (x1 = 2) is the same as that between “medium” (x1 = 2) and “high” (x1 = 3): the assumption of linearity.
Alternative: dummy variables, described next.

Non-ordinal multilevel categorical variable
• Race (Asian, Black, Hispanic, White)
• Treatment (A, B, C, D)
• Smoking (cigarette, pipe, cigar, nonsmoker)
How to include these variables in a multiple regression model?
“Dummy” or indicator variables: define the number of dummy dichotomous variables as the number of categories minus one

Use of dummy variables
Example: “Race” categorized as Asian, Black, Hispanic and White. Thus, to model “race”:
$$SBP = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$$
where
x1 = 1 if Asian, x1 = 0 otherwise
x2 = 1 if Black, x2 = 0 otherwise
x3 = 1 if Hispanic, x3 = 0 otherwise

$SBP_{\text{Asians}} = b_0 + b_1$
$SBP_{\text{Blacks}} = b_0 + b_2$
$SBP_{\text{Hispanics}} = b_0 + b_3$
$SBP_{\text{Whites}} = b_0$
Thus, what is the interpretation of b0, b1, b2, and b3?

Definitions of Dummy Variables

Race | x1 | x2 | x3
--- | --- | --- | ---
Asian | 1 | 0 | 0
Black | 0 | 1 | 0
Hispanic | 0 | 0 | 1
White | 0 | 0 | 0

• b0 = average value of y in Whites (reference category)
• b1 = average difference in y between Asians and Whites
• b2 = average difference in y between Blacks and Whites
• b3 = average difference in y between Hispanics and Whites

Use of dummy variables when the function is not a straight line
[Figure: SBP (110 to 160) by BMI quintile (1 to 5); a straight line through the non-linear pattern is labeled “WRONG MODEL!!!”]
[Figure: the same plot of SBP by BMI quintile, with each quintile modeled separately]
Model:
$$SBP = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4$$
where
x1 = 1 if BMI quintile = 2; x1 = 0 otherwise
x2 = 1 if BMI quintile = 3; x2 = 0 otherwise
x3 = 1 if BMI quintile = 4; x3 = 0 otherwise
x4 = 1 if BMI quintile = 5; x4 = 0 otherwise
Note: each b represents the difference between each quintile (2, 3, 4 and 5) and the reference quintile (quintile 1).
Thus, the difference is negative for quintile 2, slightly negative for quintile 3, and positive for quintiles 4 and 5. Can also obtain the difference between quintiles: for example, b4 – b3 is the difference between quintiles 5 and 4.

Multiple linear regression models of leukocyte count (thousands/mm³) by selected factors, in never smokers, ARIC study, 1986-89 (Nieto et al, AJE 1992;136:525-37)

Variable | Model 1* (R² = 0.09): b | SE(b) | Model 2** (R² = 0.21): b | SE(b)
--- | --- | --- | --- | ---
Age (5 years) | -0.066 | 0.019 | -0.066 | 0.018
Sex (male = 1, female = 0) | 0.478 | 0.065 | 0.030 | 0.073
Race (W = 1, B = 0) | 0.495 | 0.122 | 0.333 | 0.117
Work activity score (1 unit) | -0.065 | 0.021 | -0.061 | 0.020
Subscapular skinfold (10 mm) | 0.232 | 0.018 | 0.084 | 0.020
SBP (10 mmHg) | 0.040 | 0.011 | 0.020 | 0.011
FEV1 (1 liter) | -0.208 | 0.047 | -0.183 | 0.045
Heart rate (10 beats/minute) | 0.206 | 0.020 | 0.128 | 0.019

*Model 1: adjusted for center, education, height, apolipoprotein A-I, glucose and for the other variables shown in the table.
**Model 2: adjusted for the same variables included in Model 1 plus hemoglobin, platelet, uric acid, insulin, HDL, apolipoprotein B, triglycerides, factor VIII, fibrinogen, antithrombin III, protein C antigen and APTT

Control of Confounding Variables
• Random allocation
• Matching
  – Individual
  – Frequency
  – Restriction
• Adjustment
  – Direct
  – Indirect
  – Mantel-Haenszel
  – MULTIPLE REGRESSION
    • Linear model
    • LOGISTIC MODEL

AN ALTERNATIVE TO THE LINEAR MODEL
When the dependent variable is dichotomous (1/0)
The probability of disease (y) given exposure (x):
$$P(y \mid x) = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$
Or, simplifying:
$$P = \frac{1}{1 + e^{-b}}$$
[Figure: S-shaped curve of probability of response (P, from 0 through 0.5 to 1.0) by dose (x)]

EXPONENTS AND LOGARITHMS: Brief Review
$\log A = B \iff A = 10^B$. E.g., $\log 100 = 2 \iff 100 = 10^2$
$\ln A = B \iff A = e^B$. E.g., $\ln 5 = 1.609 \iff 5 = e^{1.609}$ (with $e \approx 2.718$)
Notation: $e^B = \exp(B)$
(Note: in most epidemiologic literature, ln A is written as log A)

Logs: Brief Review (Cont.)
$\log A + \log B = \log(A \times B)$
$\log A - \log B = \log(A / B)$
$A^{-B} = 1 / A^{B}$. Example: $100 = 1/0.01 = 1/10^{-2} = 10^{2} = 100$
$e^{-B} = 1 / e^{B}$

Starting from $P = 1/(1 + e^{-b})$ and writing $B = b_0 + b_1 x$ for the exponent:
$$P = \frac{1}{1 + e^{-B}} = \frac{1}{1 + \frac{1}{e^{B}}} = \frac{e^{B}}{1 + e^{B}}$$
$$1 - P = 1 - \frac{e^{B}}{1 + e^{B}} = \frac{1}{1 + e^{B}}$$
$$\frac{P}{1 - P} = \frac{e^{B} / (1 + e^{B})}{1 / (1 + e^{B})} = e^{B}$$
THUS:
$$\text{Odds} = \frac{P}{1 - P} = e^{B} = e^{b_0 + b_1 x}$$
$$\log\left(\frac{P}{1 - P}\right) = \log(\text{Odds}) = b_0 + b_1 x$$
[Figure: straight line of log(Odds) by x, with intercept b0 and slope b1 per unit increment in x]
b1 = increment in log(Odds) per unit increment in x
Remember that:
$$\log(\text{Odds})_{x+1} - \log(\text{Odds})_{x} = \log\left(\frac{(\text{Odds})_{x+1}}{(\text{Odds})_{x}}\right) = \log(\text{Odds Ratio})$$
Thus, b1 is the log of the Odds Ratio!!

Assume prospective data in which exposure (independent variable) is defined dichotomously (x):

 | Disease (y = 1) | Non-disease (y = 0)
--- | --- | ---
Exposed (x = 1) | P1 | 1 – P1
Unexposed (x = 0) | P0 | 1 – P0

For exposed (x = 1): $\log\left(\frac{P_1}{1 - P_1}\right) = b_0 + b_1 \times 1 = b_0 + b_1$
For unexposed (x = 0): $\log\left(\frac{P_0}{1 - P_0}\right) = b_0 + b_1 \times 0 = b_0$
$$b_1 = \log\left(\frac{P_1}{1 - P_1}\right) - \log\left(\frac{P_0}{1 - P_0}\right) = \log\left(\frac{P_1 / (1 - P_1)}{P_0 / (1 - P_0)}\right) = \log(OR)$$
$$OR = e^{b_1} \quad \text{(antilog of } b_1\text{)}$$

WITH CASE-CONTROL DATA:
• Intercept (b0) is uninterpretable
• Can obtain unbiased estimates of the regression coefficient (b1) (See Schlesselman, pp. 235-7)

The logistic model extends to the multivariate situation:
$$P(y \mid x) = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k)}}$$
$$\log\left(\frac{P}{1 - P}\right) = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$$

Interpretation of multiple logistic regression coefficients:
• Dichotomous x: b1 is the log(OR) for x = 1 compared to x = 0, after adjustment for the remaining x's
• Continuous x: b1 is the log(OR) for an increment of 1 unit in x, after adjustment for the remaining x's
  – Thus, 10 × b1 is the log(OR) for an increment of 10 units of x, after adjustment for the remaining x's
  – CAUTION: assumes a linear increase in the log(OR) throughout the entire range of x values

Logistic Regression Using Dummy Variables: Cross-Sectional Association Between Demographic Factors and Depressive State, NHANES, Mexican-Americans Aged 20-74 Years, 1982-4

Factor | b | OR | P value
--- | --- | --- | ---
Intercept | -3.1187 | - | -
Sex (female = 1, male = 0) | 0.8263 | 2.28 | 0.00
Age 20-24 | Reference | 1.00 | -
Age 25-34 | 0.1866 | 1.20 | 0.11
Age 35-44 | -0.1112 | 0.89 | 0.60
Age 45-54 | -0.1264 | 0.88 | 0.52
Age 55-64 | -0.1581 | 0.85 | 0.32
Age 65-74 | -0.3555 | 0.70 | 0.19
Years of education 0-6 | 0.8408 | 2.32 | 0.00
Years of education 7-11 | 0.4470 | 1.56 | 0.01
Years of education 12 | 0.2443 | 1.28 | 0.21
Years of education 13 | Reference | 1.00 | -

Generalized Linear Models

Model | Equation | Interpretation
--- | --- | ---
Linear | $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$ | Increase in the mean value of outcome y per unit increase in x1, adjusted for all other variables in the model
Logistic | $\log(\text{odds}) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$ | Increase in log(odds) of outcome per unit increase in x1, adjusted for all other variables in the model
Cox | $\log(\text{hazard}) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$ | Increase in log(hazard) of outcome per unit increase in x1, adjusted for all other variables in the model
Poisson | $\log(\text{rate}) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$ | Increase in log(rate) of outcome per unit increase in x1, adjusted for all other variables in the model

Logistic Regression Notes
• Popularity of logistic regression results from its predictive ability (predicted probabilities above 1.0 or below 0 are impossible with this model).
• Least-squares solution for logistic regression does not work. Need maximum likelihood estimates… i.e., computers!
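To make the last note concrete, the sketch below finds the maximum likelihood estimates for a logistic model with one dichotomous exposure by Newton-Raphson iteration, and then checks the earlier result that OR = e^b1 against the cross-product ratio of the 2×2 table. The cell counts and the function name `fit_logistic_2x2` are hypothetical, chosen only for illustration; real analyses would rely on a statistics package.

```python
import math


def fit_logistic_2x2(a, b, c, d, iterations=25):
    """Maximum likelihood fit of log(odds) = b0 + b1*x for a 2x2 table.

    a, b: diseased / non-diseased among the exposed (x = 1)
    c, d: diseased / non-diseased among the unexposed (x = 0)
    Uses Newton-Raphson on the binomial log-likelihood.
    """
    n1, n0 = a + b, c + d
    b0 = b1 = 0.0
    for _ in range(iterations):
        p1 = 1 / (1 + math.exp(-(b0 + b1)))  # P(disease | exposed)
        p0 = 1 / (1 + math.exp(-b0))         # P(disease | unexposed)
        # Score vector (first derivatives of the log-likelihood)
        u0 = (a - n1 * p1) + (c - n0 * p0)
        u1 = a - n1 * p1
        # Information matrix (negative second derivatives)
        w1 = n1 * p1 * (1 - p1)
        w0 = n0 * p0 * (1 - p0)
        i00, i01, i11 = w1 + w0, w1, w1
        det = i00 * i11 - i01 * i01
        # Newton step: (b0, b1) += inverse(information) @ score
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1


# Hypothetical cohort: 30/100 exposed and 10/100 unexposed develop disease.
a, b, c, d = 30, 70, 10, 90
b0, b1 = fit_logistic_2x2(a, b, c, d)
cross_product_or = (a * d) / (b * c)
print(f"exp(b1) = {math.exp(b1):.3f}, ad/bc = {cross_product_or:.3f}")
```

With a single dichotomous exposure the model is saturated, so the fitted e^b1 coincides with the crude odds ratio; the iterative machinery only becomes indispensable once several x's are adjusted for simultaneously.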
• 95% confidence limits for the Odds Ratio: $e^{b \pm 1.96 \times SE(b)}$

Logistic Regression on 7-Year Follow-Up, Washington County ARIC Cohort, Ages 45-64 Years at Baseline (1987-89)

Factor (x) | b | Odds Ratio
--- | --- | ---
Intercept | -4.5670 | -
Gender (male = 1, female = 0) | 1.3106 | 3.71
Smoking (yes = 1, no = 0) | 0.7030 | 2.02
Age (1 year) | 0.1444 | 1.16
Systolic Blood Pressure (1 mmHg) | 0.5103 | 1.67
Serum Cholesterol (1 mg/dL) | 0.4916 | 1.63
Body Mass Index (1 kg/m²) | 0.1916 | 1.21

What is the probability (P) that can be predicted from this model for a male smoker less than 55 years old, who is hypertensive, non-hypercholesterolemic and obese?
$$P = \frac{\text{Odds}}{1 + \text{Odds}}$$
Coding used in the calculation: male = 1, smoker = 1, age less than 55 = 0, hypertensive = 1, non-hypercholesterolemic = 0, obese = 1.
$$P = \frac{\text{Odds}}{1 + \text{Odds}} = \frac{1}{1 + e^{-[(-4.5670) + (1.3106 \times 1) + (0.7030 \times 1) + (0.1444 \times 0) + (0.5103 \times 1) + (0.4916 \times 0) + (0.1916 \times 1)]}} = 0.1357 = 13.57\%$$

95% confidence limits: $e^{b \pm 1.96 \times SE(b)}$
Example: b1 = 1.1073; SE(b1) = 0.1707
$$95\%\ CL = e^{1.1073 \pm 1.96 \times 0.1707} = (2.17,\ 4.22)$$
When expressing the OR for an increase of more than one unit of a continuous variable, multiply both the coefficient and the SE by the number of units to obtain the CLs. E.g., for an increase of 10 units:
$$95\%\ CL = e^{10 \times 1.1073 \pm 10 \times (1.96 \times 0.1707)}$$
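The two calculations above, the predicted probability for a given profile and the confidence limits of an odds ratio, can be sketched in a few lines. The coefficients are those of the Washington County ARIC table; the 0/1 coding of the profile is an assumption read off the slide's own calculation (each factor entering as an indicator), and the dictionary keys and function names are illustrative only.

```python
import math

# Coefficients from the Washington County ARIC table above.
coefficients = {
    "intercept": -4.5670,
    "male": 1.3106,
    "smoker": 0.7030,
    "age": 0.1444,
    "sbp": 0.5103,
    "cholesterol": 0.4916,
    "bmi": 0.1916,
}

# Assumed 0/1 coding for: male smoker, under 55, hypertensive,
# non-hypercholesterolemic, obese.
profile = {"male": 1, "smoker": 1, "age": 0,
           "sbp": 1, "cholesterol": 0, "bmi": 1}


def predicted_probability(coefs, x):
    """P = Odds / (1 + Odds), with log(Odds) = b0 + sum(b_j * x_j)."""
    log_odds = coefs["intercept"] + sum(coefs[k] * v for k, v in x.items())
    odds = math.exp(log_odds)
    return odds / (1 + odds)


def odds_ratio_limits(b, se, units=1, z=1.96):
    """95% confidence limits of the OR for an increase of `units` in x."""
    lower = math.exp(units * (b - z * se))
    upper = math.exp(units * (b + z * se))
    return lower, upper


p = predicted_probability(coefficients, profile)
lower, upper = odds_ratio_limits(1.1073, 0.1707)
print(f"P = {p:.4f}")
print(f"95% CL for the OR: ({lower:.2f}, {upper:.2f})")
```

Passing `units=10` to `odds_ratio_limits` applies the rule stated above for a 10-unit increase, multiplying both the coefficient and its standard error by 10 before exponentiating. The limits agree with the values quoted on the slide to within rounding.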
• Hypothesis testing (H0: b = 0)
  – Wald statistic: z-value = b/SE(b)
• Example: b = 1.2163, SE(b) = 0.1752
$$z = \frac{1.2163}{0.1752} = 6.94, \quad p < 0.05$$
  (Note that the square of this z value is the χ² statistic)
• Assumptions:
  1) Linearity in the log(odds) scale
     If not linear: use dummy variables or quadratic terms
  2) No multiplicative interaction
     E.g., the relative effect of x1 is independent of x2, or
     the effects of x1 and x2 are merely multiplicative (i.e., not “more, or less than, multiplicative”)
     Note:
     – This is the same assumption needed to calculate OR_MH
     – If there is interaction, product terms can be introduced in the model to account for it. Better still: do stratified analysis
  3) Observations are independent

Analytic Techniques for Assessment of Relationships Between Exposures (x) and Outcomes (y) - I

Type of study | Type of outcome (y) | Multivariate approach | Adjusted measure of association
--- | --- | --- | ---
Any | Continuous (e.g., BP) | ANOVA | Difference in means
Any | Continuous (e.g., BP) | Linear regression | Linear regression coefficient
Cross-sectional | Diseased/Non-diseased | Direct adjustment | Prevalence Rate Ratio
Cross-sectional | Diseased/Non-diseased | Indirect adjustment | Standardized Prevalence Ratio
Cross-sectional | Diseased/Non-diseased | Mantel-Haenszel | Prevalence Odds Ratio (OR)
Cross-sectional | Diseased/Non-diseased | Logistic regression | Prevalence OR
Case-control | Diseased/Non-diseased | Mantel-Haenszel | OR
Case-control | Diseased/Non-diseased | Logistic regression | OR

(Adapted from Szklo & Nieto, Aspen, 2000, p. 338)

Analytic Techniques for Assessment of Relationships Between Exposures (x) and Outcomes (y) - II

Type of study | Type of outcome (y) | Multivariate approach | Adjusted measure of association
--- | --- | --- | ---
Cohort | Cumulative incidence by the end of follow-up | Direct adjustment | Incidence Proportion Ratio
Cohort | Cumulative incidence by the end of follow-up | Indirect adjustment | SIR
Cohort | Cumulative incidence by the end of follow-up | Mantel-Haenszel | Probability OR
Cohort | Cumulative incidence by the end of follow-up | Logistic regression | Probability OR
Cohort | Cumulative incidence: time-to-event data | Cox Proportional Hazards Model | Hazard Rate Ratio
Cohort | Incidence rate per person-time | Mantel-Haenszel | Rate Ratio
Cohort | Incidence rate per person-time | Poisson regression | Rate Ratio
Nested case-control | Time-dependent disease status (time-to-event data taken into account by density sampling) | Conditional logistic regression | Hazard Rate Ratio
Case-cohort | Time-dependent disease status (time-to-event data) | Cox model with staggered entries | Hazard Rate Ratio

EPILOGUE:
• Stratification vs. Adjustment
  – Advantage of stratification: best way to understand the data, and examine the possibility of interaction.
  – Disadvantage of stratification: cumbersome if there is a large number of variables.
• If you use multiple regression models, do not let the data make a fool of you: Look at the data!!
  – Check the appropriateness of the model (Is it linear?)
  – Watch for outliers
• Consider the possibility of residual confounding.

Causes of Residual Confounding
• Variables missing in model
• Categories of the variables included in the model are too broad
• Confounding variables are misclassified
• Construct validity is not the same in groups under comparison

Residual Confounding: Relationship Between Natural Menopause and Prevalent CHD, ARIC Study, Ages 45-64 Years, 1987-89

Model | Odds Ratio (95% CI)
--- | ---
1. Crude | 4.54 (2.67, 7.85)
2. Adjusted for age: 45-54 vs. 55+ (Mantel-Haenszel) | 3.35 (1.60, 6.01)
3. Adjusted for age: 45-49, 50-54, 55-59, 60-64 (Mantel-Haenszel) | 3.04 (1.37, 6.11)
4. Adjusted for age: continuous (logistic regression) | 2.47 (1.31, 4.63)

EPILOGUE (Cont.)
• Statistical models and adjustment techniques can be used to explore causal pathways (intermediate variables).
• Statistical models as “tools for science” rather than “laws of nature”:
…Statistical models are sometimes misunderstood… Statistical models are never true. The question whether a model is true is irrelevant.
A more appropriate question is whether we obtain the correct scientific conclusion if we pretend that the process under study behaves according to a particular statistical model. (Zeger SL. Statistical reasoning in epidemiology. Am J Epidemiol 1991;134:1062-6)
