Using Survival Analysis for Diffusion Studies
Wynne W. Chin
C.T. Bauer College of Business, University of Houston

(Do not mistake this technique with "Survivor" analysis.)

Main problems in dealing with longitudinal data such as diffusion
1. Censored data
2. Time-varying explanatory variables

Censoring is the more common problem
- The predictor or explanatory variables are usually measured at a specific time point.
- But the dependent event of interest may not yet have occurred by the end of observation.

Example of problematic data
- A study of recidivism among, say, 500 inmates released from prison.
  – After 12 months we obtain data on whether and when any arrests occurred.
  – We also have potential explanatory variables such as age at release, education, race, and prior work experience.

Problem 1 - analyzing censored data
- There is nothing special about a one-year period as an end point; 6 months or 1.5 years may be just as good.
- If we create a dichotomous variable (1 = arrested, 0 = not arrested), we cannot sensibly run a regression, and we lose much information on either side of the one-year mark.
- If we measure the length of time before an arrest, we run into the censoring problem for people who have not been arrested after one year. This might be acceptable if the number of censored cases is small.

Problem 2 - time-varying predictors (assume no censored cases)
- What if individuals were interviewed monthly during a one-year follow-up?
- We obtain measures of income, marital status, and employment status each month and can see changes over time.
- Do we use 12 different income measures in a multiple regression?
- That wouldn't work for the person arrested during the first month, whose income drops to nil. If in prison, income is a consequence, not a cause, of recidivism.

What do we call this phenomenon and the associated techniques for evaluating it?
- Event history analysis: Allison, P.D. (1984). Event History Analysis: Regression for Longitudinal Event Data. Sage monograph No. 46.
- Survival analysis: Cox, D.R., & Oakes, D. (1984). Survival Analysis.
London: Chapman & Hall.

Background
- Survival and event history techniques were formed in part from demographic and medical research interested in analyzing survival data.
- For example, animals exposed to different doses of a toxic chemical are observed for how long until the "event" of death occurs. Censoring occurs when the experiment ends before all the animals die.
- In a different camp, engineers do "reliability" or "failure time" analysis.
- In the meantime, we social scientists were unaware and inappropriately running regressions, etc., until the mid-70s, when Tuma integrated Markov theory with explanatory variables into a continuous-time model.

Additional points to consider
- Distributional versus regression methods. Early work studied the distribution of time to an event, or time between events (i.e., life tables or Markov processes). Recent work links the occurrence of an event to explanatory variables through a linear function.
- Repeated versus non-repeated events. Deaths represent single, non-repeatable events; job changes or marriages can occur many times.
- Single versus multiple kinds of events. It is easy to treat all events as the same, yet job terminations are not all alike. IT usage, likewise, can be voluntary or involuntary. In evaluating cancer treatment effectiveness, we need to separate deaths due to cancer from deaths from other causes.
- Parametric versus non-parametric. Biostatisticians favor non-parametric methods; engineers and social scientists assume that time until an event comes from a specific distributional family (e.g., Weibull or Gompertz).
- Parametric versus non-parametric (continued). Cox (1972) provides a bridge between these two approaches via the proportional hazards model, described as semi- or partially parametric.
The regression model follows a specific form, but the distributional form of the event times does not.
- Discrete versus continuous time. Are event times assumed to be measured exactly, so that continuous-time methods apply? If event occurrence is measured in larger time units like months or years, consider it discrete. Continuous-time methods predominate across disciplines (e.g., sociology, engineering, biostatistics).

Discrete time example
- Single, unrepeated event.
- 200 newly minted male professors who begin their careers as assistant professors in graduate university departments.
- Observe them every year for 5 years.
- Event of interest: switching jobs.
- Although this is actually a repeatable event, we will treat leaving the first job as different from leaving later jobs.

Year    # changing jobs    # at risk    Estimated hazard rate
1       11                 200          .055
2       25                 189          .132
3       10                 164          .061
4       13                 154          .084
5       12                 141          .085
>5      (129 censored: no change observed)
Total   200 professors     848 person-years
(From Allison, 1984)

- Events are in discrete time, since we only know the year a job switch occurred.
- We don't know whether a switch was voluntary or involuntary, so we treat it as a single kind of event.
- 129 professors did not change jobs, hence we have censored data.
- Objective: estimate a regression model for the probability of a job change within a one-year period, based on 5 independent variables.

- 2 independent variables assumed constant:
  – Prestige of the department
  – Federal funds allocated to the institution for research
- 3 variables measured each year:
  – Cumulative # of published articles
  – # of citations made by other researchers
  – Academic rank (0 = assistant, 1 = associate)

Two key concepts: risk set and hazard rate
- The risk set is the set of individuals at risk of event occurrence at each point in time (e.g., all 200 professors during the first year).
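The hazard column of the table above is just (# changing jobs) divided by (# at risk), and the person-year total is the sum of the risk-set sizes. A quick check (the numbers come from the table; the code scaffolding is illustrative):

```python
# Discrete-time hazard estimates from the job-change table:
# hazard_t = (# changing jobs in year t) / (# at risk in year t).
table = {  # year: (# changing jobs, # at risk)
    1: (11, 200),
    2: (25, 189),
    3: (10, 164),
    4: (13, 154),
    5: (12, 141),
}
hazard = {yr: changed / at_risk for yr, (changed, at_risk) in table.items()}
# e.g. hazard[2] = 25 / 189, about 0.132

# The 848 in the "Total" row is the pooled number of person-years at risk:
person_years = sum(at_risk for _, at_risk in table.values())  # 848
```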
- The hazard rate (or simply hazard) is the probability that an event occurs at a particular time to a particular individual, given that the individual is still at risk at that time.
- The hazard rate is latent, but it is viewed as controlling both the occurrence and the timing of events.

Discrete time example (continued)
- Assume the hazard rate can vary over time but is the same for all individuals within each time period.
- At year 2, we have 25 jumpers out of 189 at risk, so the estimated hazard is 25/189 = 0.132.
- Looking back at the table, the hazard doesn't seem to change systematically with time. Note that the total number of jumps can decline over time while the hazard increases, since the risk set is also shrinking.

- For simplicity, assume 2 explanatory variables: one constant (X1) and one time-varying (X2).
- Start with P(t), the hazard: the probability that an individual has the event at time t.
- P(t) = a + b1X1 + b2X2(t) for t = 1, 2, ..., 5.
- Now apply a logit transformation to eliminate predicted probabilities greater than one or less than zero:
- log[P(t)/(1 - P(t))] = a + b1X1 + b2X2(t)
- The left side now ranges from minus to plus infinity, and b1 and b2 represent how a one-unit change in X1 or X2 affects the log-odds.

- The formula still has X2 as the only time-varying term.
- What about the possibility that the hazard rate changes systematically over time?
- Maybe individuals become more invested in a job over time, so the associated costs of moving increase (an inertia argument).
- You can allow for this with a time-varying intercept:
- log[P(t)/(1 - P(t))] = a(t) + b1X1 + b2X2(t)

Estimation
- Estimating this model requires creating person-time data units.
- Each individual contributes up to 5 person-years.
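A minimal sketch of this person-period ("person-year") expansion; the function and field names are illustrative, not from the original:

```python
# Expand one individual into person-year records. A person observed for
# `years_at_risk` years contributes one row per year; the outcome is coded 1
# only in the year the event (job change) occurred, 0 otherwise.
def expand_person(person_id, years_at_risk, changed_jobs):
    rows = []
    for year in range(1, years_at_risk + 1):
        event = 1 if (changed_jobs and year == years_at_risk) else 0
        # Time-varying covariates (pubs, cites, rank) would be attached here
        # with the values they held in this particular year.
        rows.append({"id": person_id, "year": year, "changed": event})
    return rows

# A professor who changed jobs in year 3 contributes 3 person-years:
mover = expand_person(1, years_at_risk=3, changed_jobs=True)
# A censored professor contributes 5 person-years, all coded 0:
stayer = expand_person(2, years_at_risk=5, changed_jobs=False)
```

Pooling such rows over all 200 professors yields the 848-case sample on which an ordinary ML logit can be run.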
- An individual who changed jobs at year 3 contributes 3 person-years' worth of cases.
- For each person-year, the person is coded 1 if they changed jobs that year, 0 otherwise.
- The explanatory variables are assigned the values they took on in each person-year.
- In our example, we pool the 848 person-years into a single sample and run a maximum-likelihood logit analysis.

Two problems are solved by this procedure
- Individuals whose time to first job change is censored contribute exactly what is known about them (i.e., that they didn't change jobs in any of the five years of observation).
- Time-varying explanatory variables are easily included, because each year at risk is a distinct observation.

                       Model 1              Model 2
Explanatory variable   b         t          b         t
Prestige of dept.       0.045    -0.21       0.56      0.26
Funding                -0.077    -2.45*     -0.078    -2.47*
Pubs                   -0.021    -0.75      -0.023    -0.79
Cites                   0.0072    2.44*      0.0069    2.33*
Rank                   -1.4      -2.86**    -1.6      -3.12**
Yr1 (D)                                     -0.96     -2.11*
Yr2 (D)                                     -0.025    -0.06
Yr3 (D)                                     -0.74     -1.60
Yr4 (D)                                     -0.18     -0.42
Constant                4.95
Log-likelihood       -230.95              -226.25

- Model 2 allows the hazard rate to change in each of the 5 years, via 4 dummy variables.
- The dummies are interpreted relative to the log-odds of the 5th year.
- No clear time pattern is found.
- Check by examining twice the difference in the log-likelihoods, which is 9.4 on 4 degrees of freedom. From a chi-square table, this falls just short of significance at the 0.05 level.

Problems with the discrete approach
- With a large sample and finer discrete time units, the data set can become unwieldy.
- In our example, switching to person-months would yield a sample of almost 10,000 cases.
- One workaround is log-linear analysis if all explanatory variables are categorical (computation then depends on the number of cells in the contingency table).
- Another is to use OLS instead of ML logit.
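The likelihood-ratio check just described can be reproduced. Since a statistics library may not be at hand, this sketch uses the closed-form chi-square tail probability, which holds for even degrees of freedom:

```python
import math

# LR statistic: twice the difference in log-likelihoods (Model 2 vs Model 1).
lr = 2 * (-226.25 - (-230.95))  # 9.4, with 4 degrees of freedom

# Chi-square survival function, valid for even df:
# P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
def chi2_sf_even(x, df):
    assert df % 2 == 0
    return math.exp(-x / 2) * sum((x / 2) ** k / math.factorial(k)
                                  for k in range(df // 2))

p_value = chi2_sf_even(lr, df=4)  # about 0.052: just misses the 0.05 cutoff
```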
Results from the discrete approach
- The discrete-time approach described here will virtually always give results similar to continuous-time methods.
- As the time units get smaller, the model and its associated equation converge to the proportional hazards model discussed next.
- The choice depends on computational cost and convenience.
- Choose continuous time if there are no time-varying explanatory variables, since it doesn't require dividing each person's observation period into distinct units.
  – Otherwise, relative costs and convenience are comparable.

Proportional hazards models
- Hazard: h(t) = lim_{s→0} P(t ≤ T < t + s | T ≥ t) / s
- This is not really a probability, since it has no upper bound; it is an instantaneous rate.
- The expected length of time until an event occurs is 1/h(t). So h(t) = 1.25 implies the event is expected in 0.80 time units.
- Think of the hazard in terms of two people: if the first person's hazard is 0.5 and the second's is 1.5, the second person's risk of the event is 3 times as high.

- We almost always view the hazard rate as a function of time (e.g., time since the last event, or the age of the person).
- The hazard of arrest decreases after age 25; the hazard of retirement increases with time.
- The hazard can also be U-shaped: the hazard of death is high right after birth, falls during the early years, and begins to rise again in late middle age.
- Thus, the choice of hazard-rate function is one of the key differences among continuous-time methods.

Parametric proportional hazards models
- We need to specify h(t) as a function of time and the explanatory variables.
- One approach is a linear function, but take logs to ensure h(t) cannot be negative:
- log h(t) = a + b1X1 + b2X2
- This is the exponential model: the hazard is constant over time.
- But we may want the hazard to increase or decrease with time, to conform to events like job switching (decreases due to job investment) or death (increases due to aging).
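Under the exponential specification above the hazard is flat in t, and the earlier arithmetic (expected wait = 1/h, relative risk as a ratio of hazards) follows directly. A minimal sketch; the coefficients are made up, not estimated:

```python
import math

# Exponential model: log h(t) = a + b1*x1 + b2*x2 -- no time term, so the
# hazard is constant over time. Illustrative coefficients only.
a, b1, b2 = -1.0, 0.4, 0.3

def hazard(x1, x2):
    return math.exp(a + b1 * x1 + b2 * x2)

# Expected waiting time is the reciprocal of the hazard:
h = 1.25
expected_wait = 1 / h       # 0.8 time units until the event, on average

# Relative risk: a person with hazard 1.5 versus one with hazard 0.5
relative_risk = 1.5 / 0.5   # the event is 3 times as likely per unit time
```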
- log h(t) = a + b1X1 + b2X2 + ct
- This is called the Gompertz regression model, since it yields a Gompertz distribution for the time until event occurrence. Note that c can be positive or negative.
- If we instead let the log hazard change linearly with the log of time, we get the Weibull distribution for the time until the next event:
- log h(t) = a + b1X1 + b2X2 + c log t, with c constrained to be > -1.

- All three models differ in how time enters.
- Weibull and Gompertz require different estimation procedures, and neither allows U-shaped or inverted-U-shaped hazards.
- None of them allows for a random disturbance term; the randomness lies in the relation between the latent h(t) and the observed length of the time interval.
- The choice among models depends on substantive knowledge, theory, mathematical convenience, and empirical evidence.

Cox's proportional hazards model
- Up to now, we needed to specify how the hazard rate depends on time.
- This is difficult if the hazard is non-monotonic.
- The previous models also do not allow for explanatory variables that change over time.
- David Cox solved the problem in 1972.
- Referred to simply as the "proportional hazards model," Cox's model is a generalization of the previous parametric models.

- Start without time-varying explanatory variables, using two time-constant variables:
- log h(t) = a(t) + b1X1 + b2X2, with a(t) being any function of time.
- Because this function need not be specified, the model is considered partially parametric, or semi-parametric.
- It is called proportional hazards because, for any two individuals at any point in time, the ratio of their hazards is a constant.
- In other words, for persons j and k at any time t: h_j(t)/h_k(t) = c.
- Estimation is done by a method called partial likelihood, which separates the likelihood into two parts.
- The first part contains information only about b1 and b2.
- The other part contains information on b1, b2, and a(t).
- Discard the second part and treat the first as an ordinary likelihood function. The first part depends on the order in which events occur, not on the exact times of occurrence.

- The estimates are asymptotically unbiased and normally distributed.
- A model with 2 explanatory variables (1 constant, 1 time-varying) looks as follows:
- log h(t) = a(t) + b1X1 + b2X2(t), with a(t) being any function of time.
- You can also lag the time-varying variable if you believe there is a delay of, say, 2 months:
- log h(t) = a(t) + b1X1 + b2X2(t - 2)

- Estimation uses the same partial likelihood approach, but the algorithm for maximizing the likelihood function is more complex.
- CPU time increases by roughly a factor of 10 when one time-varying explanatory variable is included.

Example: recidivism
- Financial aid (D)
- Age at release
- Black (D)
- Work experience (D)
- Married (D)
- Paroled (D)
- Prior arrests
- Age at earliest arrest
- Education
- Weeks worked – number of weeks employed during the first 3 months after release
- Worked (D) – employment status in each time period (time-varying)

                          Exponential          Proportional hazards   Time-dependent
                                                                      proportional hazards
Explanatory variable      b         t          b         t            b         t
Financial aid (D)        -0.325    -1.69      -0.337    -1.76        -0.333    -1.74
Age at release           -0.067    -2.89**    -0.069    -2.94**      -0.064    -2.78**
Black (D)                 0.280     0.90       0.286     0.92         0.354     1.13
Work experience (D)      -0.117    -0.53      -0.122    -0.55        -0.012    -0.06
Married (D)              -0.414    -1.08      -0.426    -1.11        -0.334    -0.87
Paroled (D)              -0.037    -0.19      -0.035    -0.18        -0.075    -0.38
Prior arrests             0.095     3.21**     0.101     3.36**       0.100     3.31**
Age at earliest arrest    0.080     2.3*       0.071     2.35*        0.077     2.48*
Education                -0.263    -1.96*     -0.264    -1.96*       -0.293    -2.12*
Weeks worked             -0.039    -1.76      -0.039    -1.78
Worked (D)                                                           -1.392    -5.65**
Constant                 -3.869

Interpretation of coefficients
- The -0.067 for age at release means that each additional year of age reduces the log of the hazard by 0.067, controlling for the other variables.
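That log-hazard reading, and the exponentiated readings that follow, amount to simple arithmetic on the coefficients from the table:

```python
import math

# Each additional year of age at release multiplies the hazard by exp(b):
age_multiplier = math.exp(-0.067)     # about 0.935, i.e. ~6.5% lower hazard

# 100 * [exp(b) - 1] gives the percent change in the hazard per unit of X:
pct_per_prior_arrest = 100 * (math.exp(0.095) - 1)   # about +10% per arrest

# The time-varying employment dummy from the last model: being employed
# multiplies the arrest hazard by roughly one quarter.
worked_multiplier = math.exp(-1.392)  # about 0.25
```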
- Or raise e (i.e., 2.718) to the power b: for each one-unit increase in an explanatory variable, the hazard is multiplied by exp(b).
- Or compute 100 × [exp(b) - 1], which gives the percentage change in the hazard per one-unit change in the explanatory variable.
- So the 0.095 for prior arrests gives exp(0.095) = 1.10, a 10 percent increase in the hazard for each additional prior arrest.

Differences in the time-dependent proportional hazards model
- Note the big difference made by dropping weeks employed during the first three months after release.
- It was likely a surrogate for a time-varying explanatory variable.
- It is replaced with a dummy variable indicating employment status in each time period. While the results for the other variables are essentially the same, employment status now emerges as the most important predictor: exponentiating -1.392 yields 0.25.

The End?