
SURVIVAL ANALYSIS

Introduction

Survival analysis is concerned with modelling the time taken until a particular event occurs. It arises in many different areas, e.g.

- Medicine: time to death or recovery of patients with a particular illness
- Epidemiology: time to contraction of a disease by individuals at risk
- Economics: time to gaining employment for people out of work
- Industrial: time to failure of an electrical component
- Financial: time to defaulting on a mortgage for a customer

Period of study

Usually data are gathered over a fixed period of time. Within that time individuals will enter the study at different times and will be observed until one of the following happens:

- Failure (or equivalent) occurs
- The study is terminated
- The individual is withdrawn from the study for some unrelated reason

Covariates

In many studies there will be certain factors that may influence the time to failure. These are known as covariates, and the purpose of the study is then to determine which of them affect survival times and in what way. Examples are:

- Different treatments
- Intrinsic properties of the individuals, e.g. sex, age
- Extraneous variables such as environmental or economic conditions

Some of these variables may be constant, e.g. the treatment the individual is receiving; others may be time-dependent, e.g. weight, cumulative exposure to a risk factor, interest rates.

Censoring

There are three important requirements for a survival time:

- A clearly defined time origin (t = 0)
- A clearly defined time scale
- A clear definition of failure

In most cases only a certain time period is considered, during which not all individuals will experience failure. This raises the problem of incomplete survival times and leads to the definition of censoring. For each individual either the time to failure or the time to censoring is noted. For those observations that are censored, because of either the study terminating or withdrawal, it is known that the time to failure must be in excess of the censored time.
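The rule for recording either a failure time or a censoring time can be sketched in code. The following is a minimal Python illustration and not part of the original notes: the function name `observed_time`, its argument names and any dates passed to it are hypothetical. It returns the observed time in days together with a status flag, using 1 for an observed failure and 0 for a censored time.

```python
from datetime import date


def observed_time(entry, study_end, failure=None, withdrawal=None):
    """Return (time_in_days, status) for one individual.

    status is 1 if the failure was observed and 0 if the time is censored.
    The function and argument names are illustrative assumptions.
    """
    # Observation stops at the earliest of failure, withdrawal and study end.
    candidates = [(study_end, 0)]
    if withdrawal is not None:
        candidates.append((withdrawal, 0))
    if failure is not None:
        candidates.append((failure, 1))
    # On a tie in dates, an observed failure takes precedence over censoring.
    end, status = min(candidates, key=lambda c: (c[0], -c[1]))
    return (end - entry).days, status
```

For example, with hypothetical dates, an individual who enters on 1 January 1990 and fails on 1 January 1991 in a study ending on 31 December 1992 contributes an uncensored time of 365 days, whereas one who is withdrawn after 30 days contributes a censored time of 30 days.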
The concept of censoring can be demonstrated by considering a hypothetical trial of a treatment program for people at high risk of a heart attack, where the initiating event is entry into the trial and the failure event is a heart attack.

[Figure: follow-up periods for patients 1 to 5 plotted against calendar year, 1986 to 1992]

- Patients 1 and 5 entered the trial at different points in calendar time and both experienced the failure event (i.e. heart attack) before the end of the trial.
- Patient 3 was censored due to possible effects of the treatment, or was withdrawn from the trial for some unknown reason, such as relocation or death due to unrelated factors (e.g. a car accident).
- Patients 2 and 4 were censored due to the end of the trial or observation period.

Survival Times

Example 1

You are interested in replacement hip joints and for each patient you have the following dates:

- date of birth
- date of first joining waiting list for hip replacement
- date of operation (if it has happened yet)
- date when the hip replacement failed (if it did)
- date of death (if patient has died)
- date when the information was collected or last updated

How long will a replacement hip last?

- When will your zero time be?
- Which patients will have a measured time?
- Which patients will have a censored time? And how will it be calculated?
- What population would you like to have?

How long is the wait for a new hip?

- When will your zero time be?
- Which patients will have a measured time?
- Which patients will have a censored time? And how will it be calculated?
- What sample should you have?

How does the hazard of needing a new hip change with age?

- When will your zero time be?
- Which patients will have a measured time?
- Which patients will have a censored time? And how will it be calculated?
- What sample should you have?

Example 2

You are a car insurance firm interested in how long you retain your customers.
You have:

- date of starting current or most recent insurance
- date of last renewal or cancellation (if cancelled)
- date of death (if customer has died)
- today's date

What factors affect customer retention?

- When will your zero time be?
- Which customers will have a measured time?
- Which customers will have a censored time? And how will it be calculated?
- What sample should you have?

Some customers are being offered incentives to stay loyal. How can you evaluate this?

- When will your zero time be?
- Which customers will have a measured time?
- Which customers will have a censored time? And how will it be calculated?
- What sample should you have?

Distributions of Survival Time

Consider the distribution of the non-negative random variable T representing survival time. Usually, T is assumed to be continuous. The probability density function is

$$f(t) = \lim_{\delta t \to 0} \frac{P(t \le T < t + \delta t)}{\delta t}$$

This is a measure of the probability of failure at time t. The cumulative distribution function gives the probability of failure at or before time t:

$$F(t) = P(T \le t) = \int_0^t f(u)\,du$$

However, since we are more interested in the proportion of survivors, the function most commonly used is the survival function

$$S(t) = P(T > t) = 1 - F(t)$$

Finally, in studying survival we are often concerned with the risk of an individual dying given that they have survived so far. This quantity is known as the hazard function and is defined as

$$h(t) = \frac{f(t)}{S(t)}$$

The notation $\lambda(t)$ is also often used to describe a hazard. Comparisons between groups of individuals are often best made by comparing their respective hazard functions.

If we had complete survival data for all members of a population then there would be no difference between a survival time and any other measurement that we might make, e.g. height or weight. However, unless we have complete survival data (i.e. everyone dies or has a particular failure event), we cannot construct a histogram to represent a survival distribution and we cannot calculate the mean survival. Special methods are therefore needed to describe the data when some observations may be censored.

Life Tables

We can use LIFE TABLE methods to estimate S(t), the proportion surviving, up to the longest survival time observed, with censored data. Generally, with survival data it is the median rather than the mean that is used for the average time. Why? If we have enough survival data, the life table can give us an estimate of the median survival, i.e. the time at which 50% survive.

SAS PROC LIFETEST can be used to calculate the life table estimates of survival.

Kaplan-Meier Estimator

A non-parametric estimate of the survival function S(t) can be obtained directly from the observed survival times. It has the form of a step function, stepping down at each death. When some of the data are censored at earlier times, as is often the case, we need to use a method called the product-limit or Kaplan-Meier estimator to calculate the survival function (Kaplan and Meier, 1958). The estimate takes the following form:

$$\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \hat{h}(t_i)\right)$$

The method works by estimating the proportion of those alive at the start of each interval that survive from the beginning of one period to the next, and then calculating the overall survival as the product of these probabilities.

Estimating the hazard

At time $t_i$ we have $f_i$, the number of individuals experiencing failure at $t_i$, and $r_i$, the number of individuals who have not failed or been censored before time $t_i$ (the risk set). Assume that no explanatory variables are operating, so that all individuals independently have a probability $h(t_i)$ of dying at time $t_i$.
Then the number of failures given the size of the risk set has a binomial distribution

$$f_i \mid (R_i = r_i) \sim B\left(r_i, h(t_i)\right)$$

It can then be shown that:

$$\hat{h}(t_i) = \frac{f_i}{r_i}$$

Example

A * in the following table indicates a censored time.

    time     number at    number        hazard     1 - hazard   product-limit estimator
             risk, r_i    failing, f_i  f_i/r_i                 of survival function
    0        13           0             0          1            1
    13       13           1             0.0769     0.9231       0.9231
    18       12           1             0.0833     0.9167       0.8462
    23       11           1
    70*      10           0
    76       9            1
    180      8            1
    195*     7            0
    210      6            1
    632      5            1
    700      4            1
    1296     3            1
    1990*    2            0
    2240*    1            0

Standard Errors for the Survival Function

There are several different formulae for the standard error, of which the most commonly used is Greenwood's formula, derived from the binomial distribution. Writing $p_i$ for the estimate of $S(t_i)$:

$$\text{s.e.}(p_i) = p_i \sqrt{\sum_{j \le i} \frac{f_j}{r_j (r_j - f_j)}}$$

Plotting the Survival Function

The survival function is usually illustrated graphically as shown below.

Lifetables on SAS

Use PROC LIFETEST on SAS for calculating lifetables. When working from data files, we need a survival time and a censoring indicator (usually set to 0 if the person/item has been censored, and 1 otherwise).

    data example1;
      input time status;
      datalines;
    13 1
    18 1
    23 1
    70 0
    76 1
    180 1
    195 0
    210 1
    632 1
    700 1
    1296 1
    1990 0
    2240 0
    ;
    run;

    proc lifetest data=example1 plots=(s);
      time time*status(0);
    run;

Output

The above code gives the survival plot already shown and the following output.

    The LIFETEST Procedure
    Product-Limit Survival Estimates

                                      Survival
                                      Standard    Number    Number
        time    Survival    Failure   Error       Failed    Left
        0.00    1.0000      0         0           0         13
       13.00    0.9231      0.0769    0.0739      1         12
       18.00    0.8462      0.1538    0.1001      2         11
       23.00    0.7692      0.2308    0.1169      3         10
       70.00*   .           .         .           3         9
       76.00    0.6838      0.3162    0.1315      4         8
      180.00    0.5983      0.4017    0.1401      5         7
      195.00*   .           .         .           5         6
      210.00    0.4986      0.5014    0.1480      6         5
      632.00    0.3989      0.6011    0.1483      7         4
      700.00    0.2991      0.7009    0.1408      8         3
     1296.00    0.1994      0.8006    0.1242      9         2
     1990.00*   .           .         .           9         1
     2240.00*   .           .         .           9         0

    NOTE: The marked survival times are censored observations.
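The product-limit estimates and Greenwood standard errors in the listing above can be checked by hand. The following is a minimal Python sketch of that calculation, not part of the original notes and not how SAS itself is implemented; the function name `kaplan_meier` and the variable names are illustrative assumptions. It applies the hazard estimate f_i/r_i, the product-limit formula and Greenwood's formula to the thirteen observations of the example.

```python
from math import sqrt

# (time, status) pairs from the example; status 0 marks a censored time.
example1 = [(13, 1), (18, 1), (23, 1), (70, 0), (76, 1), (180, 1), (195, 0),
            (210, 1), (632, 1), (700, 1), (1296, 1), (1990, 0), (2240, 0)]


def kaplan_meier(data):
    """Return (time, survival, standard_error) at each distinct failure time."""
    rows = []
    survival = 1.0
    greenwood_sum = 0.0
    for t in sorted({time for time, status in data if status == 1}):
        # r: number still at risk just before t; f: number failing at t.
        r = sum(1 for time, _ in data if time >= t)
        f = sum(1 for time, status in data if time == t and status == 1)
        survival *= 1 - f / r          # multiply by (1 - estimated hazard)
        if r > f:                      # avoid dividing by zero when all fail
            greenwood_sum += f / (r * (r - f))
        rows.append((t, survival, survival * sqrt(greenwood_sum)))
    return rows


for t, s, se in kaplan_meier(example1):
    print(f"{t:8.2f}  {s:.4f}  {se:.4f}")
```

The estimated median survival time is the first failure time at which the survival estimate falls to 0.5 or below, which for these data is 210.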
    Summary Statistics for Time Variable time

    Quartile Estimates

                 Point       95% Confidence Interval
    Percent      Estimate    [Lower       Upper)
    75           1296.00     210.00       .
    50            210.00      76.00       1296.00
    25             76.00      18.00       632.00

    Mean        Standard Error
    567.49      165.65

    NOTE: The mean survival time and its standard error were underestimated because the largest observation was censored and the estimation was restricted to the largest event time.

    Summary of the Number of Censored and Uncensored Values

                                      Percent
    Total     Failed     Censored     Censored
    13        9          4            30.77

Comparing the Survival of Two or More Groups

The following data give survival times for 12 patients diagnosed with brain tumours. The patients were randomly assigned to receive radiation therapy or radiation plus chemotherapy. One year after the start of the study the survival times in weeks were:

    Group 1: Radiotherapy only                 10  26  28  30   41   12*
    Group 2: Radiotherapy plus chemotherapy    24  30  42  15*  40*  42*

Using SAS the following plot was obtained. The upper line is for group 2.

At the beginning the survival curves are close together, suggesting that there is little difference between the two treatments with respect to survival. After about a year there is an indication of a difference between the two groups, with those in group 2 having a higher chance of survival than those in group 1.

Log rank test

This test can be used to formally ascertain whether or not there are differences between two groups. The test defines intervals on the time scale corresponding to the observed survival times. For each time interval the number of observed events for each group is compared to the number expected if there were no differences between the groups. A chi-squared statistic is calculated from this information and compared to the $\chi^2$ distribution with 1 d.f.

Hypotheses

For the log-rank test the null hypothesis is that the survival functions for the two groups are the same and the alternative is that, for some value of t, they differ.
Formally:

$$H_0: S_1(t) = S_2(t) \qquad H_1: S_1(t) \ne S_2(t)$$

Results

The result of carrying out the log rank test for these data on SAS is:

    Test        Chi-Square    DF    Pr > Chi-Square
    Log-Rank    2.8823        1     0.0896

Since the p-value of 0.0896 exceeds 0.05, there is no evidence of a significant difference between the two groups at the 5% level.

Note:

- The test is based on the assumption that any difference between the two groups is not dependent on time. This is known as the proportional hazards assumption. The test is not reliable if this assumption is not valid.
- The test can be extended to compare more than 2 groups. The degrees of freedom will be the number of groups minus 1.
- For small data sets the test is not very powerful.
- More information on the log rank test is given in the Appendix.

Tutorial 1

Note: Text copies of some of the data files are given on the webpage. Solutions to the tutorial can also be found there.

1. The following data give the survival times, in years, after diagnosis with parathyroid cancer. A * indicates that the observation is censored.

    1  1*  1*  2*  2*  3  5*  6*  7  7  8  9*  10  10  10*  11  14  17*

Construct a Kaplan-Meier survival plot for the data by hand and hence estimate the median survival time after diagnosis.

2. The following data set contains survival times for 25 cancer patients. The patients were randomly assigned to one of two drug treatments, TREAT = 1 or 2. DUR is the time in days from treatment until death or censoring. STATUS has a value of 1 if the patient died and 0 if the observation is censored.

    ID    DUR     STATUS    TREAT
    1     8       1         1
    2     180     1         2
    3     632     1         2
    4     852     0         1
    5     52      1         1
    6     2240    0         2
    7     220     1         1
    8     63      1         1
    9     195     1         2
    10    76      1         2
    11    70      1         2
    12    8       1         1
    13    13      1         2
    14    1990    0         2
    15    1976    0         1
    16    18      1         2
    17    700     1         2
    18    1296    0         1
    19    1460    0         1
    20    210     1         2
    21    63      1         1
    22    1328    0         1
    23    1296    1         2
    24    365     0         1
    25    23      1         2

a) Enter the data into SAS.

b) Obtain an estimate of the survival function, ignoring the different treatments, using the following commands. What is the median survival time? What can you say about the mean survival time?
    proc lifetest data=filename plots=(s);
      time dur*status(0);
    run;

c) To investigate whether the different treatments lead to different survival times, run the following program and comment on the results.

    proc lifetest data=filename plots=(s);
      time dur*status(0);
      strata treat;
    run;

3. Twenty volunteers, 10 of normal weight and 10 overweight, underwent an exercise stress test. In the test they had to lift a progressively increasing load for up to 12 minutes but were allowed to stop earlier if they had reached their limit. On two occasions the equipment failed before the end of the test. The times for the two groups were:

    Normal weight    4   10  12*  2  8  12*  8*  6  9    12*
    Overweight       7*  5   11   6  3  9    4   1  12*  7

Enter the data into SAS and obtain the survival plots and the log-rank test using the LIFETEST procedure. Hence comment on the effect of being overweight on the results of the stress test.

4. The data below give the remission times (in weeks) of two groups of patients with leukaemia. The patients were randomly allocated to a new treatment or to a control group. An asterisk indicates that the observation is censored.

    Drug 6-MP    6*  6  6  6  7  9*  10*  10  11*  13  16  17*  19*  20*  22  23  25*  32*  32*  34*  35*
    Control      1   1  2  2  3  4   4    5   5    8   8   8    8    11   11  12  12   15   17   22   23

Using SAS, obtain a plot of the Kaplan-Meier survival function for the data showing separate plots for each treatment. Test for differences between the two survival distributions and comment on the nature of any difference that exists.

5. A longitudinal survey conducted in America included information on times to divorce for married couples.
The SAS data set marriage, which can be found on the Medical Statistics web page, contains the following variables:

- heduc: education of the husband, coded 0 (< 12 years), 1 (12-15 years), 2 (> 15 years)
- heblack: coded 1 if the husband is black, 0 otherwise
- mixed: coded 1 if the marriage is of mixed ethnicity, 0 otherwise
- years: duration of the marriage from date of wedding to divorce, widowhood or end of study
- divorce: coded 1 for divorce or 0 for censoring

(a) Use proc freq to ascertain the number of marriages which ended in divorce.

(b) Carry out chi-squared tests to investigate whether the number of marriages ending in divorce is influenced by any of the three covariates heduc, heblack and mixed.

(c) Obtain survival plots and carry out the log-rank test for each of these variables and comment on the nature of any differences found.

Appendix 1: Handling dates in SAS

In practice data sets will often be given in the form of dates rather than times. Try the following to find out some of the facilities for handling dates in SAS.

Example 1

SAS records dates as the number of days from a reference date. The following should illustrate this.

    data test;
      datezero=0;
      dateone=1;
      datek=1000;
    run;

    proc print data=test;
    run;

To find out what the reference date is:

    proc print data=test;
      format date: ddmmyy8.;
    run;

Example 2

The following shows some of the ways dates can be entered and ways in which they can be printed. They will always be stored as the number of days from the reference date. Note that this allows lengths of time, e.g. survival times, to be calculated.

    data new;
      input date1 date7. / date2 date9. / date3 ddmmyy8. / date4 ddmmyy12.;
      survival=date3-date1;
      datalines;
    18Jan99
    4Apr2003
    18/12/02
    20/05/2001
    ;
    run;

    proc print data=new;
    run;

    proc print data=new;
      format date: date9.;
    run;

Example 3

Finally, just in case you can't remember what day today is!
    data day;
      date=today();
      dayoweek=weekday(date);
    run;

    proc format;
      value wkday 1='Sun' 2='Mon' 3='Tues' 4='Wed' 5='Thurs' 6='Fri' 7='Sat';
    run;

    proc print data=day;
      format date ddmmyy10.;
      format dayoweek wkday.;
    run;

Appendix 2: Log Rank Test

The test compares the observed number of deaths at each point in time with the expected number of deaths if the null hypothesis is true. The expected numbers of deaths at each time point are calculated by sharing the number of deaths at that point between the groups in proportion to the numbers at risk in each group.

E.g. using the brain tumour data from the section on comparing the survival of two groups as an example: to calculate the expected number of deaths at time t = 10, assuming that H0 is true, note that there are 6 at risk in each group at that time and 1 death, so we would expect 1 x 6/12 of the deaths to be in the first group and 1 x 6/12 to be in the second group.

            Number at risk      Number censored     Observed number     Expected number
                                                    of deaths           of deaths
    Time    r1i   r2i   ri      c1i   c2i   ci      d1i   d2i   di      e1i     e2i
    10      6     6     12      0     0     0       1     0     1       1/2     1/2
    12      5     6     11      1     0     1       0     0     0       0       0
    15      4     6     10      0     1     1       0     0     0       0       0
    24      4     5     9       0     0     0       0     1     1       4/9     5/9
    26      4     4     8       0     0     0       1     0     1       1/2     1/2
    28      3     4     7       0     0     0       1     0     1       3/7     4/7
    30      2     4     6       0     0     0       1     1     2       2/3     4/3
    40      1     3     4       0     1     1       0     0     0       0       0
    41      1     2     3       0     0     0       1     0     1       1/3     2/3
    42      0     2     2       0     1     1       0     1     1       0       1
    Total                                           5     3             2.87    5.13

The log-rank statistic is:

$$\sum_{j} \frac{(O_j - E_j)^2}{E_j}$$

where, for this example, $O_1 = 5$, $O_2 = 3$, $E_1 = 2.87$ and $E_2 = 5.13$.

Thus the log rank statistic is $(O_1 - E_1)^2/E_1 + (O_2 - E_2)^2/E_2 = 2.46$. This is compared with the $\chi^2$ distribution with 1 degree of freedom. Since 2.46 < 3.84 there is no evidence of a difference in survivor functions at the 5% significance level.

The value obtained here is slightly different from that given by SAS, which calculates the statistic in a different way. This doesn't affect the conclusions.
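The hand calculation in this appendix can be reproduced with a short program. The following is a minimal Python sketch, not part of the original notes; the function name `logrank_simple` and the variable names are illustrative assumptions. It uses the brain tumour survival times given earlier (group 1: radiotherapy only, group 2: radiotherapy plus chemotherapy) and the simple statistic of this appendix, the sum of (O - E)^2 / E over the two groups, which is not the calculation SAS performs.

```python
# (time in weeks, status) pairs; status 0 marks a censored time.
group1 = [(10, 1), (26, 1), (28, 1), (30, 1), (41, 1), (12, 0)]
group2 = [(24, 1), (30, 1), (42, 1), (15, 0), (40, 0), (42, 0)]


def logrank_simple(g1, g2):
    """Return (O1, O2, E1, E2, statistic) for two groups of (time, status)."""
    o1 = o2 = 0
    e1 = e2 = 0.0
    # Only time points with at least one death contribute to O and E.
    for t in sorted({time for time, status in g1 + g2 if status == 1}):
        r1 = sum(1 for time, _ in g1 if time >= t)   # at risk in group 1
        r2 = sum(1 for time, _ in g2 if time >= t)   # at risk in group 2
        d1 = sum(1 for time, status in g1 if time == t and status == 1)
        d2 = sum(1 for time, status in g2 if time == t and status == 1)
        d = d1 + d2
        o1 += d1
        o2 += d2
        # Share the deaths in proportion to the numbers at risk.
        e1 += d * r1 / (r1 + r2)
        e2 += d * r2 / (r1 + r2)
    statistic = (o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2
    return o1, o2, e1, e2, statistic


o1, o2, e1, e2, stat = logrank_simple(group1, group2)
print(o1, o2, round(e1, 2), round(e2, 2), round(stat, 2))
```

Rows of the table at which only censoring occurs (times 12, 15 and 40) add nothing to the observed or expected totals, so the loop runs over death times only.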

