					SURVIVAL ANALYSIS

Introduction
Survival analysis is concerned with modelling the time until a particular event occurs. It
arises in many different areas, e.g.
       Medicine: time to death or recovery of patients with a particular illness
       Epidemiology: time to contraction of a disease by individuals at risk
       Economics: time to gaining employment for people out of work
       Industrial: time to failure of an electrical component
       Financial: time to defaulting on a mortgage for a customer

Period of study
Usually data are gathered over a fixed period of time. Within that time individuals will
enter the study at different times and will be observed until either:
       Failure (or equivalent) occurs
       The study is terminated
       The individual is withdrawn from the study for some non-related reason

Covariates
In many studies there will be certain factors that may influence the time to failure. The
purpose of the study is then to determine which factors affect survival times and in
what way.

These are known as covariates, e.g.
      Different treatments
      Intrinsic properties of the individuals, e.g. sex, age
      Extraneous variables such as environmental or economic conditions

Some of these variables may be constant, e.g. the treatment the individual is receiving;
others may be time-dependent, e.g. weight, cumulative exposure to a risk factor or
interest rates.

Censoring
A survival time is only well defined if three requirements are met:

      A clearly defined time origin (t=0)
      A clearly defined time scale, and
      A clear definition of failure

In most cases only a fixed time period is considered, during which not all individuals
will experience failure. This raises the problem of incomplete survival times and leads
to the idea of censoring. For each individual either the time to failure or the time to
censoring is recorded. For observations that are censored, because the study ended or
the individual was withdrawn, all that is known is that the time to failure exceeds
the censoring time.




The concept of censoring can be demonstrated by considering a hypothetical trial of
a treatment program for people at high risk of a heart attack, where the initiating event
is entry into the trial and the failure event is a heart attack.



[Figure: follow-up of patients 1 to 5 (vertical axis) plotted against calendar year, 1986 to 1992 (horizontal axis).]




Patients 1 and 5 entered the trial at different points in calendar time and both
experienced the failure event (i.e. heart attack) before the end of the trial.
Patient 3 was censored before the end of the trial, either because of possible effects
of the treatment or through withdrawal for some other reason, such as relocation or
death due to unrelated factors (e.g. a car accident).
Patients 2 and 4 were censored due to the end of the trial or observation period.

Survival Times

Example 1

You are interested in replacement hip joints and for each patient you have the
following dates:
      –date of birth
      –date of first joining waiting list for hip replacement
      –date of operation (if it has happened yet)
      –date when the hip replacement failed (if it did)
      –date of death (if patient has died)
      –date when the information was collected or last updated

How long will a replacement hip last?
   When will your zero time be?
   Which patients will have a measured time?
   Which patients will have a censored time? And how will it be calculated?
   What population would you like to have?

How long is the wait for a new hip?
   When will your zero time be?
   Which patients will have a measured time?
   Which patients will have a censored time? And how will it be calculated?
   What sample should you have?



How does the hazard of needing a new hip change with age?
   When will your zero time be?
   Which patients will have a measured time?
   Which patients will have a censored time? And how will it be calculated?
   What sample should you have?

Example 2

You are a car insurance firm interested in how long you retain your customers. You
have:
      –date of starting current or most recent insurance
      –date of last renewal or cancellation (if cancelled)
      –date of death (if customer has died)
      –today’s date
What factors affect customer retention?
   When will your zero time be?
   Which customers will have a measured time?
   Which customers will have a censored time? And how will it be calculated?
   What sample should you have?

Some customers are being offered incentives to stay loyal. How can you evaluate
this?
     When will your zero time be?
     Which customers will have a measured time?
     Which customers will have a censored time? And how will it be calculated?
     What sample should you have?


Distributions Of Survival Time
Consider the distribution of the non-negative random variable T representing survival
time. Usually, T is assumed to be continuous.

The probability density function is

\[ f(t) = \lim_{\delta \to 0} \frac{P(t \le T < t + \delta)}{\delta} \]

For small δ, f(t)δ approximates the probability of failure in the interval from t to t + δ.

The cumulative distribution function gives the probability of failure at or before time t:

\[ F(t) = P(T \le t) = \int_0^t f(u)\,du \]




However, since we are more interested in the proportion of survivors, the function
most commonly used is the survival function

\[ S(t) = P(T > t) = 1 - F(t) \]


Finally, in studying survival we are often concerned with the risk of an individual
dying given that they have survived so far. This quantity is known as the hazard
function and is defined as
                                  h (t ) = f (t ) / S (t ) .

The notation λ(t) is also often used to describe a hazard.
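
The density, survival and hazard functions determine one another. As a short sketch using only the definitions above (these are standard results; the constant-hazard case is an illustrative assumption, with h_0 denoting the constant):

```latex
\begin{align*}
f(t) &= -\frac{d}{dt} S(t)
  \quad\Longrightarrow\quad
  h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt} \log S(t) \\
S(t) &= \exp\!\left( -\int_0^t h(u)\,du \right) \\
h(t) &= h_0 \ \text{(constant)}
  \quad\Longrightarrow\quad
  S(t) = e^{-h_0 t}, \qquad f(t) = h_0\, e^{-h_0 t}
\end{align*}
```

So a constant hazard corresponds to exponentially distributed survival times, and a hazard that rises or falls with t makes the survival curve drop more or less steeply over time.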

Comparisons between groups of individuals are often best made via the comparison of
their respective hazard functions.

If we had complete survival data for all members of a population then there would be
no difference between a survival time and any other measurement that we might
make, e.g. height or weight. However, unless we have complete survival data (i.e.
everyone dies or has a particular failure event), we cannot construct a histogram to
represent a survival distribution and we cannot calculate the mean survival.

Special methods are therefore needed to describe the data when some observations
may be censored.

Life Tables
We can use LIFE TABLE methods to estimate S(t), the proportion surviving, up to the
longest survival time observed, with censored data.

      Generally, with survival data it is the median rather than the mean that is used
       for the average time – why?
      If we have enough data on survival this may be able to give us a life table
       estimate of the median survival - equivalent to the time for 50% survival.
      SAS PROC LIFETEST can be used to calculate the life table’s estimates of
       survival.

Kaplan Meier Estimator
A non-parametric estimate of the survival function S(t) can be obtained directly from
the observed survival times. It has the form of a step function, stepping down at each
death. When some observations are censored at earlier times, as is often the case,
we need to use a method called the product-limit or Kaplan-Meier estimator to
calculate the survival function (Kaplan and Meier, 1958). The estimate takes the following
form:

\[ \hat{S}(t_i) = \prod_{j \le i} \bigl(1 - \hat{h}(t_j)\bigr) \]

where \hat{h}(t_j) is the estimated hazard at the j-th observed time t_j.

The method works by estimating, for each interval between successive observed times,
the proportion of those alive at the start of the interval who survive to the next, and then
calculating the overall survival as the product of these proportions.


Estimating the hazard
At time ti we have
     fi, the number of individuals experiencing failure at ti, and
     ri, the number of individuals at risk at ti, i.e. those who have not failed or been
     censored before ti.

If we assume that no explanatory variables are operating, so that all individuals
independently have a probability h(ti) of dying at time ti, then the number of failures
given the size of the risk set has a binomial distribution

\[ f_i \mid (R_i = r_i) \sim B\bigl(r_i, h(t_i)\bigr) \]

It can then be shown that:

\[ \hat{h}(t_i) = \frac{f_i}{r_i} \]
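
The step "it can then be shown" is a standard maximum-likelihood argument. A brief sketch, writing h for h(ti) and taking the binomial model above as given:

```latex
\begin{align*}
L(h) &\propto h^{f_i} (1 - h)^{r_i - f_i} \\
\frac{d}{dh} \log L(h) &= \frac{f_i}{h} - \frac{r_i - f_i}{1 - h} = 0
  \quad\Longrightarrow\quad
  \hat{h}(t_i) = \frac{f_i}{r_i}
\end{align*}
```

In words, the estimated hazard at ti is simply the proportion of those at risk who fail at that time.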
Example
A * in the following table indicates a censored time.

time      number at          number failing        hazard           1 – hazard   product limit
          risk, ri                fi                 fi/ri                       estimator of
                                                                                 survival function
0               13                  0                   0                 1                1
13              13                  1                0.0769            0.9231          0.9231
18              12                  1                0.0833            0.9167          0.8462
23              11                  1
70*             10                  0
76               9                  1
180              8                  1
195*             7                  0
210              6                  1
632              5                  1
700              4                  1
1296             3                  1
1990*            2                  0
2240*            1                  0
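
As a check on the arithmetic, the next few rows of the table can be worked through as follows. This is a sketch using exact fractions; the results agree with the Survival column of the LIFETEST output given later in these notes.

```latex
\begin{align*}
\hat{S}(23) &= \hat{S}(18)\left(1 - \tfrac{1}{11}\right)
             = \tfrac{11}{13} \cdot \tfrac{10}{11} = \tfrac{10}{13} \approx 0.7692 \\
\hat{S}(70) &= \hat{S}(23)\left(1 - \tfrac{0}{10}\right)
             = \tfrac{10}{13} \approx 0.7692 \quad \text{(censored time: no step)} \\
\hat{S}(76) &= \hat{S}(70)\left(1 - \tfrac{1}{9}\right)
             = \tfrac{10}{13} \cdot \tfrac{8}{9} = \tfrac{80}{117} \approx 0.6838
\end{align*}
```

A censored time leaves the estimate unchanged but removes one individual from the risk set for all later times, which is why the step at 76 is larger than it would otherwise be.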

Standard Errors for the Survival Function
There are several different formulae for the standard error, of which the most
commonly used is Greenwood's formula, derived from the binomial distribution.

\[ \mathrm{s.e.}(p_i) = p_i \sqrt{\sum_{j \le i} \frac{f_j}{r_j (r_j - f_j)}} \]

where p_i is the estimate of S(ti).
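
To illustrate Greenwood's formula, with the sum under the square root taken over all failure times up to and including ti, here is a sketch of the calculation for the first two failure times in the example table. The results match the Survival Standard Error column of the LIFETEST output given later.

```latex
\begin{align*}
\mathrm{s.e.}\bigl(\hat{S}(13)\bigr) &= 0.9231 \sqrt{\frac{1}{13 \times 12}} \approx 0.0739 \\
\mathrm{s.e.}\bigl(\hat{S}(18)\bigr) &= 0.8462 \sqrt{\frac{1}{13 \times 12} + \frac{1}{12 \times 11}} \approx 0.1001
\end{align*}
```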




Plotting the Survival Function

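The survival function is usually illustrated graphically as shown below.

[Figure: plot of the estimated survival function against time, as produced by the PROC LIFETEST code in the next section.]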




Lifetables on SAS
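Use PROC LIFETEST in SAS to calculate survival estimates; by default it gives the
product-limit (Kaplan-Meier) estimates. When working from data files, we need a survival
time and a censoring indicator (usually set to 0 if the person/item has been censored,
and 1 otherwise). In the TIME statement, the value in brackets after the indicator
variable is the value that marks a censored observation.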

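/* Survival times with a censoring indicator (status: 0 = censored, 1 = failed) */
data example1;
   input time status;
   datalines;
13 1
18 1
23 1
70 0
76 1
180 1
195 0
210 1
632 1
700 1
1296 1
1990 0
2240 0
;
run;

/* Product-limit estimates and survival plot */
proc lifetest data=example1 plots=(s);
   time time*status(0);
run;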
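
The code above gives product-limit estimates. If a grouped life table of the kind described under Life Tables is wanted instead, the METHOD=LT option of PROC LIFETEST can be used. The following is a sketch only: the interval endpoints (every 500 days up to 2500) are an arbitrary assumption chosen for illustration and are not specified in these notes.

```sas
/* Life-table (actuarial) estimates for the same data.          */
/* The interval endpoints (0 to 2500 by 500) are an assumption. */
proc lifetest data=example1 method=lt intervals=(0 to 2500 by 500) plots=(s);
   time time*status(0);
run;
```

With so few observations the grouped estimates will be coarse; the life-table method is more useful for large data sets where exact times are tied or only recorded by interval.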



Output

The above code gives the survival plot already shown and the following output.

                                                The LIFETEST Procedure

                                      Product-Limit Survival Estimates

                                                                 Survival
                                                                 Standard      Number     Number
                    time        Survival          Failure         Error        Failed      Left

                   0.00          1.0000                 0              0          0        13
                  13.00          0.9231            0.0769         0.0739          1        12
                  18.00          0.8462            0.1538         0.1001          2        11
                  23.00          0.7692            0.2308         0.1169          3        10
                  70.00*              .                 .              .          3         9
                  76.00          0.6838            0.3162         0.1315          4         8
                 180.00          0.5983            0.4017         0.1401          5         7
                 195.00*              .                 .              .          5         6
                 210.00          0.4986            0.5014         0.1480          6         5
                 632.00          0.3989            0.6011         0.1483          7         4
                 700.00          0.2991            0.7009         0.1408          8         3
                1296.00          0.1994            0.8006         0.1242          9         2
                1990.00*              .                 .              .          9         1
                2240.00*              .                 .              .          9         0

                      NOTE: The marked survival times are censored observations.


                                 Summary Statistics for Time Variable time

                                                   Quartile Estimates

                                                   Point        95% Confidence Interval
                                Percent          Estimate         [Lower      Upper)

                                     75           1296.00         210.00           .
                                     50            210.00          76.00       1296.00
                                     25             76.00          18.00        632.00


                                                   Mean       Standard Error

                                                 567.49               165.65

     NOTE: The mean survival time and its standard error were underestimated because the
largest observation was censored and the estimation was restricted to the largest event time


                           Summary of the Number of Censored and Uncensored Values

                                                                             Percent
                                     Total        Failed       Censored     Censored

                                           13             9           4        30.77




Comparing The Survival of Two or More Groups
The following data give survival times for 12 patients diagnosed with brain tumours.
The patients were randomly assigned to receive radiation therapy or radiation plus
chemotherapy. One year after the start of the study the survival times in weeks were:

Group 1: Radiotherapy only
                   10     26           28       30     41     12*

Group 2: Radiotherapy plus chemotherapy
                   24     30    42    15*              40*    42*

Using SAS the following plot was obtained. The upper line is for group 2. At the
beginning the survival curves are close together, suggesting that there is little
difference between the two treatments with respect to survival. After about half a year
(26 weeks) there is an indication of a difference between the two groups, with those in
group 2 having a higher chance of survival than those in group 1.

[Figure: Kaplan-Meier survival curves for the two treatment groups against time in weeks.]




Log rank test
This test can be used to formally ascertain whether or not there are differences
between two groups. This test defines intervals on the time scale corresponding to the
observed survival times. For each time interval the number of observed events for
each group is compared to the number expected if there were no differences between
the groups. A chi-squared statistic is calculated from this information and compared to
the χ² distribution with 1 d.f.

Hypotheses
For the log-rank test the null hypothesis is that the survival functions for the two
groups are the same and the alternative is that, for some value of t, they differ.
Formally:

H0:    S1(t) = S2(t) for all t
H1:    S1(t) ≠ S2(t) for some t




Results
The result of carrying out the log rank test for these data on SAS is:

                       Test        Chi-Square    DF    Pr > Chi-Square
                       Log-Rank        2.8823     1             0.0896

Since the p-value of 0.0896 is greater than 0.05, there is no evidence at the 5% level
of a difference in survival between the two groups.
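
The plot and test for the brain tumour data could be obtained with code along the following lines. This is a sketch that mirrors the PROC LIFETEST calls used elsewhere in these notes; the data set and variable names (tumour, group, weeks, status) are assumptions made for illustration.

```sas
/* Brain tumour data: group 1 = radiotherapy only,       */
/* group 2 = radiotherapy plus chemotherapy.             */
/* status: 1 = died, 0 = censored (the starred times).   */
data tumour;
   input group weeks status @@;
   datalines;
1 10 1  1 26 1  1 28 1  1 30 1  1 41 1  1 12 0
2 24 1  2 30 1  2 42 1  2 15 0  2 40 0  2 42 0
;
run;

/* Survival curves by group; STRATA also requests the log-rank test */
proc lifetest data=tumour plots=(s);
   time weeks*status(0);
   strata group;
run;
```

The trailing @@ on the INPUT statement lets several observations share one data line, which keeps small data sets compact.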

Note:
    The test is most powerful when any difference between the two groups
      is not dependent on time. This is known as the proportional hazards
      assumption. If it is not valid (e.g. the survival curves cross), the test
      may fail to detect a real difference.
    The test can be extended to compare more than 2 groups. The degrees of
      freedom will be the number of groups minus 1.
    For small data sets the test is not very powerful.
    More information on the log rank test is given in Appendix 2.




Tutorial 1

Note: Text copies of some of the data files are given on the webpage

Solutions to the tutorial can also be found there.

1. The following data give the survival times, in years, after diagnosis with
   parathyroid cancer. A * indicates that the observation is censored.

   1 1*         1*     2*        2*    3             5*    6*   7    7       8
   9* 10        10     10*       11    14            17*

   Construct a Kaplan Meier survival plot for the data by hand and hence estimate the
   median survival time after diagnosis.

2. The following data set contains survival times for 25 cancer patients. The patients
   were randomly assigned to one of two drug treatments TREAT = 1 or 2. DUR is
   the time in days from treatment until death or censoring. STATUS has a value of 1
   if the patient died and 0 if the observation is censored.

                ID    DUR     STATUS   TREAT

                  1       8       1         1
                  2     180       1         2
                  3     632       1         2
                  4     852       0         1
                  5      52       1         1
                  6    2240       0         2
                  7     220       1         1
                  8      63       1         1
                  9     195       1         2
                 10      76       1         2
                 11      70       1         2
                 12       8       1         1
                 13      13       1         2
                 14    1990       0         2
                 15    1976       0         1
                 16      18       1         2
                 17     700       1         2
                 18    1296       0         1
                 19    1460       0         1
                 20     210       1         2
                 21      63       1         1
                 22    1328       0         1
                 23    1296       1         2
                 24     365       0         1
                 25      23       1         2

   a) Enter the data into SAS.

   b) Obtain an estimate of the survival function, ignoring the different treatments,
   using the following commands. What is the median survival time? What can you
   say about the mean survival time?

             proc lifetest data = filename plots = (s);
             time dur*status(0);
             run;



c) To investigate if the different treatments lead to different survival times run the
following program and comment on the results.

      proc lifetest data = filename plots = (s);
      time dur*status(0);
      strata treat;
      run;

3. Twenty volunteers, 10 of normal weight and 10 overweight, underwent an exercise
   stress test. In the test they had to lift a progressively increasing load for up to 12
   minutes but were allowed to stop earlier if they had reached their limit. On two
   occasions the equipment failed before the end of the test. The times for the two
   groups were:

       Normal weight

      4        10      12*     2       8          12*   8*     6       9      12*

       Overweight

          7*   5       11      6       3          9     4      1       12*    7

       Enter the data into SAS and obtain the survival plots and the log-rank test
       using the Lifetest procedure. Hence comment on the effect of being
       overweight on the results of the stress test.

4. The data below give the remission times (in weeks) of two groups of patients with
   leukaemia. The patients were randomly allocated to a new treatment or to a control
   group. An asterisk indicates that the observation is censored.

       Drug-6-MP       6*      6       6          6     7      9*      10*    10
                       11*     13      16         17*   19*    20*     22     23
                       25*     32*     32*        34*   35*

       Control         1       1       2          2     3      4       4      5
                       5       8       8          8     8      11      11     12
                       12      15      17         22    23

       Using SAS, obtain a plot of the Kaplan-Meier survival function for the data
       showing separate plots for each treatment. Test for differences between the
       survival distributions of the two groups and comment on the nature of any
       difference that exists.

5. A longitudinal survey conducted in America included information on times to
   divorce for married couples. The SAS data set marriage which can be found on the
   Medical Statistics web page contains the following variables:

       heduc: education of the husband, coded 0 (< 12 years), 1 (12-15), 2 (>15years)
       heblack: coded 1 if the husband is black, 0 otherwise
       mixed: coded 1 if the marriage is of mixed ethnicity, 0 otherwise
       years: duration of the marriage from date of wedding to divorce, widowhood or
       end of study
       divorce: coded 1 for divorce or 0 for censoring

(a) Use proc freq to ascertain the number of marriages which ended in divorce.
(b) Carry out chi-squared tests to investigate whether the number of marriages
    ending in divorce is influenced by any of the three covariates, heduc, heblack,
    mixed.
(c) Obtain survival plots and carry out the log-rank test for each of these variables
    and comment on the nature of any differences found.




Appendix 1: Handling dates in SAS

In practice data sets will often be given in the form of dates rather than times. Try the
following to find out some of the facilities for handling dates in SAS.

Example 1
SAS records dates as the number of days from a reference date. The following should
illustrate this.

data test;
      datezero=0;
      dateone=1;
      datek=1000;
run;
proc print data=test;
run;

To find out what the reference date is:

proc print data=test;
      format date: ddmmyy8.;
run;

Example 2
The following shows some of the ways dates can be entered and ways in which they
can be printed. They will always be stored as the number of days from the reference
date. Note that this allows lengths of time, e.g. survival times, to be calculated.

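data new;
   /* One date per data line; each informat reads from column 1, */
   /* so the data lines must not be indented.                    */
   input date1 date7. / date2 date9. / date3 ddmmyy8. / date4 ddmmyy12.;
   survival = date3 - date1;   /* difference in days */
   datalines;
18Jan99
4Apr2003
18/12/02
20/05/2001
;
run;
proc print data=new;
run;
proc print data=new;
   format date: date9.;
run;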
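
Combining date arithmetic with the censoring rules from earlier, a survival time and censoring indicator can be derived from an entry date, a failure date and an end-of-study date. The sketch below is illustrative only: the data set name, variable names, the three records and the end-of-study date of 31 December 1992 are all hypothetical.

```sas
/* Hypothetical records: id, entry date and failure date */
/* (failure is missing, shown as ., if no failure was observed). */
data surv;
   input id entry :date9. failure :date9.;
   endstudy = '31DEC1992'd;        /* assumed end of observation */
   if missing(failure) then do;
      time = endstudy - entry;     /* censored at end of study */
      status = 0;
   end;
   else do;
      time = failure - entry;      /* observed survival time in days */
      status = 1;
   end;
   format entry failure endstudy date9.;
   datalines;
1 15JAN1986 20MAR1988
2 01JUN1987 .
3 10OCT1989 05FEB1991
;
run;

proc print data=surv;
run;
```

The resulting time and status variables are in the form expected by the TIME statement of PROC LIFETEST, i.e. time*status(0).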


Example 3
Finally – just in case you can’t remember what day today is!
data day;
      date=today();
      dayoweek=weekday(date);
run;
proc format;
      value wkday 1='Sun' 2='Mon' 3='Tues' 4='Wed' 5='Thurs' 6='Fri' 7='Sat';
run;
proc print data=day;
      format date ddmmyy10. dayoweek wkday.;
run;



Appendix 2: Log Rank Test

The test compares the observed number of deaths at each point in time with the
expected number of deaths if the null hypothesis is true. The expected numbers of
deaths at each time point are calculated by sharing the number of deaths at that point
between the groups in proportion to the numbers at risk in each group. For example, using
the brain tumour data from the section on comparing groups, the expected number of deaths
at time t=10, assuming that H0 is true, is calculated as follows.

There are 6 at risk in each group at that time and 1 death in total, so we would expect
1×6/12 of a death to be in the first group and 1×6/12 to be in the second group.

Time   Number at risk    Number censored    Observed deaths    Expected deaths
       r1i   r2i   ri    c1i   c2i   ci     d1i   d2i   di     e1i     e2i
10      6     6    12     0     0     0      1     0     1     1/2     1/2
12      5     6    11     1     0     1      0     0     0     0       0
15      4     6    10     0     1     1      0     0     0     0       0
24      4     5     9     0     0     0      0     1     1     4/9     5/9
26      4     4     8     0     0     0      1     0     1     1/2     1/2
28      3     4     7     0     0     0      1     0     1     3/7     4/7
30      2     4     6     0     0     0      1     1     2     2/3     4/3
40      1     3     4     0     1     1      0     0     0     0       0
41      1     2     3     0     0     0      1     0     1     1/3     2/3
42      0     2     2     0     1     1      0     1     1     0       1
Total                                        5     3           2.87    5.13

The log-rank statistic is:

\[ \sum_{k} \frac{(O_k - E_k)^2}{E_k} \]

where, for this example, O1 = 5, O2 = 3, E1 = 2.87 and E2 = 5.13.

Thus the log rank statistic is (O1 − E1)²/E1 + (O2 − E2)²/E2 = 2.46. This is compared
with the χ² distribution with 1 degree of freedom. Since 2.46 < 3.84 there is no evidence of a
difference in survivor functions at the 5% significance level.



        The value obtained here is slightly different to that given by SAS which
         calculates the statistics in a different way. This doesn’t affect the conclusions.
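
One common variance-based form of the log-rank statistic reproduces the SAS value for these data; that this is exactly how SAS computes it is an assumption here. It replaces the denominators E1 and E2 by the hypergeometric variance of the number of deaths in group 1, summed over the times at which deaths occur, using the quantities in the table above:

```latex
\begin{align*}
V &= \sum_i \frac{r_{1i}\, r_{2i}\, d_i\, (r_i - d_i)}{r_i^2\, (r_i - 1)}
   = \frac{1}{4} + \frac{20}{81} + \frac{1}{4} + \frac{12}{49} + \frac{16}{45} + \frac{2}{9}
   \approx 1.570 \\
\chi^2 &= \frac{(O_1 - E_1)^2}{V} = \frac{(5 - 2.873)^2}{1.570} \approx 2.88
\end{align*}
```

This agrees with the value of 2.8823 reported by SAS, and it is still below 3.84, so the conclusion is unchanged.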



