Regression Spline-Based Flexible Relative Survival Model

Document Sample
Regression Spline-Based Flexible Relative Survival Model Powered By Docstoc
                                    Contents                                                                  1 of 17


                        Regression Spline-Based Flexible Relative Survival Model:
                                  Estimation, Inference and Validation
               Michal Abrahamowicz1; Amel Mahboubi1; Jean Faivre2; Claire Bonithon-Kopp3; Catherine Quantin34
                 Department of Epidemiology and Biostatistics, McGill University, Montreal, Que., Canada H3A 1A1.
                 Inserm, U866, Dijon, F-21079, France; Univ Bourgogne, Registre Bourguignon des Cancers Digestifs,
               Dijon, F-21079, France.
                 Inserm, U866, Dijon, F-21079, France; Univ Bourgogne, Dijon, F-21079, France.
                 Department of Biostatistics and Medical Informatics. Dijon University Hospital, France.

               Funding: A Mahboubi was supported by a research grant from the Regional Council of Burgundy and INSERM. M.
               Abrahamowicz is a James McGill Professor of Biostatistics at McGill University. This research was supported by the
               NSERC grant # 228203 and the CIHR grant # MOP-81275.

               1.   Introduction
               Relative survival is an increasingly popular approach to modelling censored survival data in applications
               where the individual causes of death remain unknown, such as cohort studies based on cancer registries
               (Dickman, Sloggett et al. 2004). Several relative survival regression models have been proposed to
               discriminate between the effects of covariates on disease-specific mortality from their effects on other-causes
               mortality (Hakulinen and Tenkanen 1987; Esteve, Benhamou et al. 1990; Giorgi, Abrahamowicz et al. 2003;
               Dickman, Sloggett et al. 2004; Stare, Henderson et al. 2005; Remontet, Bossard et al. 2007).
                                    Contents                                                              2 of 17


               The current study is motivated by our belief that, similar to crude survival, an accurate assessment of the
               effects of continuous prognostic factors on relative survival requires simultaneous modelling and testing of
               both time-dependent and non-loglinear effects. In particular, we expand the relative survival model originally
               proposed by Esteve et al (Esteve, Benhamou et al. 1990) in a similar way Abrahamowicz and MacKenzie
               (Abrahamowicz and MacKenzie 2007) recently expanded the Cox’s model for crude survival. In the flexible
               ‘product model’ developed by Abrahamowicz and MacKenzie (Abrahamowicz and MacKenzie 2007), the
               effect of a continuous covariate on the logarithm of the hazard is modelled as a product of two ‘marginal’

               functions, which represent, respectively, its time-dependent effect and non-linear relationship with the log-


               2.   Joint Flexible Modelling of Time-Dependent and Non-Loglinear Effects in Relative Survival
               We propose a flexible extension of the parametric relative survival regression model proposed by Esteve et
               al. (Esteve, Benhamou et al. 1990). Their additive-hazards model assumes that the observed hazard for all-
               cause mortality, λt , at time t after diagnosis of an individual with age a at diagnosis and with a covariate
               vector z, is the sum of two hazards:
                                                      λt (t z, a ) = λe (t + a z s ) + λc (t z )                      (1)

                                     Contents                                                                  3 of 17


               The first component, λ e , represents the expected hazard function for all-causes mortality in the underlying
               general population, stratified on a limited subset zs of covariates, typically age and sex, and is obtained from
               relevant mortality statistics using external sources.
               The second component, λ c , is the hazard function for disease-related mortality, which is estimated from the
               data at hand. Disease-specific hazard, conditional on the covariate vector z is expressed as (Esteve,
               Benhamou et al. 1990):                                                p          r
                                                                λc ( t z ) = exp ( ∑ β i z i ) ∑τ k I k (t )               (2)
                                                                       th           i =1       k =1
               where τk is the baseline mortality hazard in the k time-segment for patients with z=0, Ik is an indicator

               function (Ik=1 if tk-1 ≤ t < tk, and 0 otherwise), and βi (i=1,…,p) are the log hazard ratios corresponding to
               each of the p covariates.

               Model (2) carries the assumption of proportionality of hazards (PH), as βi are constant, i.e. independent of
               time t. Moreover, for quantitative predictors, βi is multiplied by an un-transformed predictor value zi, which
               implies log-linearity (LG), i.e. linearity of the predictor effect on the log hazard.

               Our new model respects the assumption of the additivity of the hazards of (i) all-causes and (ii) disease-
               specific mortality, as postulated by equation (1). The difference between our new model and the model
               originally proposed by Esteve et al (Esteve, Benhamou et al. 1990) concerns equation (2). In particular, we
                                     Contents                                                                   4 of 17


               replace the parametric modelling of both baseline disease-specific hazard, and of covariates effects on this
               hazard, by a more flexible approach. Firstly, similar to relative survival models proposed by Giorgi et al
               (Giorgi, Abrahamowicz et al. 2003) and Remontet et al (Remontet, Bossard et al. 2007), to avoid clinically
               implausible ‘jumps’ in the absolute risks, we assume the baseline hazard is a smooth function of the follow-
               up time γ(t) rather than a step function as in (2). Secondly, when modelling the effects of continuous
               covariates on the baseline log hazard, we simultaneously relax both the PH and the log-linearity assumptions.
               Specifically, following the flexible product model proposed for crude survival by Abrahamowicz and

               MacKenzie (Abrahamowicz and MacKenzie 2007), we model the logarithm of the hazard ratio associated

               with a value zi of a (continuous) ith covariate, at time t, as a product of two functions: α i ( z i ) and β i (t ) .
               Accordingly, we write the disease-specific hazard, conditional on the covariate vector z, as:

                                                 λc (t z ) = exp(γ (t )) * exp(∑ α i ( z i ) * β i (t ))                     (3)
               As in (Abrahamowicz and MacKenzie 2007), α i ( z i ) in (3) represents a possibly non-linear dose-response
               function (drf), i.e. describes how the log hazard ratio changes with increasing value of the ith covariate.
               Because in (3), this function does not depend on t, the shape of the drf is assumed to remain unchanged over
               time. However, the impact of these differences in the covariate values does vary over time because α i ( z i ) is
               multiplied by the time-dependent function (tdf) β i (t ) . In other words, higher values of β i (t ) indicate that
               at a corresponding time t the differences in the values of zi have bigger impact on the disease-specific hazard.
                                     Contents                                                                  5 of 17


               Similar to Abrahamowicz and MacKenzie (Abrahamowicz and MacKenzie 2007) and Giorgi et al (Giorgi,
               Abrahamowicz et al. 2003), all three functions in (3) are estimated by low-dimension quadratic regression
               splines, i.e. smooth piecewise quadratic polynomials, whose pieces join each other at pre-specified argument
               values termed knots (Ramsay 1988). Using quadratic splines ensures the continuity of the estimated function
               and its first derivative, while limiting the number of the estimated coefficients, which enhances statistical
               power for testing non-linear and/or time-dependent effects (Abrahamowicz, MacKenzie et al. 1996).
               Specifically, following (Abrahamowicz and MacKenzie 2007), we used 1 knot, i.e. a 2-pieces quadratic
               spline to model the non-linear drf α i ( z i ) , and 2 knots, corresponding to a 3-pieces spline for both functions

               of time: the time-dependent function βi(t) and the baseline log hazard γ(t). Finally, knots were positioned so
               as to ensure optimal and equal data support in each of the resulting intervals (Ramsay 1988). This implied a
               single knot at the sample median value of a continuous covariate for α i ( z i ) , and two knots at the terciles of
               the observed distribution of un-censored event times for both β i (t ) and γ(t) (Abrahamowicz, Ciampi et al.
               1992). Accordingly, in our model (3):
                                                          α i ( z i ) = ∑ α il * Bl ,3 ( z i )                            (4a)
                                                                        l =1

                                                           β i (t ) = ∑ β i , j * B j ,3 (t )                             (4b)
                                                                       j5 1
                                                            γ (t ) = ∑ γ k * Bk ,3 (t )                                   (4c)
           T                                                           k =1
                                    Contents                                                                  6 of 17


               where B.,3 are quadratic, i.e. order 3, B-splines (de Boor 1978).

               Estimation of model (3) poses an un-identifiability problem because the two functions of the same covariate
               α i ( z i ) and β i (t ) ‘share the scale’ (Abrahamowicz and MacKenzie 2007). Abrahamowicz and MacKenzie
               (Abrahamowicz and MacKenzie 2007) suggested to circumvent this problem by imposing a ‘scale’
               constraint on α i ( z i ) , so that the range of its estimated values is a priori constrained to 1. This implies a
               constraint that a value of β i (t ) , at a given time t, represents the adjusted difference in log hazards between

               the values of z i corresponding to, respectively, the highest and the lowest risk (Abrahamowicz and

               MacKenzie 2007). However, in the joint estimation procedure employed by Abrahamowicz and MacKenzie,
               implementation of the scale constraint required a complex trial-and-error procedure and caused occasional
               non-convergence problems (Abrahamowicz and MacKenzie 2007).

               To achieve an efficient full maximum likelihood estimation of model (3), we use an iterative alternating
               conditional algorithm. Alternating conditional algorithms are especially useful for estimation of complex
               models, involving different types of coefficients, some of which are subject to specific constraints that do not
               apply to other estimable coefficients (Ramsay and Abrahamowicz 1989; Abrahamowicz and Ramsay 1992).
               The basic idea is to partition the estimation into consecutive steps, each of which involves estimation of only
               a subset of all coefficients of interest. Simulations indicated very good performance of the alternating
                                      Contents                                                                 7 of 17


               conditional estimation of complex regression spline-based models with various constraints (Ramsay and
               Abrahamowicz 1989; Abrahamowicz and Ramsay 1992).

               To estimate model (3), we implement a 3-step alternating conditional full maximum likelihood estimation
               (mle) algorithm. The 3 steps involve flexible estimation of, respectively, (1) baseline log hazard function
               γ(t), defined in (4c); (2) non-linear dose-response functions α i ( z i ) in (4a), for each continuous covariate
               z i ; and (3) time-dependent effect β i (t ) in (4b). At each step, the estimation of respective coefficients is

               conditional on the most recent estimates of all the coefficients of the two other functions. The iterations stop

               after the total reduction in the negative log likelihood (LogL) relative to the previous iteration, i.e. the sum of
               the changes in LogL attained at each of the three steps, is below 0.1.

               3.       Application
               We re-assess the role of prognostic factors for mortality among 3367 patients diagnosed with colon cancer.
               Time 0 was the date of the cancer diagnosis. Time-to-event was defined as the time to death of any cause.
               The cause of death was not recorded. There were 1593 deaths during 5 years of follow-up (52.7% censoring

                                    Contents                                                               8 of 17


               Available prognostic factors included two continuous covariates: age at diagnosis and calendar period of
               diagnosis (1976 to 2000), as well as gender, tumour location (right vs left colon), and cancer stage at
               diagnosis, categorized into 4 stages (I-IV). Period of diagnosis was included among ‘prognostic factors’ to
               investigate if survival of colon cancer patients did change between 1976 and 2000.

               Parametric multivariable regression models, which impose conventional PH and log-linearity assumptions
               included the Cox’s proportional hazards (PH) model (Cox 1972) and the relative survival model of Esteve et
               al (Esteve, Benhamou et al. 1990). In all relative survival analyses, the expected hazard mortality ( λ e ) was

               quantified based on published age-, gender- and year of death-specific mortality rates in the general
               population of Côte d’Or, region in France, in which the 3367 cancer patients were identified. In both models
               older age, male gender, and higher cancer stage at diagnosis were all associated with significantly higher
               mortality. For period, the adjusted HR was significantly lower than 1.0 and suggested that with each
               consecutive year between 1976 and 2000 the risk of death decreased by about 3%.

               In our flexible multivariable ‘product model’ (3) for relative survival, for the two continuous covariates, age
               at diagnosis and calendar period of diagnosis, we modelled both non-linear and time-dependent effects
               (equations (4a) and (4b)). For each of the dummy variables for stages II-IV, we estimated their time-
               dependent effects. In contrast, we a priori assumed that the effects of gender and cancer location were
                                    Contents                                                              9 of 17


               constant over time and represented these effects by a single parameter (adjusted HR). Finally, the baseline
               log hazard γ(t) was modelled by the regression spline expansion defined in (4c).

               Table 1 summarizes the results of non-parametric tests. Each continuous variable shows very significant
               violations of both log-linearity and PH assumption, and the latter is also rejected for the two more advanced
               cancer stages (Table 1). Accordingly, we include all these effects in our final model.

                              Table 1: p-values for the non-parametric LRT tests of non-linear (NL) and
                                 time-dependent (TD) effects in the full multivariable flexible model.

                                       Variable            LRT for Test of NL LRT for Test of TD
                                       Age                         0,0007                 0,0001
                                       Period of diagnosis          0,017                 0,0018
                                       Stage II                                           0,2715
                                       Stage III                                          0,0001
                                       Stage IV                                           0,0001

               Figure 1 shows the adjusted spline estimates of time-dependent HR, relative to stage I, for cancer stages II-
               IV, obtained from the full flexible multivariable model. For advanced stages III and IV, the relative risks
               vary considerably over time, increasing steeply in the first year and then decreasing gradually. Comparison
               between the estimates shown in Figure 1 and the Esteve et al’s model reported in the right part of Table 1
                                    Contents                                                              10 of 17


               indicates that, by imposing the incorrect PH assumption, the latter model substantially under-estimates the
               impact of high cancer stages on disease-specific mortality around 1-2 years after diagnosis.

                                 Figure 1: Time-dependent effects of cancer stages, relative to stage I.
                                     Contents                                                            11 of 17


               Figure 2 describes the adjusted effect of age at diagnosis on the log hazard of colon cancer-specific
               mortality. The spline estimate of the dose-response function (drf) in Figure 2(a) is non-monotone, with the
               lowest risk for patients diagnosed around 50 years, and higher risks for both very young and older subjects,
               which explains why the log-linearity hypothesis was rejected (p=0.0007 in Table 2). However, the significant
               time-dependent effect (tdf) in Figure 2(b), indicates that the impact of age at diagnosis is mostly limited to

               the first 0.5 year after diagnosis. The 3D surface in Figure 2(c) shows how the log hazard (vertical axis)

               changes with increasing age at diagnosis (horizontal axis) and follow-up time (the 3rd axis). Figure 2(d)
               compares the drf’s for age corresponding to selected times after cancer diagnosis, i.e. shows the horizontal
               cross-sections of the 3D surface at selected follow-up times. Whereas all four curves in Figure 2(d) replicate
               the U-shaped drf from panel (a), the curve for 6 months after diagnosis shows much bigger differences in
               risk than the four curves for the later phases of disease.

                 Contents                                                           12 of 17


           T   Figure 2: Flexible modeling of the effects of age at cancer diagnosis.
                                   Contents                                                            13 of 17


               Figure 3 focuses on the adjusted effect of the calendar period of cancer diagnosis. Even if excess mortality
               generally decreased between 1976 and 2000, the drf curves in Figure 3 suggest two distinct calendar periods,
               of improving survival (1976-1985 and 1995-2000).

           T         Figure 3: Non-linear effects of calendar year at diagnosis, at selected times since diagnosis.
                                    Contents                                                             14 of 17


               4.   Discussion
               In the context of ‘crude survival’, Abrahamowicz and MacKenzie (Abrahamowicz and MacKenzie 2007),
               incorporate flexible modelling of both non-linear and time-dependent effects of continuous covariates. The
               same paper demonstrates that accounting for both effects is essential to avoid biased estimates and/or
               incorrect conclusions (Abrahamowicz and MacKenzie 2007). Therefore, we have developed a flexible
               relative survival model, which generalizes the additive-hazards relative survival model of Esteve et al
               (Esteve, Benhamou et al. 1990) by implementing the ‘product model’ approach proposed by Abrahamowicz

               and MacKenzie (Abrahamowicz and MacKenzie 2007) to estimate the effects of continuous prognostic

               factors on the hazard of disease-specific mortality. Flexible modelling of a continuous age effect revealed a
               statistically significant non-monotonicity, with a clinically important decrease in disease-specific mortality
               between 30 and 50 years of age at diagnosis (Figure 2). For the period of cancer diagnosis, Figure 3 shows
               that clinically meaningful improvements in disease-specific survival occurred in two distinct periods: 1976-
               1985 and 1995-2000. It is likely that the two periods reflect different changes in medical or surgical

               Additional insights were obtained from flexible estimates of time-dependent effects. Age at diagnosis was
               found to have very important impact on the risk of dying almost immediately after diagnosis but matter little

                                     Contents                                                             15 of 17


               after the first year of follow-up. In contrast, higher cancer stage had little impact on mostly post-surgical
               mortality in the first 6 months after diagnosis.

               Both our relative survival model (3) and the crude survival model in (Abrahamowicz and MacKenzie 2007)
               assume that the two functions of the same covariate α i ( z i ) and β i (t ) are multiplied by each other. This
               implies that the shape of the flexible drf α i ( z i ) does not change over time, even if the strength of the
               covariate impact may change, as reflected by the tdf estimate β i (t ) (Abrahamowicz and MacKenzie 2007).

               Future research should develop efficient tests of the shape-invariance assumption, as well as reliable and

               user-friendly regression diagnostic methods for complex relative survival models. The methods proposed by
               Stare et al (Stare, Pohar et al. 2005) may be useful in this context.

               We add another method to the growing toolbox for flexible modelling of relative survival (Giorgi,
               Abrahamowicz et al. 2003; Lambert, Smith et al. 2005; Remontet, Bossard et al. 2007). Our model (3) can be
               considered an extension of the spline-based time-dependent relative survival model developed by Giorgi et al
               (Giorgi, Abrahamowicz et al. 2003). Limited comparisons between the two models confirmed that, similar to
               crude survival (Abrahamowicz and MacKenzie 2007), accounting for the non-loglinear effect of a
               continuous covariate may substantially change the estimate of its time-dependent effect. Indeed, in our
               dataset the model of Giorgi et al did not detect any significant time-dependent effect of the period of
                                    Contents                                                              16 of 17


               diagnosis (data not shown), in contrast to our results (Table 1). It would be interesting to explore if and how
               inclusion of the possibly non-loglinear effects of continuous covariates affects the fractional polynomials
               estimates of time dependent effects in multiplicative and additive relative survival models proposed by
               Lambert et al (Lambert, Smith et al. 2005), and to compare it with the flexible spline-based relative survival
               model of Remontet et al (Remontet, Bossard et al. 2007). In spite of some differences in the formulation of
               the model, its estimation and inference, all these models enhance our understanding of disease progression,
               and may ultimately improve clinical prognosis and management of various cancers.

               [1]   Abrahamowicz, M., A. Ciampi, et al. (1992). "Nonparametric density estimation for censored survival
                     data: regression-spline approach." The Canadian Journal of Statistics 20(2): 171-185.
               [2]   Abrahamowicz, M., T. MacKenzie, et al. (1996). "Time-dependent hazard ratio: modelling and
                     hypothesis testing with application in lupus nephritis " Journal of the American Statistical Association
                     91(436): 1432-1439.
               [3]   Abrahamowicz, M. and T. A. MacKenzie (2007). "Joint estimation of time-dependent and non-linear
                     effects of continuous covariates on survival." Statistics in Medicine 26(2): 392-408.
               [4]   Abrahamowicz, M. and J. O. Ramsay (1992). "Multicategorical spline model for time response theory."
                     Psychometrika 57(1): 5-27.
               [5]   Cox, D. R. (1972). "Regression models and life tables (with discussion)." Journal of the Royal
                     Statistical Society, Series B 34: 187-220.
           T   [6]   de Boor, C. (1978). A Practical guide to Splines. New York.
                                    Contents                                                              17 of 17

               [7]   Dickman, P. W., A. Sloggett, et al. (2004). "Regression models for relative survival." Statistics in
                     Medicine 23(1): 51-64.
               [8]   Esteve, J., E. Benhamou, et al. (1990). "Relative survival and the estimation of net survival: elements
                     for further discussion." Statistics in Medicine 9(5): 529-38.
               [9]   Giorgi, R., M. Abrahamowicz, et al. (2003). "A relative survival regression model using B-spline
                     functions to model non-proportional hazards." Statistics in Medicine 22(17): 2767-84.
               [10] Hakulinen, T. and L. Tenkanen (1987). "Regression analysis of survival rates." Journal of the Royal
                    Statistical Society, Series C 36: 309-317.
               [11] Lambert, P. C., L. K. Smith, et al. (2005). "Additive and multiplicative covariate regression models for

                    relative survival incorporating fractional polynomials for time-dependent effects." Statistics in

                    Medicine 24(24): 3871-85.
               [12] Ramsay, J. O. (1988). "Monotone regression splines in action (with discussion)." Statistical Science
                    3(4): 425-461.
               [13] Ramsay, J. O. and M. Abrahamowicz (1989). "Binomial regression with monotone splines: A
                    psychometric approach." Journal of the American Statistical Association 84: 906-915.
               [14] Remontet, L., N. Bossard, et al. (2007). "An overall strategy based on regression models to estimate
                    relative survival and model the effects of prognostic factors in cancer survival studies." Statistics in
                    Medicine 26(10): 2214-28.
               [15] Stare, J., R. Henderson, et al. (2005). "An individual measure of relative survival." Journal of the Royal
                    Statistical Society, Series C 54: 115-126.
               [16] Stare, J., M. Pohar, et al. (2005). "Goodness of fit of relative survival models." Stat Med 24(24): 3911-