Contents 1 of 17
Regression Spline-Based Flexible Relative Survival Model:
Estimation, Inference and Validation
Michal Abrahamowicz1; Amel Mahboubi1; Jean Faivre2; Claire Bonithon-Kopp3; Catherine Quantin34
Department of Epidemiology and Biostatistics, McGill University, Montreal, Que., Canada H3A 1A1.
Inserm, U866, Dijon, F-21079, France; Univ Bourgogne, Registre Bourguignon des Cancers Digestifs,
Dijon, F-21079, France.
Inserm, U866, Dijon, F-21079, France; Univ Bourgogne, Dijon, F-21079, France.
Department of Biostatistics and Medical Informatics. Dijon University Hospital, France.
E-mail: email@example.com; firstname.lastname@example.org; email@example.com
Funding: A Mahboubi was supported by a research grant from the Regional Council of Burgundy and INSERM. M.
Abrahamowicz is a James McGill Professor of Biostatistics at McGill University. This research was supported by the
NSERC grant # 228203 and the CIHR grant # MOP-81275.
Relative survival is an increasingly popular approach to modelling censored survival data in applications
where the individual causes of death remain unknown, such as cohort studies based on cancer registries
(Dickman, Sloggett et al. 2004). Several relative survival regression models have been proposed to
discriminate between the effects of covariates on disease-specific mortality from their effects on other-causes
mortality (Hakulinen and Tenkanen 1987; Esteve, Benhamou et al. 1990; Giorgi, Abrahamowicz et al. 2003;
Dickman, Sloggett et al. 2004; Stare, Henderson et al. 2005; Remontet, Bossard et al. 2007).
Contents 2 of 17
The current study is motivated by our belief that, similar to crude survival, an accurate assessment of the
effects of continuous prognostic factors on relative survival requires simultaneous modelling and testing of
both time-dependent and non-loglinear effects. In particular, we expand the relative survival model originally
proposed by Esteve et al (Esteve, Benhamou et al. 1990) in a similar way Abrahamowicz and MacKenzie
(Abrahamowicz and MacKenzie 2007) recently expanded the Cox’s model for crude survival. In the flexible
‘product model’ developed by Abrahamowicz and MacKenzie (Abrahamowicz and MacKenzie 2007), the
effect of a continuous covariate on the logarithm of the hazard is modelled as a product of two ‘marginal’
functions, which represent, respectively, its time-dependent effect and non-linear relationship with the log-
2. Joint Flexible Modelling of Time-Dependent and Non-Loglinear Effects in Relative Survival
We propose a flexible extension of the parametric relative survival regression model proposed by Esteve et
al. (Esteve, Benhamou et al. 1990). Their additive-hazards model assumes that the observed hazard for all-
cause mortality, λt , at time t after diagnosis of an individual with age a at diagnosis and with a covariate
vector z, is the sum of two hazards:
λt (t z, a ) = λe (t + a z s ) + λc (t z ) (1)
Contents 3 of 17
The first component, λ e , represents the expected hazard function for all-causes mortality in the underlying
general population, stratified on a limited subset zs of covariates, typically age and sex, and is obtained from
relevant mortality statistics using external sources.
The second component, λ c , is the hazard function for disease-related mortality, which is estimated from the
data at hand. Disease-specific hazard, conditional on the covariate vector z is expressed as (Esteve,
Benhamou et al. 1990): p r
λc ( t z ) = exp ( ∑ β i z i ) ∑τ k I k (t ) (2)
th i =1 k =1
where τk is the baseline mortality hazard in the k time-segment for patients with z=0, Ik is an indicator
function (Ik=1 if tk-1 ≤ t < tk, and 0 otherwise), and βi (i=1,…,p) are the log hazard ratios corresponding to
each of the p covariates.
Model (2) carries the assumption of proportionality of hazards (PH), as βi are constant, i.e. independent of
time t. Moreover, for quantitative predictors, βi is multiplied by an un-transformed predictor value zi, which
implies log-linearity (LG), i.e. linearity of the predictor effect on the log hazard.
Our new model respects the assumption of the additivity of the hazards of (i) all-causes and (ii) disease-
specific mortality, as postulated by equation (1). The difference between our new model and the model
originally proposed by Esteve et al (Esteve, Benhamou et al. 1990) concerns equation (2). In particular, we
Contents 4 of 17
replace the parametric modelling of both baseline disease-specific hazard, and of covariates effects on this
hazard, by a more flexible approach. Firstly, similar to relative survival models proposed by Giorgi et al
(Giorgi, Abrahamowicz et al. 2003) and Remontet et al (Remontet, Bossard et al. 2007), to avoid clinically
implausible ‘jumps’ in the absolute risks, we assume the baseline hazard is a smooth function of the follow-
up time γ(t) rather than a step function as in (2). Secondly, when modelling the effects of continuous
covariates on the baseline log hazard, we simultaneously relax both the PH and the log-linearity assumptions.
Specifically, following the flexible product model proposed for crude survival by Abrahamowicz and
MacKenzie (Abrahamowicz and MacKenzie 2007), we model the logarithm of the hazard ratio associated
with a value zi of a (continuous) ith covariate, at time t, as a product of two functions: α i ( z i ) and β i (t ) .
Accordingly, we write the disease-specific hazard, conditional on the covariate vector z, as:
λc (t z ) = exp(γ (t )) * exp(∑ α i ( z i ) * β i (t )) (3)
As in (Abrahamowicz and MacKenzie 2007), α i ( z i ) in (3) represents a possibly non-linear dose-response
function (drf), i.e. describes how the log hazard ratio changes with increasing value of the ith covariate.
Because in (3), this function does not depend on t, the shape of the drf is assumed to remain unchanged over
time. However, the impact of these differences in the covariate values does vary over time because α i ( z i ) is
multiplied by the time-dependent function (tdf) β i (t ) . In other words, higher values of β i (t ) indicate that
at a corresponding time t the differences in the values of zi have bigger impact on the disease-specific hazard.
Contents 5 of 17
Similar to Abrahamowicz and MacKenzie (Abrahamowicz and MacKenzie 2007) and Giorgi et al (Giorgi,
Abrahamowicz et al. 2003), all three functions in (3) are estimated by low-dimension quadratic regression
splines, i.e. smooth piecewise quadratic polynomials, whose pieces join each other at pre-specified argument
values termed knots (Ramsay 1988). Using quadratic splines ensures the continuity of the estimated function
and its first derivative, while limiting the number of the estimated coefficients, which enhances statistical
power for testing non-linear and/or time-dependent effects (Abrahamowicz, MacKenzie et al. 1996).
Specifically, following (Abrahamowicz and MacKenzie 2007), we used 1 knot, i.e. a 2-pieces quadratic
spline to model the non-linear drf α i ( z i ) , and 2 knots, corresponding to a 3-pieces spline for both functions
of time: the time-dependent function βi(t) and the baseline log hazard γ(t). Finally, knots were positioned so
as to ensure optimal and equal data support in each of the resulting intervals (Ramsay 1988). This implied a
single knot at the sample median value of a continuous covariate for α i ( z i ) , and two knots at the terciles of
the observed distribution of un-censored event times for both β i (t ) and γ(t) (Abrahamowicz, Ciampi et al.
1992). Accordingly, in our model (3):
α i ( z i ) = ∑ α il * Bl ,3 ( z i ) (4a)
β i (t ) = ∑ β i , j * B j ,3 (t ) (4b)
γ (t ) = ∑ γ k * Bk ,3 (t ) (4c)
T k =1
Contents 6 of 17
where B.,3 are quadratic, i.e. order 3, B-splines (de Boor 1978).
Estimation of model (3) poses an un-identifiability problem because the two functions of the same covariate
α i ( z i ) and β i (t ) ‘share the scale’ (Abrahamowicz and MacKenzie 2007). Abrahamowicz and MacKenzie
(Abrahamowicz and MacKenzie 2007) suggested to circumvent this problem by imposing a ‘scale’
constraint on α i ( z i ) , so that the range of its estimated values is a priori constrained to 1. This implies a
constraint that a value of β i (t ) , at a given time t, represents the adjusted difference in log hazards between
the values of z i corresponding to, respectively, the highest and the lowest risk (Abrahamowicz and
MacKenzie 2007). However, in the joint estimation procedure employed by Abrahamowicz and MacKenzie,
implementation of the scale constraint required a complex trial-and-error procedure and caused occasional
non-convergence problems (Abrahamowicz and MacKenzie 2007).
To achieve an efficient full maximum likelihood estimation of model (3), we use an iterative alternating
conditional algorithm. Alternating conditional algorithms are especially useful for estimation of complex
models, involving different types of coefficients, some of which are subject to specific constraints that do not
apply to other estimable coefficients (Ramsay and Abrahamowicz 1989; Abrahamowicz and Ramsay 1992).
The basic idea is to partition the estimation into consecutive steps, each of which involves estimation of only
a subset of all coefficients of interest. Simulations indicated very good performance of the alternating
Contents 7 of 17
conditional estimation of complex regression spline-based models with various constraints (Ramsay and
Abrahamowicz 1989; Abrahamowicz and Ramsay 1992).
To estimate model (3), we implement a 3-step alternating conditional full maximum likelihood estimation
(mle) algorithm. The 3 steps involve flexible estimation of, respectively, (1) baseline log hazard function
γ(t), defined in (4c); (2) non-linear dose-response functions α i ( z i ) in (4a), for each continuous covariate
z i ; and (3) time-dependent effect β i (t ) in (4b). At each step, the estimation of respective coefficients is
conditional on the most recent estimates of all the coefficients of the two other functions. The iterations stop
after the total reduction in the negative log likelihood (LogL) relative to the previous iteration, i.e. the sum of
the changes in LogL attained at each of the three steps, is below 0.1.
We re-assess the role of prognostic factors for mortality among 3367 patients diagnosed with colon cancer.
Time 0 was the date of the cancer diagnosis. Time-to-event was defined as the time to death of any cause.
The cause of death was not recorded. There were 1593 deaths during 5 years of follow-up (52.7% censoring
Contents 8 of 17
Available prognostic factors included two continuous covariates: age at diagnosis and calendar period of
diagnosis (1976 to 2000), as well as gender, tumour location (right vs left colon), and cancer stage at
diagnosis, categorized into 4 stages (I-IV). Period of diagnosis was included among ‘prognostic factors’ to
investigate if survival of colon cancer patients did change between 1976 and 2000.
Parametric multivariable regression models, which impose conventional PH and log-linearity assumptions
included the Cox’s proportional hazards (PH) model (Cox 1972) and the relative survival model of Esteve et
al (Esteve, Benhamou et al. 1990). In all relative survival analyses, the expected hazard mortality ( λ e ) was
quantified based on published age-, gender- and year of death-specific mortality rates in the general
population of Côte d’Or, region in France, in which the 3367 cancer patients were identified. In both models
older age, male gender, and higher cancer stage at diagnosis were all associated with significantly higher
mortality. For period, the adjusted HR was significantly lower than 1.0 and suggested that with each
consecutive year between 1976 and 2000 the risk of death decreased by about 3%.
In our flexible multivariable ‘product model’ (3) for relative survival, for the two continuous covariates, age
at diagnosis and calendar period of diagnosis, we modelled both non-linear and time-dependent effects
(equations (4a) and (4b)). For each of the dummy variables for stages II-IV, we estimated their time-
dependent effects. In contrast, we a priori assumed that the effects of gender and cancer location were
Contents 9 of 17
constant over time and represented these effects by a single parameter (adjusted HR). Finally, the baseline
log hazard γ(t) was modelled by the regression spline expansion defined in (4c).
Table 1 summarizes the results of non-parametric tests. Each continuous variable shows very significant
violations of both log-linearity and PH assumption, and the latter is also rejected for the two more advanced
cancer stages (Table 1). Accordingly, we include all these effects in our final model.
Table 1: p-values for the non-parametric LRT tests of non-linear (NL) and
time-dependent (TD) effects in the full multivariable flexible model.
Variable LRT for Test of NL LRT for Test of TD
Age 0,0007 0,0001
Period of diagnosis 0,017 0,0018
Stage II 0,2715
Stage III 0,0001
Stage IV 0,0001
Figure 1 shows the adjusted spline estimates of time-dependent HR, relative to stage I, for cancer stages II-
IV, obtained from the full flexible multivariable model. For advanced stages III and IV, the relative risks
vary considerably over time, increasing steeply in the first year and then decreasing gradually. Comparison
between the estimates shown in Figure 1 and the Esteve et al’s model reported in the right part of Table 1
Contents 10 of 17
indicates that, by imposing the incorrect PH assumption, the latter model substantially under-estimates the
impact of high cancer stages on disease-specific mortality around 1-2 years after diagnosis.
Figure 1: Time-dependent effects of cancer stages, relative to stage I.
Contents 11 of 17
Figure 2 describes the adjusted effect of age at diagnosis on the log hazard of colon cancer-specific
mortality. The spline estimate of the dose-response function (drf) in Figure 2(a) is non-monotone, with the
lowest risk for patients diagnosed around 50 years, and higher risks for both very young and older subjects,
which explains why the log-linearity hypothesis was rejected (p=0.0007 in Table 2). However, the significant
time-dependent effect (tdf) in Figure 2(b), indicates that the impact of age at diagnosis is mostly limited to
the first 0.5 year after diagnosis. The 3D surface in Figure 2(c) shows how the log hazard (vertical axis)
changes with increasing age at diagnosis (horizontal axis) and follow-up time (the 3rd axis). Figure 2(d)
compares the drf’s for age corresponding to selected times after cancer diagnosis, i.e. shows the horizontal
cross-sections of the 3D surface at selected follow-up times. Whereas all four curves in Figure 2(d) replicate
the U-shaped drf from panel (a), the curve for 6 months after diagnosis shows much bigger differences in
risk than the four curves for the later phases of disease.
Contents 12 of 17
T Figure 2: Flexible modeling of the effects of age at cancer diagnosis.
Contents 13 of 17
Figure 3 focuses on the adjusted effect of the calendar period of cancer diagnosis. Even if excess mortality
generally decreased between 1976 and 2000, the drf curves in Figure 3 suggest two distinct calendar periods,
of improving survival (1976-1985 and 1995-2000).
T Figure 3: Non-linear effects of calendar year at diagnosis, at selected times since diagnosis.
Contents 14 of 17
In the context of ‘crude survival’, Abrahamowicz and MacKenzie (Abrahamowicz and MacKenzie 2007),
incorporate flexible modelling of both non-linear and time-dependent effects of continuous covariates. The
same paper demonstrates that accounting for both effects is essential to avoid biased estimates and/or
incorrect conclusions (Abrahamowicz and MacKenzie 2007). Therefore, we have developed a flexible
relative survival model, which generalizes the additive-hazards relative survival model of Esteve et al
(Esteve, Benhamou et al. 1990) by implementing the ‘product model’ approach proposed by Abrahamowicz
and MacKenzie (Abrahamowicz and MacKenzie 2007) to estimate the effects of continuous prognostic
factors on the hazard of disease-specific mortality. Flexible modelling of a continuous age effect revealed a
statistically significant non-monotonicity, with a clinically important decrease in disease-specific mortality
between 30 and 50 years of age at diagnosis (Figure 2). For the period of cancer diagnosis, Figure 3 shows
that clinically meaningful improvements in disease-specific survival occurred in two distinct periods: 1976-
1985 and 1995-2000. It is likely that the two periods reflect different changes in medical or surgical
Additional insights were obtained from flexible estimates of time-dependent effects. Age at diagnosis was
found to have very important impact on the risk of dying almost immediately after diagnosis but matter little
Contents 15 of 17
after the first year of follow-up. In contrast, higher cancer stage had little impact on mostly post-surgical
mortality in the first 6 months after diagnosis.
Both our relative survival model (3) and the crude survival model in (Abrahamowicz and MacKenzie 2007)
assume that the two functions of the same covariate α i ( z i ) and β i (t ) are multiplied by each other. This
implies that the shape of the flexible drf α i ( z i ) does not change over time, even if the strength of the
covariate impact may change, as reflected by the tdf estimate β i (t ) (Abrahamowicz and MacKenzie 2007).
Future research should develop efficient tests of the shape-invariance assumption, as well as reliable and
user-friendly regression diagnostic methods for complex relative survival models. The methods proposed by
Stare et al (Stare, Pohar et al. 2005) may be useful in this context.
We add another method to the growing toolbox for flexible modelling of relative survival (Giorgi,
Abrahamowicz et al. 2003; Lambert, Smith et al. 2005; Remontet, Bossard et al. 2007). Our model (3) can be
considered an extension of the spline-based time-dependent relative survival model developed by Giorgi et al
(Giorgi, Abrahamowicz et al. 2003). Limited comparisons between the two models confirmed that, similar to
crude survival (Abrahamowicz and MacKenzie 2007), accounting for the non-loglinear effect of a
continuous covariate may substantially change the estimate of its time-dependent effect. Indeed, in our
dataset the model of Giorgi et al did not detect any significant time-dependent effect of the period of
Contents 16 of 17
diagnosis (data not shown), in contrast to our results (Table 1). It would be interesting to explore if and how
inclusion of the possibly non-loglinear effects of continuous covariates affects the fractional polynomials
estimates of time dependent effects in multiplicative and additive relative survival models proposed by
Lambert et al (Lambert, Smith et al. 2005), and to compare it with the flexible spline-based relative survival
model of Remontet et al (Remontet, Bossard et al. 2007). In spite of some differences in the formulation of
the model, its estimation and inference, all these models enhance our understanding of disease progression,
and may ultimately improve clinical prognosis and management of various cancers.
 Abrahamowicz, M., A. Ciampi, et al. (1992). "Nonparametric density estimation for censored survival
data: regression-spline approach." The Canadian Journal of Statistics 20(2): 171-185.
 Abrahamowicz, M., T. MacKenzie, et al. (1996). "Time-dependent hazard ratio: modelling and
hypothesis testing with application in lupus nephritis " Journal of the American Statistical Association
 Abrahamowicz, M. and T. A. MacKenzie (2007). "Joint estimation of time-dependent and non-linear
effects of continuous covariates on survival." Statistics in Medicine 26(2): 392-408.
 Abrahamowicz, M. and J. O. Ramsay (1992). "Multicategorical spline model for time response theory."
Psychometrika 57(1): 5-27.
 Cox, D. R. (1972). "Regression models and life tables (with discussion)." Journal of the Royal
Statistical Society, Series B 34: 187-220.
T  de Boor, C. (1978). A Practical guide to Splines. New York.
Contents 17 of 17
 Dickman, P. W., A. Sloggett, et al. (2004). "Regression models for relative survival." Statistics in
Medicine 23(1): 51-64.
 Esteve, J., E. Benhamou, et al. (1990). "Relative survival and the estimation of net survival: elements
for further discussion." Statistics in Medicine 9(5): 529-38.
 Giorgi, R., M. Abrahamowicz, et al. (2003). "A relative survival regression model using B-spline
functions to model non-proportional hazards." Statistics in Medicine 22(17): 2767-84.
 Hakulinen, T. and L. Tenkanen (1987). "Regression analysis of survival rates." Journal of the Royal
Statistical Society, Series C 36: 309-317.
 Lambert, P. C., L. K. Smith, et al. (2005). "Additive and multiplicative covariate regression models for
relative survival incorporating fractional polynomials for time-dependent effects." Statistics in
Medicine 24(24): 3871-85.
 Ramsay, J. O. (1988). "Monotone regression splines in action (with discussion)." Statistical Science
 Ramsay, J. O. and M. Abrahamowicz (1989). "Binomial regression with monotone splines: A
psychometric approach." Journal of the American Statistical Association 84: 906-915.
 Remontet, L., N. Bossard, et al. (2007). "An overall strategy based on regression models to estimate
relative survival and model the effects of prognostic factors in cancer survival studies." Statistics in
Medicine 26(10): 2214-28.
 Stare, J., R. Henderson, et al. (2005). "An individual measure of relative survival." Journal of the Royal
Statistical Society, Series C 54: 115-126.
 Stare, J., M. Pohar, et al. (2005). "Goodness of fit of relative survival models." Stat Med 24(24): 3911-