maths
Chapter 4
Modelling survival data
Often when analyzing survival data, we may wish to go beyond simply describing the survival process and instead seek greater understanding of factors associated with sur- vival. Often, supplementary information about patients will be available (either through design or opportunity), such as: age, sex, blood pressure, etc. Such characteristics may significantly be associated with the survival experience of an individual. In order to explore the relationship between survival times and risk factors (explanatory variables), an investigation based on statistical modelling can be used.
Remark 6 There are two broad reasons for modelling survival data. One reason is to understand the survival process, and determine what combination of potential ex- planatory variables that affect the risk of death, and therefore affect the form of the hazard function. For example, a treatment can directly affect the risk of death or other factors (such as demographics) may indirectly modify the effect of treatment on this risk. Another reason is for prediction, where the aim may be to forecast the hazard for a particular individual. This allows estimation of quantities such as the median survival time as a function of various explanatory variables, some of which can be ma- nipulated (such as treatment) and others which are intrinsic to the patient (such as demographic). Identifying the effect of ‘active’ covariates that can be manipulated pro- vide obvious benefits in devising individual treatment regimes, for example. Identifying the effect of intrinsic or ‘passive’ factors that cannot be manipulated can have other less obvious benefits. High risk individuals can be better targeted for awareness or screening campaigns. Providing access to appropriate treatment regimes may be given higher priority for such patients. In either case, proposing or constructing such a model can be valuable for designing collection of further data, to inform improved explanatory or predictive modelling. ♣
37
4.1 Modelling the hazard function
Through modelling, we can explore how the survival experience of a group of patients depends on one or more explanatory variables. The basic model for survival data considered in this section is the proportional hazards model or the Cox regression model. Although the model is based on the assumption of proportional hazards, it does not specify any particular form for the probability density function of the survival times. The model therefore can be considered semi-parametric.
4.1.1 A model for the comparison of two groups
Suppose that patients are randomly assigned a standard treatment or a new treatment, and let hS(t) and hN(t) be the hazards of death at time t for patients on the standard and new treatment, respectively. Under the assumption of proportional hazards, the hazard at time t for a patient on the new treatment is proportional to the hazard at the same t for a patient on the standard treatment. This can be expressed as
hN(t) = ψhS(t), (4.1)
for any non-negative value of t, where ψ ≥ 0 is a constant.
If ψ < 1, the hazard of death at t is smaller for an individual on the new drug, relative to an individual on the standard treatment. The new treatment is then an improvement on the standard.
Equation (4.1) can be expressed differently. Suppose that survival data are available on n individuals and denote the hazard function for individual i as hi(t) for i = 1, . . . , n. Also, let h0(t) denote the hazard function for an individual on the standard treatment. Then the hazard function for an individual on the new treatment is as before, ψh0(t). As ψ can’t be negative, set ψ = exp(β).
If xi is an indicator variable taking value one for the new treatment and zero for the standard treatment for i = 1, . . . , n, then
hi(t) = exp(xiβ)h0(t). (4.2)
This is the proportional hazards model for the comparison of two treatment groups.
Remark 7 Restructuring it in this way suggests how to include explanatory variables beyond an indicator of treatment. ♣
38
4.1.2 The general proportional hazards model
We now generalize the proportional hazards model to include p explanatory variables x = (x1, . . . , xp) recorded at time origin. Let h0(t) be the hazard function for an individual for whom the values of all explanatory variables are set to zero. This will be termed the baseline hazard. The hazard function for the ith individual is
hi(t) = ψ(xi)h0(t),
where ψ(xi) is a function of the values of the vector of explanatory variables for the ith individual. The function ψ(xi) can be interpreted as the hazard at time t for an individual whose vector of explanatory variable is xi, relative to the hazard for an individual for whom x = 0.
The general proportional hazards model then becomes
hi(t) = exp(x1iβ1 + . . . + xpiβp)h0(t). (4.3)
Since this model can be re-expressed in the form
log
{ hi(t)
h0(t)
}
= x1iβ1 + . . . + xpiβp,
the proportional hazards model may also be regarded as a linear model for the logarithm of the hazard ratio.
Remark 8 Note that we have made no assumptions concerning the actual form of the baseline hazard function h0(t). We will see later that it is possible to estimate β without making any such assumptions. ♣
Remark 9 This is similar to a generalized linear model with a log link to the mean, except that the intercept term has been fixed, and so will not be estimated. ♣
Thought 17 How would the model change if the explanatory variables were not recorded at time origin? ♠
4.2 The linear predictor
in a proportional hazards model
There are two types of explanatory variables, namely covariates and factors. A variate is a continuous variable such as blood pressure and a factor has specific levels such as gender. Prefixing variate with “co” to form the term covariate merely emphasizes that the variable is co-varying with the response.
39
4.2.1 Including a covariate
It is straightforward to include covariates (either alone or in combination) into a pro- portional hazards model. When modelling linear effects, each covariate appears in the model with a corresponding β coefficient. For example, consider two covariates x1 and x2. These can be included as follows
hi(t) = exp(x1iβ1 + x2iβ2)h0(t).
In such models, the baseline hazard function is the hazard for an individual for whom all the values included in the model take the value zero.
4.2.2 Including a factor
Suppose we wish to include a factor A with a levels into a proportional hazard model. Then each of the a levels may potentially have different effects on the hazard, denoted as α1, . . . ,αa. If A is the sole explanatory variable in the model, then the hazard function can then be expressed as hk(t) = exp(αk)h0(t), for k = 1, . . . , a. In this situation, the baseline hazard function will represent the effect when A is set to the value zero, and is more generally called a cornerpoint contrast. Therefore, we can arbitrarily set α1 = 0 which corresponds to taking the baseline hazard to the hazard for an individual for whom A is at the first level.
Remark 10 In practice, care should be taken to assign the first level of a factor, to facilitate interpretation of baseline hazards and baseline survival, and therefore hazards and survival more generally. ♣
In such a model, explanatory variables are coded as dummy or indicator variables. Table 4.1 shows how to code factor A. The term αk can be incorporated into the linear part of
Table 4.1: Survival models with covariates: coding a factor.
Level of A X2 X3 . . . Xa 1 0 0 . . . 0 2 1 0 . . . 0 3 0 1 . . . 0 ...
... ...
... ...
a 0 0 . . . 1
the model by including the a − 1 explanatory variables X2, X3, . . . , Xa with coefficients
40
α2,α3, . . . ,αa. When only linear effects of factor A are included, these effects are said to have a − 1 degrees of freedom.
4.2.3 Including an interaction
In general, if A and B are two factors, and the hazard of death depends on the combi- nation of levels of A and B, the A and B are said to interact. So for instance, the effect of A on death differs depending on the value of B (or vice versa). If A and B have a and b levels respectively, the term which represents an interaction between these two factors is denoted by (αβ)kℓ, for k = 1, . . . , a and ℓ = 1, . . . , b.
Remark 11 Typically when an interaction term is included into a model, then the corresponding main effects should also be present. This facilitates interpretation. More importantly it provides a basis for evaluating whether the additional complexity pro- vided by the interaction is useful. ♣
In order to include the term (αβ)kl, products of indicator variables associated with the main effects are calculated. For example, if A and B have 2 and 3 levels respectively, indicator variables U2 and V2, V3 are defined as in the following
Level ofA U2 1 0 2 1
Level ofB V2 V3 1 0 0 2 1 0 3 0 1
(4.4)
Let uk and vl be the value of Uk and Vl for a given individual, for k = 2 and l = 2, 3. Then the additional terms included in the model would be: (αβ)22u2v2 + (αβ)23u2v3. There are (a − 1)(b − 1) = 2 parameters (or degrees of freedom) associated with the interaction between A and B. Note that (αβ)kl is equal to zero whenever either A or B are at the first level.
4.3 Fitting the proportional hazards model
Fitting the proportional hazards model entails estimating the unknown β coefficients and possibly the baseline hazard function. These two components can be estimated separately. A typical strategy estimates the regression coefficients, β, first, and these are used to construct an estimate of the baseline hazard function.
In a classical approach, the regression coefficients in the proportional hazard model can be estimated using the method of maximum likelihood. To do this, we need the likelihood
41
function for the sample data. This is based on the joint probability distribution of the observed data, but is regarded as a function of the unknown parameters in the assumed model. These maximum likelihood estimates are therefore literally those values which maximize the likelihood function. Computationally, it it more convenient to maximize the logarithm of the likelihood function. Furthermore, approximations to the variance of the maximum likelihood estimates are often obtained using a Taylor expansion, focusing on terms up to the second derivatives of the log-likelihood function.
Remark 12 The maximum likelihood estimates (MLEs) chosen (or found) are those for which the data are most likely. These contrast with the Bayesian maximum a posteriori (MAP) estimates, which are the most likely given the observed data. Note the reversal in logic. To clarify, the MLEs are
max θ∈Θ
lik(θ; y) (4.5)
and the MAP is
max θ∈Θ
p(θ|y) = max θ∈Θ
lik(θ; y)p(θ) ∫ y lik(θ; y)p(θ)
. (4.6)
In some situations the MLE and MAP will be equivalent, such as in the limit as the sample size becomes suitably large, n → ∞, or when uniform priors (on the appropriate scale) have been used. For further reading on the interplay between Frequentist and Bayesian statistics, including this overlap, see Bayarri and Berger (2004).
It is important to be careful to avoid the conditional fallacy, a ubiquitous trap in logical reasoning, which in this context, leads to confusion between Frequentist confi- dence intervals and Bayesian credible intervals or between MLEs and MAP estimates (Low Choy and Wilson, 2009). ♣
Suppose we have survival data where all death times are unique (that is, there are no ‘ties’). Let n denote the number of individuals and n − r the number of censored observations. The r ordered death times are t(1) < . . . < t(r). The set of individuals who are at risk at time t(j) will be denoted by R(t(j)), so that R(t(j)) is the set of individuals who are alive and uncensored at a time just prior to t(j). The quantity R(t(j)) is called the risk set.
It can be shown that the likelihood function for the proportional hazard model given in equation (4.3) is
L(β) = r∏
j=1
exp(x(j)β) ∑
q∈R(t(j)) exp(xqβ)
, (4.7)
in which x(j) is the vector of covariates for the individual who dies at the jth ordered death time t(j). The summation in the denominator of this likelihood function is the
42
sum of the values of exp(xβ) over all individuals who are at risk at time t(j). Censored observations do not contribute to the numerator, only the denominator.
Alternatively, consider an indicator variable δi which is zero of the ith survival time ti, for i = 1, . . . , n, is right-censored, and unity otherwise. The above likelihood function can be expressed as
L(β) = n∏
i=1
[ exp(xiβ)
∑ q∈R(ti) exp(xqβ)
]δi , (4.8)
where R(ti) is the risk set at the time ti. The corresponding log-likelihood function is given by
log L(β) = n∑
i=1
δi
⎧ ⎨
⎩xiβ − log ∑
q∈R(ti)
exp(xqβ)
⎫ ⎬
⎭ . (4.9)
Maximizing this log-likelihood requires numerical methods, such as the Newton-Raphson procedure or some (global) search algorithm.
Thought 18 What was the conceptual ‘trick’ that was exploited to derive the log- likelihood for the proportional hazards? ♠
4.3.1 Likelihood function for proportional hazards
The basis of the argument used to construct the likelihood for the proportional hazards model is that intervals between death times convey no information about the effect of explanatory variables on the hazard of death. That is, these intervals provide no information for the estimation of the βs. We therefore consider the probability that the ith individual dies at some time t(j) conditional on t(j) being one of the observed set of r death times:
P(individual with characteristics xj dies at t(j)| one death at t(j)), (4.10)
or P(individual with characteristics xj dies at t(j))
P(one death at t(j)) . (4.11)
Since the death times are assumed to be independent, the denominator of this expression is the sum of the probabilities of death at time t(j) over all individuals who are at risk of death at that time. If these individuals are indexed by ℓ, with R(t(j)) denoting the set of individuals who are at risk at time t(j), the above expression becomes
P(individual with characteristics x(j) dies at t(j)) ∑
ℓ∈R(t(j)) P(individual ℓ dies at t(j))
.
43
The above probabilities of death at time t(j) are now replaced by probabilities of death in the interval (t(j), t(j) + δt), by dividing both the numerator and denominator by δt. Then,
P(individual with characteristics x(j) dies in (t(j), t(j) + δt))/δt ∑
ℓ∈R(t(j)) P(individual ℓ dies in (t(j), t(j) + δt))/δt
.
The limit of the above expression as δt → 0 is the ratio of the corresponding hazards of death at time t(j), that is,
Hazard of death at time t(j) for individual with characteristics x(j) ∑
ℓ∈R(t(j)) { Hazard of death at time t(j) for individual ℓ}
.
If it is the ith individual who dies at t(j), the hazard function in the numerator of this expression can be written hi(t(j)). Similarly, the denominator is the sum of the hazards of death at time t(j) over all individuals who are at risk of death at this time. This is the sum of the values hℓ(t(j)) over those individuals in the risk set at time t(j), R(t(j)). Consequently, the conditional probability becomes
hi(t(j)) ∑
ℓ∈R(t(j)) hℓ(t(j))
.
We note that the baseline hazard functions cancel each other out, which is the main benefit of the proportional hazards assumption, and we are left with
exp(x(j)β) ∑
ℓ∈R(t(j)) exp(xℓβ)
.
The likelihood of the (independent) observations is then formed by taking the product of these conditional probabilities over the r death times.
This likelihood is sometimes referred to as the partial likelihood function as it does not make direct use of the actual censored and uncensored survival times. Note that standard results used in maximum likelihood estimation carry over without modification to maximum partial likelihood estimation.
4.3.2 Treatment of ties
Many survival studies will give rise to more than a single death at each death time. In addition to this, there might also be one or more censored observations at a death time. If this situation arises, the censoring is assumed to occur after all the deaths. This assumption eliminates any ambiguity concerning which individuals should go into the risk sets. Therefore, we only need to consider how tied survival times can be handled in fitting the proportional hazards model.
44
Let sj be the vector of sums of the p covariates for those individuals who die at the jth death time t(j), for j = 1, . . . , r. If there are dj deaths at t(j), the hth element of sj is
shj = ∑dj
k=1 xhjk, where xhjk is the value of the hth explanatory variable for the k of dj individuals (h = 1, . . . , p, k = 1, . . . , dj). An approximation to the likelihood function is
r∏
j=1
exp(sjβ) [∑
l∈R(t(j)) exp(xlβ)
]dj . (4.12)
This likelihood is straightforward to compute, and is an adequate approximation when the number of tied observations at any one death time is not too large. This is the method that is usually implemented in statistical software packages.
4.4 Confidence intervals and hypothesis tests for
the effects of covariates
If a 100(1 − α)% confidence interval for β does not include zero, then this shows that zero is a highly unlikely value; and suggests that the value of β is non-zero.
Remark 13 Note that this does not discriminate between situations where a few al- ternative values are better supported by the data, and a situation where many values (zero or non-zero) may have equally low support. ♣
When conducting significance tests to determine if there is any such evidence, it is important to recognize that an hypothesis such as H0 : βk = 0 is being tested in the presence of all other terms in the model. For example, suppose that a model contains three explanatory variables X1, X2 and X3 with coefficients β1,β2 and β3. The test statistic β̂2/SE(β̂2) is then used to test the null hypothesis that β2 = 0 in the presence of β1 and β3. If there was no evidence to reject this hypothesis, we would conclude that X2 was not needed in the model in the presence of X1 and X3.
In general, the individual estimates β̂1, β̂2, . . . in a proportional hazards model are not all independent of one another. This means that the results of testing separate hypotheses about the β parameters in a model may not be easy to interpret. For example, consider the situation where there are three explanatory variables and a model comprising solely main effects is being tested, i.e. linear effects of each of the three variables but no interactions. If β̂1 and β̂2 were not found to be significantly different from zero, we could not conclude that only X3 need be included in the model. This is because the coefficient for X1, for example, could change when X2 was dropped from the model, and vice versa. This would certainly happen if X1 and X2 were correlated.
45
Because of this difficulty, we require alternative methods for comparing different pro- portional hazards models. Methods described in the next subsection are much more satisfactory than the hypothesis testing approach discussed above. In particular hy- pothesis tests may form the key output provided by a computer-driven analysis, based on fitting a single model to survival data. These results need to be viewed with caution, and should only be considered as the starting point of interpreting model fit.
4.4.1 Standard error and confidence interval for hazard ratios
We have seen that the parameter ψ = exp(β) is the hazard ratio when comparing two groups of survival times. The corresponding estimate of the hazard ratio is ψ̂, and the standard error of ψ̂ can be obtained. The approximate variance of ψ̂, a function of β̂, is
{exp(β̂)}2Var [ β̂ ] ,
so the standard error of ψ̂ is given by
SE [ ψ̂ ] = ψ̂SE
[ β̂ ] . (4.13)
A 100(1 − α)% confidence interval for the true hazard ratio ψ can be found by simply exponentiating the confidence limits for β. An interval estimate obtained in this way is preferable to one found using a normal approximation, i.e. ψ̂ ± zα/2SE(ψ̂). This is because the distribution of the logarithm of the estimated hazard ratio will be more closely approximated by a normal distribution than that of the hazard ratio itself.
Example 11 B: Prognosis for women with breast cancer. The variable which indicates the result of the staining process can be regarded as a factor with two levels, that is, X = 0 corresponds to negative staining and X = 1 to positive staining. Note that there are two women who die at 26 months (tied survival times). Under the proportional hazards model, the hazard of death at time t for the ith women, for whom the value of the indicator variable is xi, is
hi(t) = exp(βxi)h0(t).
The baseline hazard is then the hazard function for a women with a negatively stained tumour. Maximizing the likelihood in equation (4.12) yields a maximum likelihood estimate (MLE) for β that is β̂ = 0.908. The standard error of this estimate is
SE [ β̂ ] = 0.501. Then, the estimated value of the hazard ratio (ψ) is exp(0.908) = 2.48.
Since this is greater than one, we conclude that a woman who has a positively stained
46
tumour may have a greater risk of death at any given time than a comparable women whose tumour is negatively stained.
As shown previously, the standard error of the hazard ratio can be found from the standard error of β̂. That is, SE
[ ψ̂ ] = 2.480 × 0.501 = 1.242. Further, a 95%
confidence interval for the hazard ratio is exp ({ β̂ − 1.96SE
[ β̂ ] , β̂ + 1.96SE
[ β̂ ]})
or
exp([−0.074, 1.890]) = (0.93, 6.62). ♦
4.5 Comparing alternative models
In a modelling approach to the analysis of survival data, a model is developed for the dependence of the hazard function on one or more explanatory variables. We will refer to the model with no explanatory variables in the linear component as the null model.
Suppose that two models are contemplated for a particular data set, say Model (1) and Model (2), where Model (1) contains a subset of the terms in Model (2). Model (1) is then said to be parametrically nested within Model (2). More precisely, suppose that the p explanatory variables X1, . . . , Xp are fitted in Model (1), so that the hazard function under this model can be written exp{β1x1 + . . . + βpxp}h0(t). Also suppose that the p + q explanatory variables X1, . . . , Xp, Xp+1, . . . , Xp+q are fitted in Model (2), so that the hazard function under this model can be written exp{β1x1 + . . . + βpxp + βp+1xp+1 + βp+qxp+q}h0(t). Because Model (2) contains more parameters than Model (1), Model (2) will fit the data better. The statistical problem is then to determine whether the additional q terms significantly improve the fit of the model. If not, Model (1) would be deemed adequate, as well as more parsimonious.
4.5.1 The deviance −2 log L̂
The likelihood function summarizes the information that the data contain about the unknown parameters in a given model. The value of this function is therefore a useful summary statistic when the parameters are replaced by their maximum likelihood es- timates. For a given set of data, the larger the value of the maximized likelihood, the more likely the data are for the model (and corresponding parameter values). In the absence of any other information, this is often interpreted to reflect better agreement between the model and the observed data.
As will be seen, it is more convenient to use the deviance, which is minus twice the logarithm of the maximized likelihood, to compare alternative models:
D̂ = −2 log L̂. (4.14)
47
Since the likelihood function is a product of conditional probabilities, L̂ will always be less than unity. Therefore, the deviance will always be positive, and for a given data set, the smaller the deviance, the more concentrated the likelihood is at those parameter values, which (when the likelihood is well-behaved, e.g. concave) suggests a better model. We note that the deviance is only useful when making comparisons between fitted models on the same data set.
Remark 14 This is because the deviance is a transformation of the likelihood function L = lik(θ; y), which examines how the likelihood changes across parameter values, and is a function of the observed data. Note the subtle change of emphasis compared with the joint probabiility density of the data p(y|θ), which examines how the probability of different datasets vary, but is conditional on a fixed value of the parameter. The key difference is that the probability density satisfies the Kolmogorov axiom for a probability, and integrates to one across all possible datasets. However the likelihood does not integrate to one over the parameter space. Note that its normalizing constant will change with every θ value. ♣
For Model (1) and Model (2), let the value of the maximized log-likelihood function for each model be denoted L̂(1) and L̂(2), respectively. The two models can then be compared on the basis of the difference between D(1) = −2 log L̂(1) and D(2) = −2 log L̂(2). A large difference would lead to the conclusion that the q variates in Model (2) that are additional to those in Model (1) do improve the adequacy of the model. How larger difference will depend on the size of q. In particular, the difference will reflect the combined effect of adding the variables Xp+1, . . . , Xp+q to a model that already contains X1, . . . , Xp. This is said to be the change in the value of −2 log L̂ due to fitting Xp+1, . . . , Xp+q, adjusted for X1, . . . , Xp. The statistic for the difference can be written
−2 log{L̂(1)/L̂(2)},
and this is the log-likelihood statistic for testing the null hypothesis that the q param- eters βp+1, . . . ,βp+q in Model (2) are all zero. It can be shown that this statistic has a chi-squared distribution, under the null hypothesis that the coefficients of the addi- tional variables are zero. The number of degrees of freedom is equal to the difference between the number of parameters in each model, here, q.
Example 12 B: Prognosis for women with breast cancer. On fitting the pro- portional hazards model with no explanatory variables (the null model), the deviance is 173.968. If an indicator variable is included to represent positive or negative stain- ing, the deviance decreases to 170.096. The difference between these values is 3.872. Comparing these using the log-likelihood ratio statistic, which follows a chi-squared dis- tribution on one degree of freedom, we find this difference significant at the 5% level
48
(p-value = 0.049). We may therefore conclude that there is evidence that the hazard functions differ between the two groups of women. ♦
Example 13 F: Treatment of hypernephroma. Survival data were obtained on 36 patients with a malignant kidney tumour or hypernephroma (Table 4.2). As well as the appropriate course of treatment, in addition, a nephrectomy (the surgical removal of the kidney) was carried out on some of the patients. Of interest was whether the survival time of patients depended on their age, at the time of diagnosis, and on whether they received a nephrectomy. Note how the variable AGE is classified.
Suppose that the effect due to the jth age-group is denoted by αj, for j = 1, 2, 3 and that due to nephrectomy status is denoted by vk for k = 1, 2. Then there are five possible models
Model (1): hi(t) = h0(t),
Model (2): hi(t) = exp{αj}h0(t),
Model (3): hi(t) = exp{vk}h0(t),
Model (4): hi(t) = exp{αj + vk}h0(t),
Model (5): hi(t) = exp{αj + vk + (αv)jk}h0(t).
To fit the term αj, two indicator variables A2 and A3 are defined as below.
Age-group A2 A3 < 60 0 0
60 − 70 1 0 > 70 0 1
The term vk is fitted by defining a variable N, equal to zero when no nephrectomy has been performed and unity otherwise. Therefore, the baseline hazard function will correspond to an individual in the youngest age-group who has not had a nephrectomy.
The first step in comparing these different models is to determine if there is an inter- action between nephrectomy status and age-group. To do this, we compare Model (4) with Model (5), with log-likelihood ratio statistic 165.508−162.479 = 3.029 on 2 degrees of freedom. This is not significant (p-value = 0.220), so we conclude that is there no evidence for an interaction between age-group and nephrectomy status.
We now need to determine whether the hazard function is related to neither, one or both of the factors age-group and nephrectomy status. The change in the deviance on including the term αj in the model that contains vk is 170.247−165.508 = 4.739 on two degrees of freedom. The p-value for this difference is 0.094 reflecting some evidence for the inclusion of αj. The change in the deviance when vk is added to the model with αj is 172.172 − 165.508 = 6.664 on one degree of freedom, with corresponding p-value of 0.01. This suggests that both terms should probably be included into the model. ♦
49
Table 4.2: Example F: Survival times for treatment of hypernephroma.
No nephrectomy Nephrectomy < 60 60 − 70 > 70 < 60 60 − 70 > 70
9 15 12 104* 108* 10 6 8 9 26 9 21 17 56 14 18
35 115 6 52 52 68 5* 77* 18 84 36 8 9 38 72 36 48 26 108 5
There are two further steps in the modelling approach when analysing survival data. First, we will need to critically examine how well the model fits the observed data in order to ensure that the fitted proportional hazards model is indeed appropriate. Second, we will need to interpret the model, in order to quantify the effect that the explanatory variables have on the hazard function (Section 4.7).
Thought 19 What additional information could affect model choice? ♠
4.6 Strategy for model selection
In practice, a hazard function will not relate strongly to a unique combination of vari- ables. Instead there are likely to be a number of equally good models, rather than a single ‘best’ model. For this reason, a wide range of possible models should be consid- ered.
50
Table 4.3: Example F: Values of −2 log L̂ for nephrectomy example.
Terms in model Variables in model −2 log L̂ null model none 177.667 αj A2, A3 172.172 vk N 170.247 αj + vk A2, A3, N 165.508 αj + vk + (αv)jk A2, A3, N, A2N, A3N 162.479
4.6.1 Variable selection procedures
As well as comparing values of the deviance, comparisons between a number of possible models which need not necessarily be nested, can also be made on the basis of Akaike’s Information Criterion
AIC = D + αq
where q is the number of unknown β parameters in the model, and α is a predetermined constant. The value of α is generally taken to be between 2 and 6 (typically 2, but often 3). Smaller values of this statistic correspond to ‘better’ models, in the sense that they are parsimonious and the data are likely under the model. The second term in the AIC penalizes models with a larger number of parameters.
When the number of variables are relatively large, the number of possible models can be computationally expensive. In such situations, some automatic routines can be used: forward selection, backwards elimination and their combination, stepwise selection. In forward selection, variables are added to the model one at a time. At each stage of the routine, the variable added is the one which gives the largest decrease in the value of −2 log L̂ on its inclusion. The routine ends when the next candidate for inclusion in the model does not reduce the value of −2 log L̂ by more than desired (predetermined).
In backward elimination, a model that contains the largest number of variables under consideration is first fitted with variables then being excluded one at a time. At each stage the variable omitted is the one that increases the value of −2 log L̂ the small- est amount on its exclusion. The routine ends when the next candidate for deletion increases the value of −2 log L̂ by more than a prespecified amount. The stepwise pro- cedure operates in the same way as forward selection. However, a variable that has been included in the model can be considered for exclusion at a later stage.
Remark 15 While these routines are easy to implement and readily available in soft- ware packages, they do have some disadvantages. Most disadvantages arise from their
51
reliance on a single index of overall model fit, such as AIC. It is recommended that any model proposed by these variable selection methods is assessed carefully, to ensure that the parameters are indeed well estimated, and to identify any violation of model assumptions, by checking residuals and influence diagnostics. Because of the ‘one-step- look-ahead’ strategy of searching model space, the three routines may explore different models, and can often lead to different models being identified as the ‘best’. Hence they generally do not take any account of hierarchy in models. They are also highly sensitive to the stopping rule that is used, to determine when a ‘sufficient’ level of fit has been obtained. ♣
Instead of using automatic variable selection procedures, a general strategy for model selection is recommended. This adopts a forward selection approach, but involves more exploration at each step compared to the automated procedures.
1. Fit all variable one at a time to determine if each variable adds something on their own when compared to the null model.
2. The important variables from step 1 are then fitted together. In the presence of certain variables, others may cease to be important. These variables are subse- quently discarded.
3. Variables which were not important on their own, may become important in the presence of others. Any that are found to be important are included.
As a note, this is where interactions and higher-order terms would be considered if they are considered potentially important. In some cases, variables may be grouped into themes, if it is believed that just one of several alternatives will be necessary in the final model. This not only reduces the number of potential models, but also indicates a path for investigating options for the set of variables.
4. A final check is made to ensure that no term in the model can be omitted without significantly increasing the value of the deviance, and that no term not included significantly reduces the deviance.
In addition, if there is strong evidence (for example in the literature, on the same or related topic) that a group of variables ought to be included in the survival model, then these procedures can be implemented using this ‘alternative’ model as the basis.
Remark 16 In some pioneering studies, the number of variables of potential interest may be quite high in comparison to the number of observations. One example is the large p, small n problem that is prevalent in microassay experiments. In these cases a model-based approach to variable selection can be implemented within the Bayesian paradigm, but these tend to have extremely high computational overheads, as well as the obvious differences in inference and philosophy. ♣
52
Table 4.4: Deviance for various proportional hazards models for the myeloma study.
Variables in the model Deviance none 215.940 AGE 215.817 SEX 215.906 BUN 207.453 CA 215.494 HB 211.068 PC 215.875 BJ 213.890 HB +BUN 202.938 HB +BJ 209.829 BUN + BJ 203.641 BUN + HB + BJ 200.503 HB + BUN + AGE 202.669 HB + BUN + SEX 202.553 HB + BUN + CA 202.937 HB + BUN + PC 202.773
Example 14 C: Survival of multiple myeloma patients In this example, transfor- mations of the original variables and interactions between them will not be considered. We will further assume that there are no medical grounds for including particular vari- ables in a model. A summary of the deviance for all models that are to be considered is provided in Table 4.4.
1. Upon fitting each of the variables on their own, BUN leads to the largest reduction in the deviance. This reduction of 215.940−207.453 = 8.487 has a p-value of 0.004 when compared to a chi-squared distribution with 1 df. The variable HB also leads to a significant reduction in model error with a p-value of 0.027. BJ shows some potential with a p-value of 0.152, indicating that it also should be retained.
2. BUN, HB and BJ are now included in one model. When BUN is omitted, the deviance increases by 9.326. When HB is omitted the increase is 3.138 and when BJ is omitted the increase is 2.435. When compared to a chi-squared distribution with 1 df, this suggests that BJ is not needed whilst BUN and HB should be retained.
3. Excluding either BUN or HB from the model results in an increase of the value of −2 log L̂ by 8.130 and 4.515, respectively, suggesting that both should therefore retained.
53
4. The remaining variables AGE, SEX, CA and PC are now checked, for significant contribution in the presence of BUN and HB. The largest reduction in deviance achieved by any of these four variables is 0.5, suggesting that none provide sub- stantial additional information.
Therefore, we conclude that the most satisfactory proportional hazards model contains BUN and HB. ♦
We now outline the modelling procedure where there are variables of primary impor- tance, such as a treatment effect.
1. Consider all prognosis variables (ignoring the treatment effect). Follow the above four step procedure on these variables alone.
2. The treatment effect is then included into the model. In this way, any differences between the two groups that arise as a result of differences between the distri- butions of the prognostic variables in each treatment group are not attributed to the treatment.
3. If interactions between treatment and other explanatory variables has not been discounted, they must be considered before the treatment effect can be inter- preted.
Thought 20 In your experience of different statistical models and/or applications, how has variable selection been carried out? ♠
4.7 Interpretation of parameter estimates
When the proportional hazards model is used in the analysis of survival data, the coefficients in the model can be interpreted as the logarithm of the ratio of the hazard of death when compared to the baseline hazard.
4.7.1 Models with a variate
Suppose that a proportional hazards model contains a single continuous variable X, so that the hazard function for the ith of n individuals, for whom X takes value xi, is hi(t) = exp(βxi)h0(t). The coefficient of xi in this model can then be interpreted as the logarithm of a hazard ratio. Now consider the ratio of the hazard of death for an
54
individual for whom the value x + 1 is recorded on X, relative to one for whom the value x is obtained:
exp{(x + 1)β}
exp{xβ} = exp(β),
and so
β̂ in the fitted proportional hazards model is the estimated change in the logarithm of the hazard ratio when the value of X is increased by one unit.
Extending this,
If X is increased by r units to (x+r), then the log-hazard ratio will increase by rβ̂, and the corresponding estimate of the hazard ratio is exp(rβ̂).
Thus, when a continuous variable X is included in a proportional hazards model, the hazard ratio does not depend on the actual value of X, just the amount of change in X, here r. Hence, for example, if X refers to the age of an individual, the hazard of an individual aged 70, relative to one aged 65, would be the same as that for an individual aged 20, relative to one aged 15. This is a direct result from fitting X as a linear term in the model. If there is doubt about the assumption of linearity, then the continuous variable X can be replaced by a factor whose levels correspond to different sets of values.
4.7.2 Models with a factor
Now consider X as a factor, with m levels and corresponding parameters αj, for j = 1, . . . , m. At baseline, α1 = 0. The hazard function for an individual in the jth group (j = 2, . . . , m), relative to an individual in the first group is hj(t) = exp(αj)h0(t). Consequently, the parameter αj is the logarithm of the relative hazard, that is
αj = log hj(t)/h0(t) = log hj(t) − log h0(t).
Example 15 F: Treatment of hypernephroma If we focus on patients for whom a nephrectomy has been performed, the survival times can be classified according to their age-group. A proportional hazards model can then be defined using two indicator variables, as previously shown. The deviance for the null model is 128.901, and when the term for age-group αj, for j = 2, 3, is included, the deviance reduces to 122.501.
55
This reduction is significant at the 5% level, and so we conclude that the hazard function does depend on which age-group the patient is in.
The estimates of the coefficients for the indicator variables are α̂2 = −0.065 and α̂3 = 1.824 with standard errors SE[α̂2] = 0.498 and SE[α̂3] = 0.682, respectively. Therefore, the hazard ratio for a patient aged 60 − 70, relative to one aged less than 60, is exp{−0.065} = 0.94, while that for a patient whose age is greater than 70, relative to one aged less than 60, is exp{1.824} = 6.20. So we can conclude that the hazard of death at any given time is greatest for patients who are older than 70, although that there is little difference in the hazard functions for patients in the other two groups. ♦
4.7.3 Models with combinations of terms
So far, we have only considered the interpretation of parameter estimates in proportional hazards models that contain a single term. When a model contains more than one variable, the parameter estimate associated with a particular effect is said to be adjusted for the other variables in the model, and so are the log-hazard ratios.
Example 16 D: Comparison of two treatments for prostatic cancer The most important variables in this example were found to be size of tumour (SIZE) and the Gleason index (INDEX). With the treatment variable also included, the hazard function for the ith individual can be expressed as
hi(t) = exp{β1SIZEi + β2INDEXi + β3TREATi}h0(t),
Parameter estimates and the corresponding standard errors were estimated via maxi- mum likelihood (Table 4.5). Since the active treatment was coded as 1 while the placebo was coded as 0, the estimated log-hazard ratio for an individual on the active treatment, relative to an individual on the placebo, with the same values of SIZE and INDEX, was β̂3 = −1.113. Consequently, the estimated hazard is exp −1.113 = 0.329. This is irrespective of the actual values of SIZE and INDEX.
For a comparison, if a model that only contains TREAT is fitted, the estimated coeffi- cient of TREAT is −1.978. The estimated hazard ratio for an individual on the active treatment relative to one on placebo, unadjusted for SIZE and INDEX, is exp −1.978 = 0.14. Thus, unless proper account is taken of the effect of size of tumour and index of tumour grade, the extent of the treatment effect is overestimated. ♦
Thought 21 How would you convince a client to examine more than one proportional hazards model? ♠
56
Table 4.5: Example D: Estimated coefficients and standard errors for prostatic cancer study.
Variable β̂ SE [ β̂ ]
SIZE 0.083 0.048 INDEX 0.710 0.338 TREAT -1.113 1.203
4.8 Estimating the hazard and survivor functions
We have only considered the estimation of the β parameters in the linear component of a proportional hazards model. Once a suitable model for a set of survival data has been identified, the hazard function, and the corresponding survivor function, can also be estimated.
Suppose the hazard function for the ith individual in a study is given by
ĥi = exp{xiβ̂}ĥ0(t), (4.15)
for i = 1, . . . , n and xi define the covariates for the ith individual.
An estimate of the baseline hazard function was derived by Kalbfleisch and Prentice (1973) using an approach based on the method of maximum likelihood. As this deriva- tion is quite complex, it will not be discussed in detail here. The estimated baseline hazard function at time t(j) is given by
ĥ0(t(j)) = 1 − ξ̂j, (4.16)
where ξ̂j is the solution to the following equality
∑
l∈D(t(j))
exp(xlβ̂)
1 − ξ̂ exp(xlβ̂) j
= ∑
l∈R(t(j))
exp(xlβ̂), (4.17)
for j = 1, . . . , r. In equation (4.17), D(t(j)) is the set of all dj individuals who die at the jth death time t(j).
Thought 22 What computational techniques could be used to solve this equality? ♠
Example 17 F: Treatment of hypernephroma The estimated hazard function was found to depend upon the age-group and nephrectomy status. The estimated hazard function for the ith patient was
ĥi(t) = exp{0.013A2i + 1.342A3i − 1.411Ni}ĥ0(t).
57
Table 4.6: Estimated baseline cumulative hazard function and baseline survivor function for treatment of hypernephroma.
Time interval Ĥ0(t) Ŝ0(t) 0- 0.0000 1.0000 5- 0.0507 0.9505 6- 0.1548 0.8566 8- 0.2704 0.7631 9- 0.5199 0.5946 10- 0.5933 0.5525 12 0.6722 0.5106 14- 0.7851 0.4561 15- 0.9013 0.4060 17- 1.0329 0.3560 18- 1.3368 0.2627 21- 1.5218 0.2183
Time interval Ĥ0(t) Ŝ0(t) 26- 1.9760 0.1386 35- 2.2316 0.1074 36- 2.7766 0.0622 38- 3.0913 0.0454 48- 3.4320 0.0323 52- 4.1753 0.0154 56- 4.6298 0.0098 68- 5.1409 0.0059 72- 5.7248 0.0033 84- 6.5410 0.0014 108- 7.8969 0.0004 115- 11.9478 0.0000
Table 4.6 shows the estimated baseline cumulative baseline hazard and survivor functions for these survival times. The estimates of the baseline hazard function only apply at the death times of the patients. However, we make the assumption of a constant hazard in each time interval.
To estimate the median survival time, we need to find the smallest observed survival time for which the estimated survivor function is less than 0.5. From Table 4.6, the estimated median survival time for patients aged less than 60 who have not had a nephrectomy is 14 months. Furthermore, the individual survivor function, for the ith individual, can be estimated from the baseline survivor function and part of the estimated hazard function:
Ŝi(t) = [Ŝ0(t)] exp{0.013A2i+1.342A3i−1.411Ni}.
For an individual aged less than 60 who has had a nephrectomy, the estimated survivor function is [Ŝ0(t)]
exp −1.411. This is plotted in Figure 4.1 along with patients who have not have a nephrectomy. From the figure, the probability of surviving beyond any given time is greater for those who have had a nephrectomy, suggesting that a nephrectomy improves the prognosis for patients with nypernephroma. ♦
The above estimate for the probability that a patient, with particular characteristics, survives beyond any given time can be derived as follows:
− log Ŝ(t) = exp(xiβ̂)Ĥ0(t) (4.18)
log Ŝ(t) = exp(xiβ̂) log exp(−Ĥ0(t))
= log exp(−Ĥ0(t)) exp(xiβ̂)
58
0 20 40 60 80 100
0. 0
0. 2
0. 4
0. 6
0. 8
1. 0
Survival time
E st
im at
ed s
ur vi
vo r f
un ct
io n
Figure 4.1: Example F: Estimated survivor functions for patients aged less than 60 with (−) and without (−−) a nephrectomy.
so for t(k) ≤ t < t(k+1), and k = 1, . . . , r − 1, we have:
Ŝi(t) = [exp(−Ĥ0(t))] exp(xiβ̂)
= [Ŝ0(t)] exp(xiβ̂). (4.19)
59
60