Hosmer and Lemeshow type Goodness-of-Fit Statistics for the Cox Proportional Hazards Model

doi:10.1016/S0169-7161(03)23021-2

Handbook of Statistics

Volume 23, 2003, Pages 383-394

Publisher Summary

This chapter discusses goodness-of-fit tests for the Cox proportional hazards model that are based on ideas similar to the Hosmer and Lemeshow goodness-of-fit test for logistic regression. All of these tests can be derived by adding group-indicator variables to the model and testing the hypothesis that the coefficients of the group indicator variables are zero via the score test. The tests that can be derived in this way are called the “added variable tests.” Care needs to be taken when implementing these tests because some of them require the use of time-dependent group indicator variables. Information regarding the time-dependent nature of the tests is also provided along with examples.

Introduction

The Cox (1972) proportional hazards (PH) model has been an extremely popular regression model in the analysis of survival data over the last few decades. Even though a number of goodness-of-fit tests have been developed for the PH model, authors who use this model rarely compute these tests (Andersen, 1991; Concato et al., 1993). One reason might be that only a few of them can be calculated easily in statistical software packages.

We discuss goodness-of-fit tests for the Cox proportional hazards model that are based on ideas similar to the Hosmer and Lemeshow (1980, 2000) goodness-of-fit test for logistic regression. All of these tests can be derived by adding group indicator variables to the model and testing, via the score test, the hypothesis that the coefficients of the group indicator variables are zero. We will call the tests that can be derived in this way the added variable tests. The tests that we discuss were proposed by Moreau et al. (1985, 1986) and Grønnesby and Borgan (1996). Care needs to be taken when implementing these tests, since some of them require the use of time-dependent group indicator variables.
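The added variable construction can be written compactly as follows. This is only a sketch: the symbols $z$, $\gamma$, $G$, $U$ and $I$ are notation assumed here for illustration and are not taken from the chapter, while $x$, $\beta$ and $\lambda_0(t)$ denote the covariates, coefficients and baseline hazard of the PH model.

$$
\lambda(t, x, z) = \lambda_{0}(t)\exp(\beta' x + \gamma' z), \qquad H_{0}\colon \gamma = 0,
$$

where $z = (z_{1}, \ldots, z_{G-1})'$ holds indicator variables for $G-1$ of $G$ groups (possibly time-dependent, $z = z(t)$). Writing $U$ for the partial-likelihood score vector and $I$ for the corresponding information matrix, both evaluated at $\gamma = 0$ with $\beta$ estimated under $H_{0}$, the score statistic is

$$
U' I^{-1} U,
$$

which is compared with a chi-square distribution with $G-1$ degrees of freedom.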

In Section 2 we discuss the different tests. Section 3 provides information regarding the time-dependent nature of the tests. In Section 4 we provide examples. Details of proofs, as well as SAS and STATA code for the examples, can be found in Appendices A, B and C.

Section snippets

The Hosmer and Lemeshow type test statistics

We assume the typical right-censored survival data where we observe for each of $n$ individuals the time (denoted by $t$) from study entry to either event or censoring, whether an event occurred or whether the time was censored (denoted by $\delta$), and a vector of $p$ fixed covariates, $x = (x_{1}, \ldots, x_{p})'$. Under the PH model the hazard function takes the following form: $\lambda(t, x) = \lambda_{0}(t)\exp(\beta' x)$, where $\lambda_{0}(t)$ represents an unspecified baseline hazard function, and $\beta' = (\beta_{1}, \ldots, \beta_{p})$ a vector of $p$ coefficients. The component $\beta'$

Necessity for time-dependent indicator variables

An important aspect of the added variable versions of the Moreau et al. (1985) and Moreau et al. (1986) tests is that the indicator variables for the time intervals are time-dependent. We will use a small example and the Moreau et al. (1985) test to illustrate the time dependence. Assume we observe four non-censored observations denoted $t_{1} < t_{2} < t_{3} < t_{4}$ and also observe to which of two groups each observation belongs (denoted $x = 0, 1$). Consider two time intervals, with the first two
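The time dependence can be made concrete by splitting each subject's follow-up at the interval boundaries, so that an interval indicator can change value as time passes. The following is a minimal sketch in Python, chosen here only for illustration; it is not the chapter's SAS or STATA appendix code. The function names `split_follow_up` and `interval_indicators`, the cut point used in the usage lines, and the choice of the first interval as the reference are all hypothetical, and it is an assumption that the added variables are the group covariate multiplied by an interval indicator.

```python
def split_follow_up(time, event, cuts):
    """Split one subject's follow-up (0, time] at the given cut points.

    Returns (start, stop, event, interval) rows in counting-process form;
    `interval` is the 0-based index of the time interval the row covers.
    """
    rows = []
    bounds = [0.0] + sorted(cuts) + [float("inf")]
    for j in range(len(bounds) - 1):
        start, stop = bounds[j], bounds[j + 1]
        if time <= start:
            break  # follow-up ended before this interval began
        row_stop = min(time, stop)
        # The event is recorded only on the row where follow-up ends.
        row_event = event if time <= stop else 0
        rows.append((start, row_stop, row_event, j))
    return rows


def interval_indicators(x, interval, n_intervals):
    """Assumed form of the added variables: x times the indicator of each
    interval after the first (the first interval is the reference)."""
    return [x if interval == j else 0 for j in range(1, n_intervals)]


# Hypothetical subject in group x = 1 with an event at t = 5 and one cut at t = 3.
for start, stop, status, interval in split_follow_up(5.0, 1, [3.0]):
    print(start, stop, status, interval_indicators(1, interval, 2))
```

For this subject the added variable is 0 on the row covering the first interval and 1 on the row covering the second, even though the group membership `x` itself never changes. A fixed (time-independent) indicator could not represent that switch.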

Examples

The first example is based on the gastric cancer data presented by Stablein et al. (1981) (see also Moreau et al., 1985). Ninety cancer patients were either treated by chemotherapy or by both chemotherapy and radiotherapy. Like Moreau et al. (1985) we divide the time axis into four intervals such that each interval contains 18, 19, 18 and 19 deaths respectively. The Moreau et al. (1985) test in this case has 3 degrees of freedom with values of 10.21 (p=0.02) for the score statistic, 9.55 (p
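The p-value quoted for the score statistic can be checked against the chi-square distribution with 3 degrees of freedom. The sketch below uses only the Python standard library and the closed-form upper tail for 3 degrees of freedom; the function name `chi2_sf_3df` is hypothetical, and the closed form is specific to this number of degrees of freedom.

```python
import math


def chi2_sf_3df(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 3 df.

    Uses the closed form erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2),
    which holds only for 3 degrees of freedom.
    """
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


# Score statistic reported in the text for the gastric cancer example.
p_score = chi2_sf_3df(10.21)
print(round(p_score, 2))  # 0.02
```

The result agrees with the reported p = 0.02 for the score statistic of 10.21.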

Summary

While various goodness-of-fit tests have been developed to test the assumptions of the Cox proportional hazards model, only a few are readily available in existing statistical software packages. We discuss previously proposed goodness-of-fit tests for the Cox model, which are of the Hosmer–Lemeshow type. We present results that show that the tests can be calculated easily using existing statistical software packages. Care needs to be taken though when implementing some of these tests, since

References (18)

D.M. Stablein et al., Analysis of survival data with nonproportional hazard functions, Controlled Clinical Trials (1981)
P.K. Andersen, Survival analysis 1982–1991: The second decade of the proportional hazards regression model, Statist. Medicine (1991)
J. Concato et al., The risk of determining risk with multivariable models, Ann. Internal Medicine (1993)
D.R. Cox, Regression models and life-tables, J. Roy. Statist. Soc. Ser. B (1972)
J.K. Grønnesby et al., A method for checking regression models in survival analysis based on the risk score, Lifetime Data Anal. (1996)
D.W. Hosmer et al., Goodness-of-fit tests for the multiple logistic regression model, Comm. Statist. Theory Methods A (1980)
D.W. Hosmer et al., Applied Logistic Regression (2000)
D.W. Hosmer et al., Applied Survival Analysis: Regression Modeling of Time to Event Data (1999)
S. May et al., A simplified method of calculating an overall goodness-of-fit test for the Cox proportional hazards model, Lifetime Data Anal. (1998)

There are more references available in the full text version of this article.
