A cautionary case study of approaches to the treatment of missing data

  • Original Article
  • Statistical Methods and Applications

Abstract

This article presents findings from a case study of different approaches to the treatment of missing data. Simulations based on data from the Los Angeles Mammography Promotion in Churches Program (LAMP) led the authors to the following cautionary conclusions about the treatment of missing data: (1) Automated selection of the imputation model in the use of full Bayesian multiple imputation can lead to unexpected bias in coefficients of substantive models. (2) Under conditions that occur in actual data, casewise deletion can perform less well than we were led to expect by the existing literature. (3) Relatively unsophisticated imputations, such as mean imputation and conditional mean imputation, performed better than the technical literature led us to expect. (4) To underscore points (1), (2), and (3), the article concludes that imputation models are substantive models, and require the same caution with respect to specificity and calculability.
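To make the simpler approaches named in the abstract concrete, here is a minimal sketch that applies casewise deletion, mean imputation, and conditional mean imputation to one simulated data set. It is not the authors' simulation design and does not use the LAMP data: the linear outcome model, the coefficient values, the missingness mechanism, and all function names are assumptions chosen only for illustration.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ 1 + X (intercept first)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def simulate(n=5000, seed=0):
    """Hypothetical data: y depends on x1 and x2; x2 is partly missing."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)
    y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
    # Assumed mechanism: the chance that x2 is missing depends only on
    # the always-observed x1 (missing at random).
    p_miss = 1.0 / (1.0 + np.exp(-(x1 - 0.5)))
    x2_obs = np.where(rng.uniform(size=n) < p_miss, np.nan, x2)
    return x1, x2_obs, y

def casewise_deletion(x1, x2, y):
    """Fit the substantive model on complete cases only."""
    keep = ~np.isnan(x2)
    return ols(np.column_stack([x1[keep], x2[keep]]), y[keep])

def mean_imputation(x1, x2, y):
    """Replace each missing x2 with the mean of the observed x2."""
    filled = np.where(np.isnan(x2), np.nanmean(x2), x2)
    return ols(np.column_stack([x1, filled]), y)

def conditional_mean_imputation(x1, x2, y):
    """Replace each missing x2 with its prediction from x1."""
    obs = ~np.isnan(x2)
    a, b = ols(x1[obs], x2[obs])  # imputation model: x2 ~ 1 + x1
    filled = np.where(obs, x2, a + b * x1)
    return ols(np.column_stack([x1, filled]), y)

if __name__ == "__main__":
    x1, x2, y = simulate()
    print("true coefficients:", [1.0, 2.0, -1.0])
    print("casewise deletion:", casewise_deletion(x1, x2, y))
    print("mean imputation:", mean_imputation(x1, x2, y))
    print("conditional mean:", conditional_mean_imputation(x1, x2, y))
```

Note that the conditional mean step is itself a small regression model. Which covariates enter it is a modeling decision, which is the sense in which the abstract calls imputation models substantive models.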

Author information

Correspondence to Christopher Paul.

Additional information

The research reported here was partially supported by National Institutes of Health, National Cancer Institute, R01 CA65879 (SAF). We thank Nicholas Wolfinger, Naihua Duan, John Adams, John Fox, and the anonymous referees for their thoughtful comments on earlier drafts. The responsibility for any remaining errors is ours alone. Benjamin Stein was exceptionally helpful in orchestrating the simulations at the labs of UCLA Social Science Computing. Michael Mitchell of the UCLA Academic Technology Services Statistical Consulting Group artfully created Fig. 1 using the Stata graphics language; we are most grateful.

About this article

Cite this article

Paul, C., Mason, W.M., McCaffrey, D. et al. A cautionary case study of approaches to the treatment of missing data. Stat Meth Appl 17, 351–372 (2008). https://doi.org/10.1007/s10260-007-0090-4

  • DOI: https://doi.org/10.1007/s10260-007-0090-4
