Skip to main content
Log in

Multiple imputation and functional methods in the presence of measurement error and missingness in explanatory variables

  • Original paper
  • Published:
Computational Statistics Aims and scope Submit manuscript

Abstract

In many applications involving regression analysis, explanatory variables (or covariates) may be imprecisely measured or may contain missing values. Although there exists a vast literature on measurement error modeling to account for errors-in-variables, and on missing data methodology to handle missingness, very few methods have been developed to simultaneously address both. In this paper, we consider likelihood-based multiple imputation to handle missing data, and combine this with two well-known functional measurement error methods: simulation-extrapolation and corrected score. This unified approach has several appealing characteristics: the model fitting procedure is easy to understand and off-the-shelf software can be incorporated into the modeling framework; no calibration data or a validation subset is required in the model fitting procedure; and the missing data component of the proposed approach is likelihood-based which allows standard likelihood machinery. We demonstrate our methods on simulated datasets and apply them to daily ozone pollution measurements in Los Angeles where observed covariates consist of missing data and imprecise measurements. We conclude that the proposed methods substantially reduce bias and mean squared errors in regression coefficients, in comparison to methods that ignore either measurement error or missingness in covariates.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7

Similar content being viewed by others

References

  • Armstrong B (1985) Measurement error in the generalised linear model. Commun Stat Simul Comput 14(3):529–544

    MATH  Google Scholar 

  • Arunajadai SG, Rauh VA (2012) Handling covariates subject to limits of detection in regression. Environ Ecol Stat 19(3):369–391

    MathSciNet  Google Scholar 

  • Breiman L, Friedman JH (1985) Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 80(391):580–598

    MathSciNet  MATH  Google Scholar 

  • Buzas JS, Stefanski LA (1996) A note on corrected-score estimation. Stat Probab Lett 28(1):1–8

    MathSciNet  MATH  Google Scholar 

  • Carroll RJ, Freedman L, Pee D (1997) Design aspects of calibration studies in nutrition, with analysis of missing data in linear measurement error models. Biometrics 53(4):1440–1457

    MATH  Google Scholar 

  • Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM (2006) Measurement error in nonlinear models: a modern perspective. Chapman & Hall CRC Press, Boca Raton

    MATH  Google Scholar 

  • Casella G, Moreno E (2006) Objective Bayesian variable selection. J Am Stat Assoc 101(473):157–167

    MathSciNet  MATH  Google Scholar 

  • Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat 2:73–82

    Google Scholar 

  • Chen J, Hanfelt JJ, Huang Y (2015) A simple corrected score for logistic regression with errors-in-covariates. Commun Stat Theory Methods 44(10):2024–2036

    MathSciNet  MATH  Google Scholar 

  • Cook JR, Stefanski LA (1994) Simulation-extrapolation estimation in parametric measurement error models. J Am Stat Assoc 89(428):1314–1328

    MATH  Google Scholar 

  • Diebolt J, Celeux G (1993) Asymptotic properties of a Stochastic EM algorithm for estimating mixing proportions. Stoch Models 9:599–613

    MathSciNet  MATH  Google Scholar 

  • Diebolt J, Ip EHS (1996) Stochastic EM: method and application. In: Gilks WR, Richardson S, Spiegelhalter DJ (eds) Markov Chain Monte Carlo in practice. Chapman & Hall, London, pp 259–273

    Google Scholar 

  • Efron B, Tibshirani R (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat Sci 1(1):54–75

    MathSciNet  MATH  Google Scholar 

  • Eugster MJ, Leisch F (2011) Weighted and robust archetypal analysis. Comput Stat Data Anal 55(3):1215–1225

    MathSciNet  MATH  Google Scholar 

  • Fuller WA (1987) Measurement error models. Wiley, New York

    MATH  Google Scholar 

  • Huang Y, Wang CY (2000) Cox regression with accurate covariates unascertainable: a nonparametric-correction approach. J Am Stat Assoc 95:1209–1219

    MathSciNet  MATH  Google Scholar 

  • Huang Y, Wang CY (2001) Consistent functional methods for logistic regression with errors in covariates. J Am Stat Assoc 96:1469–1482

    MathSciNet  MATH  Google Scholar 

  • Lederer W, Kuchenhoff H (2006) A short introduction to the SIMEX and MCSIMEX. R News 6(4):26–31

    Google Scholar 

  • Liang H, Wang S, Carroll RJ (2007) Partially linear models with missing response variables and error-prone covariates. Biometrika 94(1):185–198

    MathSciNet  MATH  Google Scholar 

  • Little RJ, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York

    MATH  Google Scholar 

  • Louis TA (1982) Finding the observed information matrix when using the EM algorithm. J R Stat Soc Ser B Stat Methodol 44(2):226–233

    MathSciNet  MATH  Google Scholar 

  • Miller A (2002) Subset selection in regression. Chapman & Hall CRC Press, Boca Raton

    MATH  Google Scholar 

  • Nakamura T (1990) Corrected score function for errors-in-variables models: methodology and application to generalized linear models. Biometrika 77(1):127–137

    MathSciNet  MATH  Google Scholar 

  • Nicoletti C, Peracchi F, Foliano F (2011) Estimating income poverty in the presence of missing data and measurement error. J Bus Econ Stat 29(1):61–72

    MathSciNet  MATH  Google Scholar 

  • Nielsen SF (2000) The stochastic EM algorithm: estimation and asymptotic results. Bernoulli 6(3):457–489

    MathSciNet  MATH  Google Scholar 

  • Novick SJ, Stefanski LA (2002) Corrected score estimation via complex variable simulation extrapolation. J Am Stat Assoc 97(458):472–481

    MathSciNet  MATH  Google Scholar 

  • Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York

    MATH  Google Scholar 

  • Schafer DW (1987) Covariate measurement error in generalized linear models. Biometrika 72:385–391

    MathSciNet  MATH  Google Scholar 

  • Schafer JL (1999) Multiple imputation: a primer. Stat Methods Med Res 8(1):3–15

    MathSciNet  Google Scholar 

  • Seaman SR, White IR (2011) Review of inverse probability weighting for dealing with missing data. Stat Methods Med Res 22(3):278–295

    MathSciNet  Google Scholar 

  • Shen CW, Chen YH (2016) Model selection for marginal regression analysis of longitudinal data with missing observations and covariate measurement error. Biostatistics 16(4):740–753

    MathSciNet  Google Scholar 

  • Song X, Wang CY (2019) GMM nonparametric correction methods for logistic regression with error-contaminated covariates and partially observed instrumental variables. Scand J Stat 46(3):898–919

    MathSciNet  MATH  Google Scholar 

  • Stefanski LA (1989) Unbiased estimation of a nonlinear function a normal mean with application to measurement error for models. Commun Stat Theory Methods 18(12):4335–4358

    MathSciNet  MATH  Google Scholar 

  • Stefanski LA, Carroll RJ (1985) Covariate measurement error in logistic regression. Ann Stat 13(4):1335–1351

    MathSciNet  MATH  Google Scholar 

  • Stefanski LA, Cook JR (1995) Simulation-extrapolation: the measurement error jackknife. J Am Stat Assoc 90(432):1247–1256

    MathSciNet  MATH  Google Scholar 

  • Stoklosa J, Daly C, Foster SD, Ashcroft MB, Warton DI (2015) A climate of uncertainty: accounting for error in climate variables for species distribution models. Methods Ecol Evol 6(4):412–423

    Google Scholar 

  • Stoklosa J, Lee SH, Hwang WH (2019) Closed-population capture-recapture models with measurement error and missing observations in covariates. Stat Sin 29(2):589–610

    MathSciNet  MATH  Google Scholar 

  • Van Buuren S (2012) Flexible imputation of missing data. Chapman & Hall CRC Press, Boca Raton

    MATH  Google Scholar 

  • van Buuren S, Groothuis-Oudshoorn K (2011) MICE: Multivariate imputation by chained equations in R. J Stat Softw 45(3):1–67

    Google Scholar 

  • von Hippel PT (2013) The bias and efficiency of incomplete-data estimators in small univariate normal samples. Social Methods Res 42(4):531–558

    MathSciNet  Google Scholar 

  • Wang N, Robins JM (1998) Large-sample theory for parametric multiple imputation procedures. Biometrika 85:935–948

    MathSciNet  MATH  Google Scholar 

  • Wang CY, Huang Y, Chao EC, Jeffcoat MK (2008) Expected estimating equations for missing data, measurement error, and misclassification, with application to longitudinal nonignorable missing data. Biometrics 64(1):85–95

    MathSciNet  MATH  Google Scholar 

  • Wang M, Sun X, Lu T (2015) Bayesian structured variable selection in linear regression models. Comput Stat 30(1):205–229

    MathSciNet  MATH  Google Scholar 

  • Yi GY, Ma Y, Carroll RJ (2012) A functional generalized method of moments approach for longitudinal studies with missing responses and covariate measurement error. Biometrika 99(1):151–165

    MathSciNet  MATH  Google Scholar 

Download references

Acknowledgements

This work was supported by an Australian Government Research Training Program Scholarship. We thank the associate editor and the three anonymous reviewers for their constructive comments.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Firouzeh Noghrehchi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary material

Supplementary material

The following plots, Figs. 8 and 9, show the results of the simulation study in Sect. 3.1 in terms of the magnitute of bias when \(n=50\) and \(n=1000\). Figures 8, 9, 10 and 11 show the accuracy of comparing methods in parameter estimation when \(X \sim N(0,1)\) and \(X \sim U(-1,1)\), respectively, for each n. Figures 8 and 9 are consistent with Fig. 1 (see Sect. 3.1) when \(n=100\) and \(X \sim N(0,1)\). They show that MI-CS and MI-SIMEX outperform the other comparing methods in terms of accuracy as the missingness and \(\sigma ^2_u\) are increased. Also, Figs. 8 and 9 are comparable with Fig. 2 in Sect. 3.1 when \(n=100\) and \(X \sim U(-1,1)\) in a similar way.

Fig. 8
figure 8

Boxplot of the slope coefficient estimates for different methods when \(X \sim N(0,1)\) and \(n=50\). Different methods are compared with the original dataset (complete, in mint) based on the accuracy of their estimators as the missing proportion and the error variance increase from the top-left corner to the bottom-right. Note that the true value of the slope coefficient is 1

Fig. 9
figure 9

Boxplot of the slope coefficient estimates for different methods when \(X \sim N(0,1)\) and \(n=1000\). Different methods are compared with the original dataset (complete, in mint) based on the accuracy of their estimators as the missing proportion and the error variance increase from the top-left corner to the bottom-right. Note that the true value of the slope coefficient is 1

Fig. 10
figure 10

Boxplot of the slope coefficient estimates for different methods when \(X \sim U(-1,1)\) and \(n=50\). Different methods are compared with the original dataset (complete, in mint) based on the accuracy of their estimators as the missing proportion and the error variance increase from the top-left corner to the bottom-right. Note that the true value of the slope coefficient is 1

Fig. 11
figure 11

Boxplot of the slope coefficient estimates for different methods when \(X \sim U(-1,1)\) and \(n=1000\). Different methods are compared with the original dataset (complete, in mint) based on the accuracy of their estimators as the missing proportion and the error variance increase from the top-left corner to the bottom-right. Note that the true value of the slope coefficient is 1

The following plots, Figs. 12 and 13, show the results of the simulation study in Sect. 3.1 in terms of MSE when \(n=50\) and \(n=1000\). Figures 12 and 13 show the predictive performances of the comparing methods when \(n=50\) and \(n=1000\), respectively. Comparing Fig. 3 (see Sect. 3.1) with these two plots suggests that MI-CS had the best predictive performance as the missingness and \(\sigma ^2_u\) increased. MI-SIMEX showed relatively high MSE for \(n=50\), however, the predictive performance of MI-CS improved substantially and equaled MI-SIMEX as n increased. The predictive performances of the comparing methods were similar when \(X \sim N(0,1)\) and \(X \sim U(-1,1)\).

Fig. 12
figure 12

Mean squared error of the slope coefficient estimates for different methods with \(n=50\) when a\(X \sim N(0,1)\) and b\(X \sim U(-1,1)\). Different methods are compared with the original dataset (complete, in solid mint line) for different values of error variance as the missing proportion increases

Fig. 13
figure 13

Mean squared error of the slope coefficient estimates for different methods with \(n=1000\) when a\(X \sim N(0,1)\) and b\(X \sim U(-1,1)\). Different methods are compared with the original dataset (complete, in solid mint line) for different values of error variance as the missing proportion increases

The following plot, Fig. 14, shows the results of the Poisson log-linear simulation study in Sect. 3.2 when \(X \sim U(-1,1)\). Figure 14 is consistent with Fig. 4 (see Sect. 3.2) for when \(X \sim N(0,1)\). It shows that MI-CS and MI-SIMEX outperform the other comparing methods in terms of accuracy as the missingness and \(\sigma ^2_u\) increased.

Fig. 14
figure 14

Boxplot of the slope coefficient estimates for different methods for Poisson log-linear model when \(X \sim U(-1,1)\). Different methods are compared with the original dataset (complete, in mint) based on the accuracy of their estimators as the missing proportion and the error variance increase from the top-left corner to the bottom-right. Note that the true value of the slope coefficient is 1

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Noghrehchi, F., Stoklosa, J. & Penev, S. Multiple imputation and functional methods in the presence of measurement error and missingness in explanatory variables. Comput Stat 35, 1291–1317 (2020). https://doi.org/10.1007/s00180-020-00976-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00180-020-00976-2

Keywords

Navigation