Prediction regions for the visualization of incomplete datasets

Computational Statistics

Summary

A complication in the visualization of biomedical datasets is that they are often incomplete. A response to this is to multiply impute each missing datum prior to visualization in order to convey the uncertainty of the imputations. In our approach, the initially complete cases in a real-valued dataset are represented as points in a principal components plot and, for each initially incomplete case in the dataset, we use an associated prediction region or interval displayed on the same plot to indicate the probable location of the case. When a case has only one missing datum, a prediction interval is used in place of a region. The prediction region or interval associated with an incomplete case is determined from the dispersion of the multiple imputations of the case mapped onto the plot. We illustrate this approach with two incomplete datasets: the first is based on two multivariate normal distributions; the second on a published, simulated health survey.
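
The following is a minimal sketch of the idea described in the summary, not the authors' implementation. It assumes a multivariate normal model with parameters estimated from the complete cases and held fixed (a proper multiple imputation would also draw the parameters), and it summarizes the dispersion of the mapped imputations with a covariance ellipse. The function names `fit_pca`, `impute_normal`, and `dispersion_ellipse`, the number of imputations `m`, and the `scale` factor are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_pca(X, k=2):
    """Return the mean and the first k principal axes of the rows of X."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k].T

def impute_normal(x, mu, sigma, m, rng):
    """Draw m completions of x (NaN = missing) from the conditional normal.

    Assumption: mu and sigma are treated as known, fixed parameters.
    """
    miss = np.isnan(x)
    obs = ~miss
    s_oo = sigma[np.ix_(obs, obs)]
    s_mo = sigma[np.ix_(miss, obs)]
    s_mm = sigma[np.ix_(miss, miss)]
    # Conditional mean and covariance of the missing block given the observed block.
    w = s_mo @ np.linalg.inv(s_oo)
    cond_mean = mu[miss] + w @ (x[obs] - mu[obs])
    cond_cov = s_mm - w @ s_mo.T
    draws = rng.multivariate_normal(cond_mean, cond_cov, size=m)
    completed = np.tile(x, (m, 1))
    completed[:, miss] = draws
    return completed

def dispersion_ellipse(points, scale=2.0):
    """Centre, semi-axis lengths, and axis directions of a covariance ellipse."""
    centre = points.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(points, rowvar=False))
    # Clamp tiny negative eigenvalues caused by rounding.
    return centre, scale * np.sqrt(np.maximum(evals, 0.0)), evecs

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]])
complete = rng.multivariate_normal(np.zeros(3), true_cov, size=200)
incomplete = np.array([0.8, np.nan, np.nan])  # two missing values

# The principal components plot is defined by the complete cases.
mean, axes = fit_pca(complete, k=2)
mu_hat = complete.mean(axis=0)
sigma_hat = np.cov(complete, rowvar=False)

# Map the multiple imputations of the incomplete case onto the plot.
imputations = impute_normal(incomplete, mu_hat, sigma_hat, m=50, rng=rng)
scores = (imputations - mean) @ axes
centre, semi_axes, directions = dispersion_ellipse(scores)
```

In this sketch, a case with a single missing value produces imputations that differ in only one coordinate, so their projections fall on a straight line in the plot and the ellipse collapses to a segment, which is consistent with the use of an interval rather than a region for such cases.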

Acknowledgements

We thank the two anonymous referees for their constructive comments.

Cite this article

Dybowski, R., Weller, P.R. Prediction regions for the visualization of incomplete datasets. Computational Statistics 16, 25–41 (2001). https://doi.org/10.1007/PL00022718

  • DOI: https://doi.org/10.1007/PL00022718
