Summary
A complication in the visualization of biomedical datasets is that they are often incomplete. A response to this is to multiply impute each missing datum prior to visualization in order to convey the uncertainty of the imputations. In our approach, the initially complete cases in a real-valued dataset are represented as points in a principal components plot and, for each initially incomplete case in the dataset, we use an associated prediction region or interval displayed on the same plot to indicate the probable location of the case. When a case has only one missing datum, a prediction interval is used in place of a region. The prediction region or interval associated with an incomplete case is determined from the dispersion of the multiple imputations of the case mapped onto the plot. We illustrate this approach with two incomplete datasets: the first is based on two multivariate normal distributions; the second on a published, simulated health survey.
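As a rough illustration of the idea in the summary, the following Python sketch projects multiple imputations of an incomplete case onto a principal components plot fitted to the complete cases and summarizes their dispersion. It rests on several assumptions that the summary does not state. The function names are hypothetical. The imputations are drawn from a conditional normal distribution with plug-in estimates, which is a simplification of proper multiple imputation. The region is taken to be a normal-theory 95% ellipse (5.991 is the 0.95 quantile of a chi-square distribution with 2 degrees of freedom), which may differ from the construction used in the article.

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Return the column means and the first principal axes of X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T

def project(X, mean, loadings):
    """Map rows of X to principal component scores."""
    return (np.atleast_2d(X) - mean) @ loadings

def impute_draws(x, mean, cov, m, rng):
    """Draw m completions of x (NaN marks missing) from the conditional normal.

    Assumption: plug-in mean and covariance, so parameter uncertainty is ignored.
    """
    miss = np.isnan(x)
    obs = ~miss
    s_oo = cov[np.ix_(obs, obs)]
    s_mo = cov[np.ix_(miss, obs)]
    s_mm = cov[np.ix_(miss, miss)]
    w = s_mo @ np.linalg.inv(s_oo)
    cond_mean = mean[miss] + w @ (x[obs] - mean[obs])
    cond_cov = s_mm - w @ s_mo.T
    cond_cov = (cond_cov + cond_cov.T) / 2  # remove rounding asymmetry
    draws = rng.multivariate_normal(cond_mean, cond_cov, size=m)
    completed = np.tile(x, (m, 1))
    completed[:, miss] = draws
    return completed

def prediction_ellipse(scores, chi2_quantile=5.991):
    """Centre, semi-axis lengths (ascending) and axis directions of the ellipse."""
    center = scores.mean(axis=0)
    cov = np.cov(scores, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    semi_axes = np.sqrt(np.clip(evals, 0.0, None) * chi2_quantile)
    return center, semi_axes, evecs

rng = np.random.default_rng(0)
true_cov = np.array([[2.0, 0.8, 0.3],
                     [0.8, 1.0, 0.4],
                     [0.3, 0.4, 1.5]])
complete = rng.multivariate_normal(np.zeros(3), true_cov, size=200)
mean, loadings = fit_pca(complete)
cov = np.cov(complete, rowvar=False)

# Case with two missing values: the imputations spread over a region.
case2 = np.array([0.5, np.nan, np.nan])
scores2 = project(impute_draws(case2, mean, cov, 50, rng), mean, loadings)
center2, semi_axes2, _ = prediction_ellipse(scores2)

# Case with one missing value: the imputations fall on a line in the plot,
# so the ellipse collapses to an interval.
case1 = np.array([0.5, -0.2, np.nan])
scores1 = project(impute_draws(case1, mean, cov, 50, rng), mean, loadings)
center1, semi_axes1, _ = prediction_ellipse(scores1)
```

In this sketch the one-missing-datum case needs no special handling: its completions differ along a single coordinate, so their scores are collinear, the smaller semi-axis is numerically zero, and what remains is an interval along the larger axis.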




Acknowledgements
We thank the two anonymous referees for their constructive comments.
Cite this article
Dybowski, R., Weller, P.R. Prediction regions for the visualization of incomplete datasets. Computational Statistics 16, 25–41 (2001). https://doi.org/10.1007/PL00022718