
Imputation of Missing Data in Industrial Databases

Applied Intelligence

Abstract

A limiting factor for the application of intelligent data analysis (IDA) methods in many domains is the incompleteness of data repositories. Many records have fields that are not filled in, especially when data entry is manual. In addition, a significant fraction of the entries can be erroneous, and there may be no alternative but to discard these records. However, the cells in a database are not independent data: statistical relationships constrain, and often determine, missing values. Data imputation, the filling in of missing values in partially complete records, can thus be an invaluable first step in many IDA projects. New imputation methods are needed that can handle the large scale and high sparsity of industrial databases. To illustrate the incomplete database problem, we analyze one database of instrumentation maintenance and test records for an industrial process. Despite regulatory requirements for process data collection, this database is less than 50% complete. Next, we discuss possible solutions to the missing data problem. Several approaches to imputation are noted and classified into two categories: data-driven and model-based. We then describe two machine-learning-based approaches that we have worked with. These build upon two well-known algorithms, AutoClass and C4.5. Several experiments are designed, all using the maintenance database as a common test-bed but with various data splits and algorithmic variations. Results are generally positive, with imputation accuracies of up to 80%. We conclude the paper by outlining some considerations in selecting imputation methods and by discussing applications of data imputation for intelligent data analysis.
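The distinction between data-driven and model-based imputation can be illustrated with a minimal sketch. It is not the authors' AutoClass or C4.5 procedure. The toy records, the field names `instrument` and `result`, and both helper functions are hypothetical and chosen only for illustration. The first helper fills a missing categorical field with its overall most common value; the second conditions on one other field, which is roughly what a one-level decision tree would do.

```python
from collections import Counter, defaultdict

# Hypothetical toy maintenance records; None marks a missing field.
records = [
    {"instrument": "valve", "result": "pass"},
    {"instrument": "valve", "result": "pass"},
    {"instrument": "valve", "result": None},
    {"instrument": "sensor", "result": "fail"},
    {"instrument": "sensor", "result": "fail"},
    {"instrument": "sensor", "result": None},
    {"instrument": "valve", "result": "pass"},
]


def impute_mode(rows, target):
    """Data-driven baseline: fill a missing target with its overall mode."""
    observed = [r[target] for r in rows if r[target] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [
        dict(r, **{target: mode}) if r[target] is None else dict(r)
        for r in rows
    ]


def impute_conditional(rows, target, given):
    """Model-based sketch: predict the target from one other field."""
    by_value = defaultdict(Counter)  # value of `given` -> target counts
    overall = Counter()              # fallback when `given` was never seen
    for r in rows:
        if r[target] is not None:
            by_value[r[given]][r[target]] += 1
            overall[r[target]] += 1
    fallback = overall.most_common(1)[0][0]

    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left unchanged
        if r[target] is None:
            counts = by_value.get(r[given])
            r[target] = counts.most_common(1)[0][0] if counts else fallback
        filled.append(r)
    return filled


mode_filled = impute_mode(records, "result")
cond_filled = impute_conditional(records, "result", "instrument")
```

On this toy data the mode baseline fills both gaps with `pass`, while the conditional version fills the sensor record with `fail`, because every observed sensor record failed. Imputation accuracy of the kind reported in the abstract could be estimated in the same spirit by hiding known values, imputing them, and counting how many are recovered.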




Cite this article

Lakshminarayan, K., Harp, S.A. & Samad, T. Imputation of Missing Data in Industrial Databases. Applied Intelligence 11, 259–275 (1999). https://doi.org/10.1023/A:1008334909089


  • DOI: https://doi.org/10.1023/A:1008334909089
