Abstract
A limiting factor for the application of IDA methods in many domains is the incompleteness of data repositories. Many records have fields that are not filled in, especially when data entry is manual. In addition, a significant fraction of the entries can be erroneous, and there may be no alternative but to discard these records. But a cell in a database is not an independent datum. Statistical relationships constrain, and often determine, missing values. Data imputation, the filling in of missing values in partially missing data, can thus be an invaluable first step in many IDA projects. New imputation methods are needed that can handle the large-scale problems and large-scale sparsity of industrial databases. To illustrate the incomplete database problem, we analyze one database of instrumentation maintenance and test records for an industrial process. Despite regulatory requirements for process data collection, this database is less than 50% complete. Next, we discuss possible solutions to the missing data problem. Several approaches to imputation are noted and classified into two categories: data-driven and model-based. We then describe two machine-learning-based approaches that we have worked with. These build upon well-known algorithms: AutoClass and C4.5. Several experiments are designed, all using the maintenance database as a common test-bed but with various data splits and algorithmic variations. Results are generally positive, with imputation accuracies of up to 80%. We conclude the paper by outlining some considerations in selecting imputation methods, and by discussing applications of data imputation for intelligent data analysis.
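As a rough illustration of how relationships between fields can determine a missing value, the sketch below fills a missing categorical field with the most common value observed among complete records that share the same value of a related field. This is a deliberately simple stand-in, not the AutoClass- or C4.5-based methods of the paper; the function name `impute_by_group_mode`, the field names, and the sample records are all hypothetical.

```python
from collections import Counter, defaultdict


def impute_by_group_mode(records, target, given):
    """Return copies of `records` in which a missing `target` value (None)
    is replaced by the most common `target` value among complete records
    with the same `given` value. Falls back to the overall most common
    value when no matching complete record exists."""
    by_group = defaultdict(Counter)  # given value -> counts of target values
    overall = Counter()              # counts of target values over all records
    for r in records:
        if r[target] is not None:
            overall[r[target]] += 1
            if r[given] is not None:
                by_group[r[given]][r[target]] += 1

    filled = []
    for r in records:
        r = dict(r)  # copy, so the input records are left unchanged
        if r[target] is None:
            counts = by_group.get(r[given]) or overall
            if counts:
                r[target] = counts.most_common(1)[0][0]
        filled.append(r)
    return filled


# Hypothetical maintenance records; None marks a field that was not filled in.
records = [
    {"instrument": "valve", "failure": "leak"},
    {"instrument": "valve", "failure": "leak"},
    {"instrument": "valve", "failure": "stuck"},
    {"instrument": "sensor", "failure": "drift"},
    {"instrument": "sensor", "failure": None},
    {"instrument": "valve", "failure": None},
]
completed = impute_by_group_mode(records, target="failure", given="instrument")
```

Conditioning on a single field is the crudest possible model; a tree- or mixture-based learner would condition on many fields at once.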
Lakshminarayan, K., Harp, S.A. & Samad, T. Imputation of Missing Data in Industrial Databases. Applied Intelligence 11, 259–275 (1999). https://doi.org/10.1023/A:1008334909089
DOI: https://doi.org/10.1023/A:1008334909089