Imputation of Missing Data in Industrial Databases

Lakshminarayan, Kamakshi; Harp, Steven A.; Samad, Tariq

doi:10.1023/A:1008334909089

Published: November 1999

Applied Intelligence, Volume 11, pages 259–275 (1999)

Abstract

A limiting factor for the application of intelligent data analysis (IDA) methods in many domains is the incompleteness of data repositories. Many records have fields that are not filled in, especially when data entry is manual. In addition, a significant fraction of the entries can be erroneous, and there may be no alternative but to discard these records. However, the cells in a database are not independent data: statistical relationships constrain, and often determine, missing values. Data imputation, the filling in of missing values in partially complete records, can thus be an invaluable first step in many IDA projects. New imputation methods that can handle the scale and sparsity of industrial databases are needed. To illustrate the incomplete database problem, we analyze one database with instrumentation maintenance and test records for an industrial process. Despite regulatory requirements for process data collection, this database is less than 50% complete. Next, we discuss possible solutions to the missing data problem. Several approaches to imputation are noted and classified into two categories: data-driven and model-based. We then describe two machine-learning-based approaches that we have worked with. These build upon well-known algorithms: AutoClass and C4.5. Several experiments are designed, all using the maintenance database as a common test-bed but with various data splits and algorithmic variations. Results are generally positive, with imputation accuracies of up to 80%. We conclude the paper by outlining some considerations in selecting imputation methods, and by discussing applications of data imputation for intelligent data analysis.
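The idea that statistical relationships between fields constrain missing values can be illustrated with a minimal sketch. It is not the AutoClass- or C4.5-based method of the paper; it is a simple conditional-mode fill, assumed here only for illustration. The function name, the `MISSING` marker, and the record fields (`device`, `fault`) are all hypothetical.

```python
from collections import Counter, defaultdict

MISSING = None  # assumed marker for an unfilled field


def impute_by_conditional_mode(records, target, given):
    """Fill missing `target` values with the most common value seen among
    records that share the same `given` value; fall back to the overall
    mode of `target` when no such record exists."""
    overall = Counter()
    by_given = defaultdict(Counter)
    for rec in records:
        value = rec.get(target, MISSING)
        if value is MISSING:
            continue
        overall[value] += 1
        key = rec.get(given, MISSING)
        if key is not MISSING:
            by_given[key][value] += 1

    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are left untouched
        if rec.get(target, MISSING) is MISSING:
            # .get() avoids creating empty entries in the defaultdict
            counts = by_given.get(rec.get(given, MISSING)) or overall
            if counts:
                rec[target] = counts.most_common(1)[0][0]
        filled.append(rec)
    return filled


# Hypothetical maintenance records; field names are illustrative only.
records = [
    {"device": "valve", "fault": "leak"},
    {"device": "valve", "fault": "leak"},
    {"device": "valve", "fault": "leak"},
    {"device": "valve", "fault": "stuck"},
    {"device": "sensor", "fault": "drift"},
    {"device": "sensor", "fault": "drift"},
    {"device": "sensor", "fault": None},
    {"device": "valve", "fault": None},
    {"device": "pump", "fault": None},
]

completed = impute_by_conditional_mode(records, target="fault", given="device")
```

In this toy data the missing sensor fault is filled from other sensor records, while the pump record, which has no peers, falls back to the overall most common fault. A model-based approach would replace the lookup table with a learned predictor over several fields.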

References

K. Lakshminarayan, S.A. Harp, R. Goldman, and T. Samad, “Imputation of missing data using machine learning techniques,” in Proceedings: Second International Conference on Knowledge Discovery and Data Mining, edited by Simoudis, Han and Fayyad, AAAI Press: Menlo Park, CA, pp. 140-145, 1996.
J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers: San Mateo, CA, 1993.
G.A. Greathouse and C.J. Wessel, Deterioration of Materials, Reinhold Publishing Corporation: New York, 1954.
D.B. Rubin, “Inference and missing data,” Biometrika, vol. 63, no. 3, pp. 581-592, 1976.
W. Vach, “Missing values: statistical theory and computational practise,” in Computational Statistics, edited by P. Dirschedl and R. Ostermann, Physica-Verlag: Heidelberg, pp. 345-354, 1994.
R.J. Little and D.B. Rubin, Statistical Analysis with Missing Data, John Wiley and Sons: New York, 1987.
A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm (with discussion),” J. Roy. Statist. Soc., vol. B39, pp. 1-38, 1977.
M.A. Tanner and W.H. Wong, “The calculation of posterior distributions by data augmentation (with discussion),” Journal of the American Statistical Association, vol. 82, pp. 528-550, 1987.
D.B. Rubin, Multiple Imputation for Nonresponse in Surveys, John Wiley and Sons: New York, 1987.
D.B. Rubin, “Multiple imputation after 18+ years,” Journal of the American Statistical Association, vol. 91, no. 434, pp. 473-489, 1996.
D.B. Rubin, “Multiple imputation in sample surveys: A phenomenological Bayesian approach to non-response,” in Proceedings Survey Research Methodology Section, American Statistical Association, pp. 20-28, 1978.
B.L. Ford, “An overview of hot-deck procedures,” in Incomplete Data in Sample Surveys, Volume 2, Theory and Bibliographies, edited by W.G. Madow, I. Olkin, and D.B. Rubin, Academic Press: New York, pp. 185-207, 1983.
A. Unwin, G. Hawkins, H. Hofmann, and B. Siegl, “Interactive graphics for data sets with missing values: MANET,” Journal of Computational and Graphical Statistics, vol. 5, no. 2, 1996.
D.F. Swayne and A. Buja, “Missing data in interactive high-dimensional data visualization,” Computational Statistics, vol. 13, no. 1, 1998.
D.F. Swayne, D. Cook, and A. Buja, “XGobi: interactive dynamic data visualization in the X window system,” Journal of Computational and Graphical Statistics, vol. 7, no. 1, 1998.
H.L. Oh and F.J. Scheuren, “Weighting adjustments for unit nonresponse,” in Incomplete Data in Sample Surveys, Volume 2, Theory and Bibliographies, edited by W.G. Madow, I. Olkin, and D.B. Rubin, Academic Press: New York, pp. 143-183, 1983.
S.F. Buck, “A method of estimation of missing values in multivariate data suitable for use with an electronic computer,” J. Roy. Statist. Soc., vol. B22, pp. 302-306, 1960.
J.L. Schafer, Analysis of Incomplete Multivariate Data, Chapman and Hall: London, 1997.
L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees, Chapman and Hall, 1984.
P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman, “Bayesian classification,” in Proceedings of American Association of Artificial Intelligence (AAAI), Morgan Kaufmann Publishers: San Mateo, CA, pp. 607-611, 1988.
P. Cheeseman and J. Stutz, “Bayesian classification (AutoClass): Theory and results,” in Advances in Knowledge Discovery and Data Mining, edited by U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, AAAI Press: Menlo Park, CA, 1996.
R. Hanson, J. Stutz, and P. Cheeseman, “Bayesian classification theory,” Technical Report FIA-90-12-7-01, NASA Ames Research Center, 1990.
K. McKusick and K. Thompson, “Cobweb/3: A portable implementation,” Technical Report FIA-90-6-18-2, NASA Ames Research Center, 1990.

Author information

Authors and Affiliations

Honeywell Technology Center, 3660 Technology Drive, Minneapolis, MN, 55418
Kamakshi Lakshminarayan, Steven A. Harp & Tariq Samad

About this article

Cite this article

Lakshminarayan, K., Harp, S.A. & Samad, T. Imputation of Missing Data in Industrial Databases. Applied Intelligence 11, 259–275 (1999). https://doi.org/10.1023/A:1008334909089
