Scalable Model-Based Cascaded Imputation of Missing Data

Montiel, Jacob; Read, Jesse; Bifet, Albert; Abdessalem, Talel

doi:10.1007/978-3-319-93040-4_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10939)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Missing data is a common trait of real-world data that can negatively impact interpretability. In this paper, we present Cascade Imputation (CIM), an effective and scalable technique for automatic imputation of missing data. CIM places few restrictions on the data set: it supports data that are Missing At Random or Missing Completely At Random, numerical and nominal attributes, and large data sets, including high-dimensional ones. We compare CIM against well-established imputation techniques on a variety of data sets under multiple test configurations to measure the impact of imputation on classification. Test results show that CIM outperforms other imputation methods under multiple test conditions. Additionally, we identify the conditions under which popular imputation techniques perform best and those under which they fail.
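The abstract does not spell out the procedure, so the following is only a minimal sketch of one plausible reading of "model-based cascaded imputation", not the authors' implementation. It assumes that attributes are processed in order of increasing number of missing values, that each attribute is imputed by a model trained on the attributes already completed, and that imputed values then serve as inputs for later attributes. The function names, the numeric-only data, and the 1-nearest-neighbour predictor (with a column-mean fallback) are illustrative stand-ins chosen for brevity.

```python
import math


def predict(train_x, train_y, query):
    """Toy stand-in model: 1-nearest neighbour on the completed columns."""
    if not query:  # no completed columns yet: fall back to the column mean
        return sum(train_y) / len(train_y)
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]


def cascade_impute(rows):
    """Fill None entries column by column, fewest missing values first.

    Assumes numeric data and at least one observed value per column.
    """
    data = [list(r) for r in rows]  # work on a copy, leave the input untouched
    n_cols = len(data[0])
    missing = {c: sum(r[c] is None for r in data) for c in range(n_cols)}
    order = sorted(range(n_cols), key=lambda c: missing[c])
    done = []  # columns that are complete, originally or after imputation
    for col in order:
        if missing[col] > 0:
            known = [r for r in data if r[col] is not None]
            train_x = [[r[c] for c in done] for r in known]
            train_y = [r[col] for r in known]
            for r in data:
                if r[col] is None:
                    r[col] = predict(train_x, train_y, [r[c] for c in done])
        done.append(col)  # later columns may now use this one as a feature
    return data


rows = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, None],
    [2.9, None, 300.0],
    [4.0, 40.0, 400.0],
]
filled = cascade_impute(rows)
```

In this toy run the second column is completed first, and the value imputed there is then part of the feature vector used to fill the third column, which is the "cascade" in this reading. Any regressor or classifier could replace `predict`; nominal attributes would need a classifier and a suitable distance or encoding, which this sketch omits.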

Author information

Authors and Affiliations

LTCI, Télécom ParisTech, Université Paris-Saclay, 75013, Paris, France
Jacob Montiel, Albert Bifet & Talel Abdessalem
LIX, École Polytechnique, 91120, Palaiseau, France
Jesse Read
UMI CNRS IPAL & National University of Singapore, Singapore, Singapore
Talel Abdessalem

Corresponding author

Correspondence to Jacob Montiel.

Editor information

Editors and Affiliations

Deakin University, Geelong, Victoria, Australia
Dinh Phung
National Chiao Tung University, Hsinchu City, Taiwan
Vincent S. Tseng
Monash University, Clayton, Victoria, Australia
Geoffrey I. Webb
Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Bao Ho
University of Melbourne, Melbourne, Victoria, Australia
Mohadeseh Ganji
University of Melbourne, Melbourne, Victoria, Australia
Lida Rashidi

Cite this paper

Montiel, J., Read, J., Bifet, A., Abdessalem, T. (2018). Scalable Model-Based Cascaded Imputation of Missing Data. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_6

DOI: https://doi.org/10.1007/978-3-319-93040-4_6
Published: 17 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93039-8
Online ISBN: 978-3-319-93040-4
eBook Packages: Computer Science, Computer Science (R0)
