Abstract
Missing data is a common trait of real-world data that can negatively impact interpretability. In this paper, we present Cascade Imputation (CIM), an effective and scalable technique for automatic imputation of missing data. CIM places few restrictions on the characteristics of the data set: it supports Missing At Random (MAR) and Missing Completely At Random (MCAR) data, numerical and nominal attributes, and large data sets, including high-dimensional ones. We compare CIM against well-established imputation techniques over a variety of data sets and under multiple test configurations to measure the impact of imputation on the classification task. Test results show that CIM outperforms other imputation methods over multiple test conditions. Additionally, we identify optimal performance and failure conditions for popular imputation techniques.
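The two missingness mechanisms named in the abstract can be illustrated with a small sketch. This is not the paper's code or its test protocol: the helper names (`mask_mcar`, `mask_mar`, `missing_rate`), the `age` and `income` attributes, and all rates and thresholds are hypothetical and chosen only for illustration. Under Missing Completely At Random, every value is hidden with the same probability; under Missing At Random, the probability of hiding a value depends on another attribute that is fully observed.

```python
import random


def mask_mcar(rows, column, rate, rng):
    """Hide `column` in each row with the same probability `rate`,
    independently of any value in the data (MCAR)."""
    return [
        {**row, column: None} if rng.random() < rate else dict(row)
        for row in rows
    ]


def mask_mar(rows, column, driver, threshold, rate, rng):
    """Hide `column` with probability `rate`, but only in rows where the
    fully observed `driver` attribute exceeds `threshold` (MAR: the
    missingness depends on observed data, not on the hidden value)."""
    masked = []
    for row in rows:
        hide = row[driver] > threshold and rng.random() < rate
        masked.append({**row, column: None} if hide else dict(row))
    return masked


def missing_rate(rows, column):
    """Fraction of rows in which `column` is missing (None)."""
    return sum(row[column] is None for row in rows) / len(rows)


# Hypothetical synthetic data: one integer and one real-valued attribute.
rng = random.Random(0)
rows = [
    {"age": rng.randint(18, 80), "income": rng.gauss(50.0, 10.0)}
    for _ in range(1000)
]

mcar = mask_mcar(rows, "income", 0.2, rng)
mar = mask_mar(rows, "income", "age", 50, 0.4, rng)

print("MCAR missing rate:", missing_rate(mcar, "income"))
print("MAR missing rate: ", missing_rate(mar, "income"))
```

In the MAR copy, every missing `income` belongs to a row with `age` above the threshold, so an imputation method that conditions on `age` has information about where the gaps are; in the MCAR copy it does not.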
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Montiel, J., Read, J., Bifet, A., Abdessalem, T. (2018). Scalable Model-Based Cascaded Imputation of Missing Data. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science(), vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_6
DOI: https://doi.org/10.1007/978-3-319-93040-4_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93039-8
Online ISBN: 978-3-319-93040-4
eBook Packages: Computer Science (R0)