Scalable Model-Based Cascaded Imputation of Missing Data

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10939)

Abstract

Missing data is a common trait of real-world data that can negatively impact interpretability. In this paper, we present Cascade Imputation (CIM), an effective and scalable technique for automatic imputation of missing data. CIM is not restrictive on the characteristics of the data set, providing support for: Missing At Random and Missing Completely At Random data, numerical and nominal attributes, and large data sets including high-dimensional data sets. We compare CIM against well-established imputation techniques over a variety of data sets under multiple test configurations to measure the impact of imputation on the classification problem. Test results show that CIM outperforms other imputation methods over multiple test conditions. Additionally, we identify optimal performance and failure conditions for popular imputation techniques.

Author information

Corresponding author

Correspondence to Jacob Montiel.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Montiel, J., Read, J., Bifet, A., Abdessalem, T. (2018). Scalable Model-Based Cascaded Imputation of Missing Data. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_6

  • DOI: https://doi.org/10.1007/978-3-319-93040-4_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93039-8

  • Online ISBN: 978-3-319-93040-4

  • eBook Packages: Computer Science, Computer Science (R0)
