Ensemble Learning for Heterogeneous Missing Data Imputation

  • Conference paper
Big Data – BigData 2020 (BIGDATA 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12402)

Abstract

Missing values can significantly affect the results of analyses and decision making in any field. Two major families of methods address this issue: statistical and model-based. The former introduces bias into the analyses, while the latter is usually designed for limited, specific use cases. To overcome the limitations of both, we present a stacked ensemble framework that integrates the adaptive random forest algorithm, the Jaccard index, and Bayesian probability. Because heterogeneous, distributed data from multiple sources poses a particular challenge, the model built for our use case supports several data types: continuous, discrete, categorical, and binary. The proposed model tackles missing data in a broad and comprehensive context of massive data sources and data formats. We evaluated the framework extensively on five different datasets containing labelled and unlabelled data. The experiments showed that it produces encouraging and competitive results compared with statistical and model-based methods. Because the framework works across various datasets, it overcomes the limitations of model-based methods identified in the literature review.
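The abstract names the Jaccard index as one ingredient of the framework but does not say how it is combined with the other components. As a rough illustration only, the sketch below computes the Jaccard index between sets of observed (field, value) pairs and uses it to pick a donor record for a missing categorical value. The function names, the donor-selection rule, and the sample records are all assumptions made for this example; this is not the authors' stacked ensemble.

```python
from typing import List, Optional


def jaccard_index(a: set, b: set) -> float:
    """Return |A ∩ B| / |A ∪ B|; treated as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def impute_from_most_similar(record: dict, donors: List[dict], target: str) -> Optional[object]:
    """Hypothetical helper: copy `target` from the donor most similar to `record`.

    Similarity is the Jaccard index over observed (field, value) pairs,
    excluding the target field. Returns None if no donor has the target.
    """
    observed = {(k, v) for k, v in record.items() if k != target and v is not None}
    best_value, best_score = None, -1.0
    for donor in donors:
        if donor.get(target) is None:
            continue  # a donor that lacks the target cannot supply it
        donor_items = {(k, v) for k, v in donor.items() if k != target and v is not None}
        score = jaccard_index(observed, donor_items)
        if score > best_score:
            best_value, best_score = donor[target], score
    return best_value


# Illustrative records (invented for this sketch, not from the paper's datasets).
donors = [
    {"type": "sensor", "zone": "north", "active": True, "status": "ok"},
    {"type": "camera", "zone": "south", "active": False, "status": "fault"},
]
record = {"type": "sensor", "zone": "north", "active": False, "status": None}

print(impute_from_most_similar(record, donors, "status"))
```

Here the first donor shares two of four distinct pairs with the incomplete record (index 0.5) and the second shares one of five (index 0.2), so the value is taken from the first donor. Set overlap suits categorical and binary fields; continuous fields would need a different similarity measure or prior discretisation.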

References

  1. Mohan, K., Pearl, J.: Graphical models for processing missing data. arXiv:1801.03583, stat.ME (2018)

  2. Azimi, I., Pahikkala, T., Rahmani, A.M., Niela-Vilén, H., Axelin, A., Liljeberg, P.: Missing data resilient decision-making for healthcare IoT through personalization: a case study on maternal health. Future Gener. Comput. Syst. 96, 297–308 (2019). https://doi.org/10.1016/j.future.2019.02.015. ISSN 0167-739X

  3. Ben Sta, H.: Quality and the efficiency of data in “Smart-Cities”. Future Gener. Comput. Syst. 74, 409–416 (2017). https://doi.org/10.1016/j.future.2016.12.021

  4. Schafer, J.L., Graham, J.W.: Missing data: our view of the state of the art. Psychol. Methods 7, 147–177 (2002). https://doi.org/10.1037/1082-989X.7.2.147

  5. Tan, Y., Zhang, C., Mao, Y., Qian, G.: Semantic presentation and fusion framework of unstructured data in smart cities. In: IEEE 10th Conference on Industrial Electronics and Applications (ICIEA), June 2015, pp. 897–901 (2015). https://doi.org/10.1109/ICIEA.2015.7334237

  6. Cearley, D.W.: Top 10 strategic technology trends for 2019. Gartner Inc., PR575107 (2019)

  7. Qin, X., Gu, Y.: Data fusion in the Internet of Things. Procedia Eng. 15, 3023–3026 (2011). https://doi.org/10.1016/j.proeng.2011.08.567

  8. Lau, B.P.L., et al.: A survey of data fusion in smart city applications. Inf. Fusion 52, 357–374 (2019). https://doi.org/10.1016/j.inffus.2019.05.004. ISSN 1566-2535

  9. Marjani, M., et al.: Big IoT data analytics: architecture, opportunities, and open research challenges. IEEE Access 5, 5247–5261 (2017). https://doi.org/10.1109/ACCESS.2017.2689040. ISSN 2169-3536

  10. Udell, M., Horn, C., Zadeh, R., Boyd, S.: Generalized low rank models. Found. Trends® Mach. Learn. 9, 1–118 (2016). https://doi.org/10.1561/2200000055. ISSN 1935-8237

  11. Housfater, A.S., Zhang, X.-P., Zhou, Y.: Nonlinear fusion of multiple sensors with missing data. In: IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, vol. 4, p. IV, May 2006. https://doi.org/10.1109/ICASSP.2006.1661130

  12. Sun, B., Saenko, K.: Correlation Alignment for Deep Domain Adaptation (2015)

  13. Sun, B., Feng, J., Saenko, K.: Correlation alignment for unsupervised domain adaptation. arXiv:1612.01939, cs.CV (2016)

  14. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. arXiv:1206.5538, cs.LG (2012)

  15. Bubenik, P.: Statistical topological data analysis using persistence landscapes. arXiv:1207.6437, math.AT (2012)

  16. Udell, M., Horn, C., Zadeh, R., Boyd, S.: Generalized Low Rank Models (2016). https://github.com/powerscorinne/GLRM

  17. Petrozziello, A., Jordanov, I., Sommeregger, C.: Distributed neural networks for missing big data imputation. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, July 2018. https://doi.org/10.1109/IJCNN.2018.8489488

  18. Baraldi, P., Di Maio, F., Genini, D., Zio, E.: Reconstruction of missing data in multidimensional time series by fuzzy similarity. Appl. Soft Comput. 26, 1–9 (2015). https://doi.org/10.1016/j.asoc.2014.09.038. ISSN 1568-4946

  19. Aggarwal, C.C., Parthasarathy, S.: Mining massively incomplete data sets by conceptual reconstruction. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2001, pp. 227–232. ACM, New York (2001). https://doi.org/10.1145/502512.502543. ISBN 1-58113-391-X

  20. Albergante, L., et al.: Robust and scalable learning of data manifolds with complex topologies via ElPiGraph. CoRR abs/1804.07580 (2018). arxiv.org/abs/1804.07580

  21. Bishop, C.M.: Model-based machine learning. Philos. Trans. R. Soc. Math. Phys. Eng. Sci. https://doi.org/10.1098/rsta.2012.0222

  22. Zhou, Z.-H.: Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, Boca Raton (2012)

  23. Geerts, F., Mecca, G., Papotti, P., Santoro, D.: Cleaning data with LLUNATIC. VLDB J. (2019). https://doi.org/10.1007/s00778-019-00586-5

  24. Musil, C.M., Warner, C.B., Yobas, P.K., Jones, S.L.: A comparison of imputation techniques for handling missing data. West. J. Nurs. Res. 24(7), 815–829 (2002)

Acknowledgments

The authors thank the NSERC Strategic Program (#506319-17) for its financial support, SARA at ÉTS for writing support during this work, and the GLRM team for publishing their synthetic dataset model, used as a reference project [16].

Author information

Correspondence to Andre Luis Costa Carvalho, Darine Ameyed or Mohamed Cheriet.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Carvalho, A.L.C., Ameyed, D., Cheriet, M. (2020). Ensemble Learning for Heterogeneous Missing Data Imputation. In: Nepal, S., Cao, W., Nasridinov, A., Bhuiyan, M.Z.A., Guo, X., Zhang, LJ. (eds) Big Data – BigData 2020. BIGDATA 2020. Lecture Notes in Computer Science, vol 12402. Springer, Cham. https://doi.org/10.1007/978-3-030-59612-5_10

  • DOI: https://doi.org/10.1007/978-3-030-59612-5_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59611-8

  • Online ISBN: 978-3-030-59612-5

  • eBook Packages: Computer Science (R0)
