Ensemble Learning for Heterogeneous Missing Data Imputation

Carvalho, Andre Luis Costa; Ameyed, Darine; Cheriet, Mohamed

doi:10.1007/978-3-030-59612-5_10

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12402)

Included in the following conference series:

International Conference on Big Data

Abstract

Missing values can significantly affect the results of analyses and decision making in any field. Two major approaches deal with this issue: statistical and model-based methods. While the former introduces bias into the analyses, the latter is usually designed for limited and specific use cases. To overcome the limitations of both, we present a stacked ensemble framework that integrates the adaptive random forest algorithm, the Jaccard index, and Bayesian probability. Given the challenge posed by heterogeneous and distributed data from multiple sources, the model built for our use case supports different data types: continuous, discrete, categorical, and binary. The proposed model tackles missing data in a broad and comprehensive context of massive data sources and data formats. We evaluated the framework extensively on five different datasets containing labelled and unlabelled data. The experiments showed that it produces encouraging and competitive results compared with statistical and model-based methods. Because the framework works across various datasets, it overcomes the limitations of model-based methods identified in the literature review.
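
The abstract names the Jaccard index as one ingredient of the framework but does not describe how it is applied. As a rough illustration only, and not the authors' method, the sketch below shows one way a Jaccard index could drive imputation of a categorical field: each record is treated as a set of observed (field, value) pairs, and the missing value is taken by majority vote among the most similar complete records. The names `jaccard_index`, `observed_items`, `impute_categorical`, the parameter `k`, and the sample records are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

# A record maps field names to categorical values; None marks a missing value.
Record = Dict[str, Optional[str]]


def jaccard_index(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b|, defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def observed_items(record: Record, skip: str) -> set:
    """Represent a record as a set of (field, value) pairs, ignoring missing
    values and the field that is being imputed."""
    return {(k, v) for k, v in record.items() if k != skip and v is not None}


def impute_categorical(
    target: Record, donors: List[Record], field: str, k: int = 3
) -> Optional[str]:
    """Hypothetical helper: return the majority value of `field` among the k
    donors most similar to `target` under the Jaccard index."""
    target_items = observed_items(target, skip=field)
    candidates = [d for d in donors if d.get(field) is not None]
    ranked = sorted(
        candidates,
        key=lambda d: jaccard_index(target_items, observed_items(d, skip=field)),
        reverse=True,
    )
    votes = Counter(d[field] for d in ranked[:k])
    return votes.most_common(1)[0][0] if votes else None


# Made-up sample data, for illustration only.
donors = [
    {"city": "Montreal", "sensor": "air", "status": "ok"},
    {"city": "Montreal", "sensor": "air", "status": "ok"},
    {"city": "Toronto", "sensor": "noise", "status": "fault"},
    {"city": "Toronto", "sensor": "air", "status": "fault"},
]
target = {"city": "Montreal", "sensor": "air", "status": None}
imputed = impute_categorical(target, donors, field="status", k=2)
```

A set-based similarity of this kind handles categorical and binary fields directly; continuous fields would need binning or a different measure, which is presumably part of why the paper combines several components rather than relying on one.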

References

1. Mohan, K., Pearl, J.: Graphical models for processing missing data. arXiv:1801.03583, stat.ME (2018)
2. Azimi, I., Pahikkala, T., Rahmani, A.M., Niela-Vilén, H., Axelin, A., Liljeberg, P.: Missing data resilient decision-making for healthcare IoT through personalization: a case study on maternal health. Future Gener. Comput. Syst. 96, 297–308 (2019). https://doi.org/10.1016/j.future.2019.02.015. ISSN 0167-739X
3. Hatem Ben Sta: Quality and the efficiency of data in “Smart-Cities”. Future Gener. Comput. Syst. 74, 409–416 (2017). https://doi.org/10.1016/j.future.2016.12.021
4. Schafer, L., Graham, J.W.: Missing data: our view of the state of the art. Psychol. Methods J. 7, 147–177 (2002). https://doi.org/10.1037/1082-989X.7.2.147
5. Tan, Y., Zhang, C., Mao, Y., Qian, G.: Semantic presentation and fusion framework of unstructured data in smart cites. In: IEEE 10th Conference on Industrial Electronics and Applications (ICIEA), June 2015, pp. 897–901 (2015). https://doi.org/10.1109/ICIEA.2015.7334237
6. Cearly, D.W.: Top 10 strategic technology trends for 2019. Gartner Inc. and/or its affiliates. All rights reserved. PR575107 (2019)
7. Qin, X., Gu, Y.: Data fusion in the Internet of Things. Procedia Eng. 15, 3023–3026 (2011). https://doi.org/10.1016/j.proeng.2011.08.567
8. Lau, B.P.L., et al.: A survey of data fusion in smart city applications. Inf. Fusion J. 52, 357–374 (2019). https://doi.org/10.1016/j.inffus.2019.05.004. ISSN 1566-2535
9. Marjani, M., et al.: Big IoT data analytics: architecture, opportunities, and open research challenges. IEEE Access 5, 5247–5261 (2017). https://doi.org/10.1109/ACCESS.2017.2689040. ISSN 2169-3536
10. Udell, M., Horn, C., Zadeh, R., Boyd, S.: Generalized low rank models. Found. Trends® Mach. Learn. 9, 1–118 (2016). https://doi.org/10.1561/2200000055. ISSN 1935-8237
11. Housfater, A.S., Zhang, X.-P., Zhou, Y.: Nonlinear fusion of multiple sensors with missing data. In: IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, vol. 4, p. IV, May 2006. https://doi.org/10.1109/ICASSP.2006.1661130
12. Sun, B., Saenko, K.: Correlation Alignment for Deep Domain Adaptation (2015)
13. Sun, B., Feng, J., Saenko, K.: Correlation alignment for unsupervised domain adaptation. arXiv:1612.01939, cs.CV (2016)
14. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. arXiv:1206.5538, cs.LG (2012)
15. Bubenik, P.: Statistical topological data analysis using persistence landscapes. arXiv:1207.6437, math.AT (2012)
16. Udell, M., Horn, C., Zadeh, R., Boyd, S.: Generalized Low Rank Models (2016). https://github.com/powerscorinne/GLRM
17. Petrozziello, A., Jordanov, I., Sommeregger, C.: Distributed neural networks for missing big data imputation. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, July 2018. https://doi.org/10.1109/IJCNN.2018.8489488
18. Baraldi, P., Di Maio, F., Genini, D., Zio, E.: Reconstruction of missing data in multidimensional time series by fuzzy similarity. Appl. Soft Comput. J. 26, 1–9 (2015). https://doi.org/10.1016/j.asoc.2014.09.038. ISSN 1568-4946
19. Aggarwal, C.C., Parthasarathy, S.: Mining massively incomplete data sets by conceptual reconstruction. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2001, pp. 227–232. ACM, New York (2001). https://doi.org/10.1145/502512.502543. ISBN 1-58113-391-X
20. Albergante, L., et al.: Robust and scalable learning of data manifolds with complex topologies via ElPiGraph. CoRR Journal, vol. abs/1804.07580, August 2018. arxiv.org/abs/1804.07580
21. Bishop, C.M.: Model-based machine learning. Philos. Trans. R. Soc. Math. Phys. Eng. Sci. https://doi.org/10.1098/rsta.2012.0222
22. Zhou, Z.-H.: Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, Boca Raton (2012)
23. Geerts, F., Mecca, G., Papotti, P., Santoro, D.: Cleaning data with LLUNATIC. VLDB J. (2019). https://doi.org/10.1007/s00778-019-00586-5
24. Musil, C.M., Warner, C.B., Yobas, P.K., Jones, S.L.: A comparison of imputation techniques for handling missing data. West. J. Nurs. Res. 24(7), 815–829 (2002)

Acknowledgments

The authors thank the NSERC Strategic Program (# 506319-17) for its financial support, SARA at ÉTS for writing support during this work, and the GLRM team for publishing their synthetic dataset model, which served as the reference project [16].

Author information

Authors and Affiliations

System Engineering Department, University of Quebec’s Ecole de Technologie Superieure, Montreal, QC, H3C 1K3, Canada
Andre Luis Costa Carvalho, Darine Ameyed & Mohamed Cheriet

Corresponding authors

Correspondence to Andre Luis Costa Carvalho, Darine Ameyed or Mohamed Cheriet.

Editor information

Editors and Affiliations

CSIRO, Marsfield, NSW, Australia
Surya Nepal
Facebook (United States), Menlo Park, CA, USA
Wenqi Cao
Chungbuk National University, Cheongju, Korea (Republic of)
Aziz Nasridinov
Fordham University, Bronx, NY, USA
MD Zakirul Alam Bhuiyan
University of North Texas, Denton, TX, USA
Xuan Guo
Kingdee International Software Group Co. Ltd., Shenzhen, China
Liang-Jie Zhang

About this paper

Cite this paper

Carvalho, A.L.C., Ameyed, D., Cheriet, M. (2020). Ensemble Learning for Heterogeneous Missing Data Imputation. In: Nepal, S., Cao, W., Nasridinov, A., Bhuiyan, M.Z.A., Guo, X., Zhang, LJ. (eds) Big Data – BigData 2020. BIGDATA 2020. Lecture Notes in Computer Science, vol 12402. Springer, Cham. https://doi.org/10.1007/978-3-030-59612-5_10

DOI: https://doi.org/10.1007/978-3-030-59612-5_10
Published: 18 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59611-8
Online ISBN: 978-3-030-59612-5
eBook Packages: Computer Science, Computer Science (R0)
