Abstract
Missing values can significantly affect the results of analyses and decision making in any field. Two major approaches address this issue: statistical methods and model-based methods. The former introduce bias into the analyses, while the latter are usually designed for limited, specific use cases. To overcome the limitations of both, we present a stacked ensemble framework that integrates the adaptive random forest algorithm, the Jaccard index, and Bayesian probability. Because heterogeneous, distributed data from multiple sources poses a particular challenge, the model built for our use case supports different data types: continuous, discrete, categorical, and binary. The proposed model tackles missing data in a broad and comprehensive context of massive data sources and data formats. We evaluated the framework extensively on five datasets containing labelled and unlabelled data. The experiments showed that it produces encouraging and competitive results compared with statistical and model-based methods. Since the framework works across various datasets, it overcomes the limitations of model-based methods identified in the literature review.
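As a rough illustration of one ingredient named above, the Jaccard index can serve as a similarity measure for filling a missing categorical value from the most similar complete record. The sketch below is not the framework described in this chapter: the function names (`jaccard`, `observed_pairs`, `impute_by_jaccard`), the record layout, and the nearest-donor strategy are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional

# A record maps attribute names to categorical values; None marks a missing value.
Record = Dict[str, Optional[str]]


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; taken as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def observed_pairs(record: Record, exclude: str) -> set:
    """Set of (attribute, value) pairs that are observed, ignoring one attribute."""
    return {(k, v) for k, v in record.items() if k != exclude and v is not None}


def impute_by_jaccard(target: Record, donors: List[Record], field: str) -> Optional[str]:
    """Hypothetical helper: copy `field` from the donor most similar to `target`.

    Similarity is the Jaccard index over observed (attribute, value) pairs,
    excluding the attribute being imputed. Donors missing `field` are skipped.
    """
    target_pairs = observed_pairs(target, field)
    best_value, best_score = None, -1.0
    for donor in donors:
        if donor.get(field) is None:
            continue
        score = jaccard(target_pairs, observed_pairs(donor, field))
        if score > best_score:
            best_value, best_score = donor[field], score
    return best_value


# Toy data, invented for this example.
donors = [
    {"colour": "red", "shape": "round", "size": "small", "label": "apple"},
    {"colour": "yellow", "shape": "long", "size": "medium", "label": "banana"},
]
target = {"colour": "red", "shape": "round", "size": "medium", "label": None}

imputed = impute_by_jaccard(target, donors, "label")
print(imputed)
```

A single-donor rule like this is only one way to use set similarity. A stacked ensemble would instead combine such a signal with other learners.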
References
Mohan, K., Pearl, J.: Graphical models for processing missing data. arXiv:1801.03583, stat.ME (2018)
Azimi, I., Pahikkala, T., Rahmani, A.M., Niela-Vilén, H., Axelin, A., Liljeberg, P.: Missing data resilient decision-making for healthcare IoT through personalization: a case study on maternal health. Future Gener. Comput. Syst. 96, 297–308 (2019). https://doi.org/10.1016/j.future.2019.02.015. ISSN 0167-739X
Hatem Ben Sta: Quality and the efficiency of data in “Smart-Cities”. Future Gener. Comput. Syst. 74, 409–416 (2017). https://doi.org/10.1016/j.future.2016.12.021
Schafer, J.L., Graham, J.W.: Missing data: our view of the state of the art. Psychol. Methods 7, 147–177 (2002). https://doi.org/10.1037/1082-989X.7.2.147
Tan, Y., Zhang, C., Mao, Y., Qian, G.: Semantic presentation and fusion framework of unstructured data in smart cites. In: IEEE 10th Conference on Industrial Electronics and Applications (ICIEA), June 2015, pp. 897–901 (2015). https://doi.org/10.1109/ICIEA.2015.7334237
Cearly, D.W.: Top 10 strategic technology trends for 2019. Gartner Inc. and/or its affiliates. All rights reserved. PR575107 (2019)
Qin, X., Gu, Y.: Data fusion in the Internet of Things. Procedia Eng. 15, 3023–3026 (2011). https://doi.org/10.1016/j.proeng.2011.08.567
Lau, B.P.L., et al.: A survey of data fusion in smart city applications. Inf. Fusion J. 52, 357–374 (2019). https://doi.org/10.1016/j.inffus.2019.05.004. ISSN 1566-2535
Marjani, M., et al.: Big IoT data analytics: architecture, opportunities, and open research challenges. IEEE Access 5, 5247–5261 (2017). https://doi.org/10.1109/ACCESS.2017.2689040. ISSN 2169-3536
Udell, M., Horn, C., Zadeh, R., Boyd, S.: Generalized low rank models. Found. Trends® Mach. Learn. 9, 1–118 (2016). https://doi.org/10.1561/2200000055. ISSN 1935-8237
Housfater, A.S., Zhang, X.-P., Zhou, Y.: Nonlinear fusion of multiple sensors with missing data. In: IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, vol. 4, p. IV, May 2006. https://doi.org/10.1109/ICASSP.2006.1661130
Sun, B., Saenko, K.: Correlation Alignment for Deep Domain Adaptation (2015)
Sun, B., Feng, J., Saenko, K.: Correlation alignment for unsupervised domain adaptation. arXiv:1612.01939, cs.CV (2016)
Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. arXiv:1206.5538, cs.LG (2012)
Bubenik, P.: Statistical topological data analysis using persistence landscapes. arXiv:1207.6437, math.AT (2012)
Udell, M., Horn, C., Zadeh, R., Boyd, S.: Generalized Low Rank Models (2016). https://github.com/powerscorinne/GLRM
Petrozziello, A., Jordanov, I., Sommeregger, C.: Distributed neural networks for missing big data imputation. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, July 2018. https://doi.org/10.1109/IJCNN.2018.8489488
Baraldi, P., Di Maio, F., Genini, D., Zio, E.: Reconstruction of missing data in multidimensional time series by fuzzy similarity. Appl. Soft Comput. J. 26, 1–9 (2015). https://doi.org/10.1016/j.asoc.2014.09.038. ISSN 1568-4946
Aggarwal, C.C., Parthasarathy, S.: Mining massively incomplete data sets by conceptual reconstruction. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2001, pp. 227–232. ACM, New York (2001). https://doi.org/10.1145/502512.502543. ISBN 1-58113-391-X
Albergante, L., et al.: Robust and scalable learning of data manifolds with complex topologies via ElPiGraph. CoRR Journal, vol. abs/1804.07580, August 2018. arxiv.org/abs/1804.07580
Bishop, C.M.: Model-based machine learning. Philos. Trans. R. Soc. Math. Phys. Eng. Sci. https://doi.org/10.1098/rsta.2012.0222
Zhou, Z.-H.: Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, Boca Raton (2012)
Geerts, F., Mecca, G., Papotti, P., Santoro, D.: Cleaning data with LLUNATIC. VLDB J. (2019). https://doi.org/10.1007/s00778-019-00586-5
Musil, C.M., Warner, C.B., Yobas, P.K., Jones, S.L.: A comparison of imputation techniques for handling missing data. West. J. Nurs. Res. 24(7), 815–829 (2002)
Acknowledgments
The authors thank NSERC Strategic Program #506319-17 for its financial support, SARA at ÉTS for writing support during this work, and the GLRM team for publishing their synthetic dataset model, used as a reference project [16].
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Carvalho, A.L.C., Ameyed, D., Cheriet, M. (2020). Ensemble Learning for Heterogeneous Missing Data Imputation. In: Nepal, S., Cao, W., Nasridinov, A., Bhuiyan, M.Z.A., Guo, X., Zhang, LJ. (eds) Big Data – BigData 2020. BIGDATA 2020. Lecture Notes in Computer Science, vol 12402. Springer, Cham. https://doi.org/10.1007/978-3-030-59612-5_10
DOI: https://doi.org/10.1007/978-3-030-59612-5_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59611-8
Online ISBN: 978-3-030-59612-5
eBook Packages: Computer Science, Computer Science (R0)