Abstract
Machine Learning problems are growing significantly in complexity, whether due to an increase in the volume of data, to new forms of data, or to changes in the data over time. This poses new challenges that are both technical and scientific. In this paper we propose a Distributed Learning System that runs on top of a Hadoop cluster and leverages its native functionalities, guided by the principle of data locality. Since data are distributed across the cluster, models are likewise distributed and trained in parallel. Each model is thus an Ensemble of base models, and predictions are made by combining the predictions of the base models. Moreover, models are replicated across the cluster so that multiple nodes can answer requests. The result is a system that is both resilient and highly available.
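For illustration only, the sketch below shows one plausible way to combine the predictions of base models trained on separate data partitions, here by simple majority voting; the function name and the voting scheme are assumptions of this example, not necessarily the combination strategy used by the proposed system.

```python
from collections import Counter
from typing import List, Sequence

def combine_predictions(base_predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-instance predictions of several base models by majority vote.

    base_predictions[m][i] is the class predicted by base model m for instance i.
    (Illustrative assumption: hard voting over class labels.)
    """
    combined = []
    n_instances = len(base_predictions[0])
    for i in range(n_instances):
        votes = Counter(model_preds[i] for model_preds in base_predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined

# Example: three base models, each trained on a different data partition,
# vote on four instances.
preds_node_a = [0, 1, 1, 0]
preds_node_b = [0, 1, 0, 0]
preds_node_c = [1, 1, 1, 0]
print(combine_predictions([preds_node_a, preds_node_b, preds_node_c]))  # [0, 1, 1, 0]
```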
Acknowledgements
This work was supported by the Northern Regional Operational Program, Portugal 2020 and the European Union, through the European Regional Development Fund (ERDF), in the scope of project number 39900 - 31/SI/2017, and by FCT (Fundação para a Ciência e Tecnologia) within projects UIDB/04728/2020 and UIDB/00319/2020.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Carneiro, D., Oliveira, F., Novais, P. (2022). A Data-Locality-Aware Distributed Learning System. In: Novais, P., Carneiro, J., Chamoso, P. (eds) Ambient Intelligence – Software and Applications – 12th International Symposium on Ambient Intelligence. ISAmI 2021. Lecture Notes in Networks and Systems, vol 483. Springer, Cham. https://doi.org/10.1007/978-3-031-06894-2_6
DOI: https://doi.org/10.1007/978-3-031-06894-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06893-5
Online ISBN: 978-3-031-06894-2
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)