Abstract
This paper examines data distribution for machine learning tasks, weighing the advantages and drawbacks commonly associated with data partitioning and its different models. From the perspective of how the data are distributed, it reviews algorithms that have been applied in each case, although it is not a survey of learning or computation algorithms as such. Finally, it discusses the challenges that data partitioning-based models such as MapReduce have introduced for distributed learning.
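To make the data-partitioning setting concrete, the following is a minimal sketch of distributed learning in the MapReduce style the paper surveys: each data partition trains a local model independently (the map phase), and the local models' predictions are fused by majority vote (the reduce phase). The nearest-centroid learner and the toy data are illustrative assumptions, not algorithms from the paper.

```python
# Sketch: partition-based (MapReduce-style) learning with majority-vote fusion.
# Assumed for illustration: a trivial nearest-centroid learner on 2-D points.
from collections import Counter, defaultdict

def train_local(partition):
    """Map step: fit a nearest-centroid model on one data partition."""
    sums, counts = defaultdict(lambda: [0.0, 0.0]), Counter()
    for (x1, x2), label in partition:
        sums[label][0] += x1
        sums[label][1] += x2
        counts[label] += 1
    return {lbl: (s[0] / counts[lbl], s[1] / counts[lbl])
            for lbl, s in sums.items()}

def predict(model, point):
    """Assign the class whose centroid is closest to the query point."""
    return min(model, key=lambda lbl: (model[lbl][0] - point[0]) ** 2
                                      + (model[lbl][1] - point[1]) ** 2)

def ensemble_predict(models, point):
    """Reduce step: fuse the local models' votes by simple majority."""
    votes = Counter(predict(m, point) for m in models)
    return votes.most_common(1)[0][0]

# Two horizontal partitions (shards) of a toy two-class data set.
shard_a = [((0.0, 0.1), "neg"), ((0.2, 0.0), "neg"), ((1.0, 1.1), "pos")]
shard_b = [((0.1, 0.2), "neg"), ((0.9, 1.0), "pos"), ((1.1, 0.9), "pos")]

models = [train_local(s) for s in (shard_a, shard_b)]  # map phase
print(ensemble_predict(models, (1.0, 1.0)))            # reduce phase -> pos
```

This horizontal split with per-partition models and a voting combiner is the same coarse-grained pattern underlying the ensemble and meta-learning approaches the paper reviews; real systems replace the toy learner with full classifiers and run the map phase on separate nodes.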
Acknowledgements
This work was partially supported by the Spanish Ministry of Education and Science under Project TIN2014-57251-P.
© 2017 Springer International Publishing AG
Rodríguez, M.Á., Fernández, A., Peregrín, A., Herrera, F. (2017). A Review of Distributed Data Models for Learning. In: Martínez de Pisón, F., Urraca, R., Quintián, H., Corchado, E. (eds.) Hybrid Artificial Intelligent Systems. HAIS 2017. Lecture Notes in Computer Science, vol. 10334. Springer, Cham. https://doi.org/10.1007/978-3-319-59650-1_8
Print ISBN: 978-3-319-59649-5
Online ISBN: 978-3-319-59650-1