Abstract
This paper examines data distribution for machine learning tasks, weighing the advantages and drawbacks commonly associated with data partitioning and its different models. From the perspective of how the data are distributed, it reviews algorithms that have been applied in each case, although it is not a survey of learning or computation algorithms as such. Finally, it discusses the challenges that data partitioning-based models such as MapReduce have introduced for distributed learning.
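To make the data-partitioning setting concrete, the following is a minimal sketch of distributed learning in the MapReduce style the paper surveys: each data partition trains a local model independently (the map phase), and the local models' predictions are fused by majority vote (the reduce phase). The nearest-centroid learner and the toy data are illustrative assumptions, not algorithms from the paper.

```python
# Sketch: partition-based (MapReduce-style) learning with majority-vote fusion.
# Assumed for illustration: a trivial nearest-centroid learner on 2-D points.
from collections import Counter, defaultdict

def train_local(partition):
    """Map step: fit a nearest-centroid model on one data partition."""
    sums, counts = defaultdict(lambda: [0.0, 0.0]), Counter()
    for (x1, x2), label in partition:
        sums[label][0] += x1
        sums[label][1] += x2
        counts[label] += 1
    return {lbl: (s[0] / counts[lbl], s[1] / counts[lbl])
            for lbl, s in sums.items()}

def predict(model, point):
    """Assign the class whose centroid is closest to the query point."""
    return min(model, key=lambda lbl: (model[lbl][0] - point[0]) ** 2
                                      + (model[lbl][1] - point[1]) ** 2)

def ensemble_predict(models, point):
    """Reduce step: fuse the local models' votes by simple majority."""
    votes = Counter(predict(m, point) for m in models)
    return votes.most_common(1)[0][0]

# Two horizontal partitions (shards) of a toy two-class data set.
shard_a = [((0.0, 0.1), "neg"), ((0.2, 0.0), "neg"), ((1.0, 1.1), "pos")]
shard_b = [((0.1, 0.2), "neg"), ((0.9, 1.0), "pos"), ((1.1, 0.9), "pos")]

models = [train_local(s) for s in (shard_a, shard_b)]  # map phase
print(ensemble_predict(models, (1.0, 1.0)))            # reduce phase -> pos
```

This horizontal split with per-partition models and a voting combiner is the same coarse-grained pattern underlying the ensemble and meta-learning approaches the paper reviews; real systems replace the toy learner with full classifiers and run the map phase on separate nodes.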
Acknowledgements
This work was partially supported by the Spanish Ministry of Education and Science under Project TIN2014-57251-P.
© 2017 Springer International Publishing AG
Rodríguez, M.Á., Fernández, A., Peregrín, A., Herrera, F. (2017). A Review of Distributed Data Models for Learning. In: Martínez de Pisón, F., Urraca, R., Quintián, H., Corchado, E. (eds.) Hybrid Artificial Intelligent Systems. HAIS 2017. Lecture Notes in Computer Science, vol. 10334. Springer, Cham. https://doi.org/10.1007/978-3-319-59650-1_8
Print ISBN: 978-3-319-59649-5
Online ISBN: 978-3-319-59650-1