MR-DIS: democratic instance selection for big data by MapReduce

Arnaiz-González, Álvar; González-Rogel, Alejandro; Díez-Pastor, José-Francisco; López-Nozal, Carlos

doi:10.1007/s13748-017-0117-5

Regular Paper
Published: 10 February 2017

Volume 6, pages 211–219 (2017)

Progress in Artificial Intelligence

Álvar Arnaiz-González¹,
Alejandro González-Rogel¹,
José-Francisco Díez-Pastor¹ &
Carlos López-Nozal¹

Abstract

Instance selection is a popular preprocessing task in knowledge discovery and data mining. Its purpose is to reduce the size of data sets while maintaining their predictive capabilities. A common problem is that these methods often suffer from high computational complexity, which makes them impractical for processing huge data sets. In this paper, a parallel implementation of the Democratic Instance Selection (DIS) algorithm is presented. The main advantages of DIS are its computational complexity, which is linear in the number of instances, and its internal structure, which lends itself naturally to parallelization. The purpose of this paper is threefold: first, to design the DIS algorithm following the MapReduce model; second, to implement it in the popular big data framework Spark; and finally, to compare it empirically over large-scale data sets. The results show that the processing time decreases linearly as the number of Spark executors increases, which makes the method suitable for big data applications. In addition, the algorithm is publicly accessible to the scientific community.
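
To make the map and reduce roles concrete, the following is a minimal sketch in plain Python, not the authors' Spark implementation. It assumes the general voting scheme of DIS: in each round the data are split into disjoint subsets, a base selector runs on every subset, each instance it proposes for removal receives one vote, and instances whose votes reach a threshold are discarded. The function names, the toy selector `flag_redundant`, the random partitioning and the fixed `threshold` are all hypothetical simplifications introduced here for illustration.

```python
import random
from collections import Counter


def flag_redundant(subset):
    """Toy per-subset selector (an assumption, not the paper's base method):
    flag an instance when its nearest neighbour in the subset shares its label."""
    flagged = []
    for i, (idx, x, y) in enumerate(subset):
        others = [inst for j, inst in enumerate(subset) if j != i]
        if not others:
            continue
        nearest = min(
            others,
            key=lambda o: sum((a - b) ** 2 for a, b in zip(x, o[1])),
        )
        if nearest[2] == y:
            flagged.append(idx)
    return flagged


def map_round(data, subset_size, rng):
    """Map step: split the data into disjoint subsets and emit (id, 1) votes."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    votes = []
    for start in range(0, len(shuffled), subset_size):
        subset = shuffled[start:start + subset_size]
        votes.extend((idx, 1) for idx in flag_redundant(subset))
    return votes


def reduce_votes(all_votes):
    """Reduce step: sum the votes received by each instance id."""
    totals = Counter()
    for idx, vote in all_votes:
        totals[idx] += vote
    return totals


def democratic_selection(data, rounds=5, subset_size=4, threshold=3, seed=0):
    """Keep the instances that received fewer than `threshold` removal votes."""
    rng = random.Random(seed)
    all_votes = []
    for _ in range(rounds):
        all_votes.extend(map_round(data, subset_size, rng))
    totals = reduce_votes(all_votes)
    return [inst for inst in data if totals[inst[0]] < threshold]


# Each instance is (id, features, label); two well-separated classes.
data = [
    (i, (float(i % 6), 0.0 if i < 6 else 10.0), 0 if i < 6 else 1)
    for i in range(12)
]
selected = democratic_selection(data)
```

Because every subset within a round is processed independently, the map step is the part that a framework such as Spark could distribute across executors, while the reduce step only has to aggregate small (id, vote) pairs. Each subset has a bounded size, so the cost of a round grows with the number of subsets rather than quadratically with the number of instances.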

Notes

In the present paper, the subset selected by the algorithm is referred to interchangeably as the filtered set or the selected set.
We recommend the work of S. García et al. [10] for readers interested in this field.
In Spark, the process located between the map and the reduce phases is usually referred to as shuffle.
Author: Alejandro González-Rogel, https://bitbucket.org/agr00095/tfg-alg.-seleccion-instancias-spark.
In the Spark framework, each worker node has one or more executors, each one of which completes a task. A processor was assigned to each executor in the experimental work.
In [12] the percentage of instances used for error estimation in massive data sets was \(0.1\%\), but the use of a parallel implementation of 1-NN permits an increase in this percentage, improving the precision of the estimation.
Google Cloud Platform: https://cloud.google.com/dataproc/.

References

1. Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, AFIPS ’67 (Spring), pp. 483–485. ACM, New York (1967). doi:10.1145/1465482.1465560
2. Angiulli, F., Folino, G.: Distributed nearest neighbor-based condensation of very large data sets. IEEE Trans. Knowl. Data Eng. 19(12), 1593–1606 (2007). doi:10.1109/TKDE.2007.190665
3. Arnaiz-González, Á., Díez-Pastor, J.F., Rodríguez, J.J., García-Osorio, C.I.: Instance selection of linear complexity for big data. Knowl. Based Syst. 107, 83–95 (2016). doi:10.1016/j.knosys.2016.05.056
4. Asimov, D.: The grand tour: a tool for viewing multidimensional data. SIAM J. Sci. Stat. Comput. 6(1), 128–143 (1985)
5. Brighton, H., Mellish, C.: Advances in instance selection for instance-based learning algorithms. Data Min. Knowl. Discov. 6(2), 153–172 (2002). doi:10.1023/A:1014043630878
6. Cano, J.R., Herrera, F., Lozano, M.: Stratification for scaling up evolutionary prototype selection. Pattern Recognit. Lett. 26(7), 953–963 (2005). doi:10.1016/j.patrec.2004.09.043
7. Chen, M., Mao, S., Liu, Y.: Big data: a survey. Mob. Netw. Appl. 19(2), 171–209 (2014). doi:10.1007/s11036-013-0489-0
8. de Haro-García, A., García-Pedrajas, N.: A divide-and-conquer recursive approach for scaling up instance selection algorithms. Data Min. Knowl. Discov. 18(3), 392–418 (2009). doi:10.1007/s10618-008-0121-2
9. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008). doi:10.1145/1327452.1327492
10. Garcia, S., Derrac, J., Cano, J., Herrera, F.: Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 417–435 (2012). doi:10.1109/TPAMI.2011.142
11. García, S., Luengo, J., Herrera, F.: Data Preprocessing in Data Mining. Springer, Berlin (2014)
12. García-Osorio, C., de Haro-García, A., García-Pedrajas, N.: Democratic instance selection: a linear complexity instance selection algorithm based on classifier ensemble concepts. Artif. Intell. 174(5–6), 410–441 (2010). doi:10.1016/j.artint.2010.01.001
13. Grama, A.Y., Gupta, A., Kumar, V.: Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Parallel Distrib. Technol. 1(3), 12–21 (1993). doi:10.1109/88.242438
14. Hart, P.: The condensed nearest neighbor rule (corresp.). IEEE Trans. Inf. Theory 14(3), 515–516 (1968)
15. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pp. 604–613. ACM, New York (1998). doi:10.1145/276698.276876
16. Laney, D.: 3-d data management: controlling data volume, velocity and variety, Technical Report META Group Research Note (2001)
17. Leyva, E., González, A., Pérez, R.: Three new instance selection methods based on local sets: a comparative study with several approaches from a bi-objective perspective. Pattern Recognit. 48(4), 1523–1537 (2015). doi:10.1016/j.patcog.2014.10.001
18. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
19. Maillo, J., Ramírez, S., Triguero, I., Herrera, F.: kNN-IS: An iterative spark-based design of the k-nearest neighbors classifier for big data. Knowledge-Based Systems (2016). doi:10.1016/j.knosys.2016.06.012
20. Minelli, M., Chambers, M., Dhiraj, A.: Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. Wiley, London (2012). doi:10.1002/9781118562260.fmatter
21. Ramírez-Gallego, S., García, S., Mouriño Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Alonso-Betanzos, A., Benítez, J.M., Herrera, F.: Data discretization: taxonomy and big data challenge. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 6(1), 5–21 (2016). doi:10.1002/widm.1173
22. Triguero, I., Peralta, D., Bacardit, J., García, S., Herrera, F.: Mrpr: a mapreduce solution for prototype reduction in big data classification. Neurocomputing 150 Part A, 331–345 (2015). doi:10.1016/j.neucom.2014.04.078
23. Tsai, C.F., Lin, W.C., Ke, S.W.: Big data mining with parallel computing: a comparison of distributed and mapreduce methodologies. J. Syst. Softw. 122, 83–92 (2016). doi:10.1016/j.jss.2016.09.007
24. Wilson, D.R., Martinez, T.R.: Instance pruning techniques. In: Machine Learning: Proceedings of the Fourteenth International Conference (ICML97), pp. 404–411. Morgan Kaufmann (1997)
25. Wu, X., Zhu, X., Wu, G.Q., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2014). doi:10.1109/TKDE.2013.109
26. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10, 10–10 (2010)

Acknowledgements

This work was funded by the Ministry of Economy and Competitiveness, Project TIN 2015-67534-P.

Author information

Authors and Affiliations

University of Burgos, Avda. Cantabria s/n, 09006, Burgos, Burgos, Spain
Álvar Arnaiz-González, Alejandro González-Rogel, José-Francisco Díez-Pastor & Carlos López-Nozal

Corresponding author

Correspondence to Álvar Arnaiz-González.

About this article

Cite this article

Arnaiz-González, Á., González-Rogel, A., Díez-Pastor, JF. et al. MR-DIS: democratic instance selection for big data by MapReduce. Prog Artif Intell 6, 211–219 (2017). https://doi.org/10.1007/s13748-017-0117-5

Received: 18 November 2016
Accepted: 28 January 2017
Published: 10 February 2017
Issue Date: September 2017
DOI: https://doi.org/10.1007/s13748-017-0117-5
