
MR-DIS: democratic instance selection for big data by MapReduce

  • Regular Paper
  • Progress in Artificial Intelligence

Abstract

Instance selection is a popular preprocessing task in knowledge discovery and data mining. Its purpose is to reduce the size of data sets while maintaining their predictive capabilities. A common problem is that these methods often suffer from high computational complexity, which makes them impractical for processing huge data sets. In this paper, a parallel implementation of the Democratic Instance Selection (DIS) algorithm is presented. The main advantages of the DIS algorithm are its computational complexity, which is linear in the number of instances, and its internal structure, which lends itself naturally to parallelization. The purpose of this paper is threefold: first, the design of the DIS algorithm following the MapReduce model; second, its implementation in the popular big data framework Spark; and finally, its empirical comparison on large-scale data sets. The results show that the processing time decreases linearly as the number of Spark executors increases, which makes the algorithm suitable for big data applications. In addition, the algorithm is publicly available to the scientific community.
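The abstract does not spell out the steps of the algorithm, so the following is only a minimal sketch of how a democratic (voting-based) instance selection could be phrased as a map step and a reduce step. It assumes the general scheme of several rounds of random disjoint partitioning, a base selector that flags instances for removal inside each partition, and a vote count per instance. The base selector `toy_selector`, the fixed `threshold`, and all function names are hypothetical stand-ins, not the authors' design. The code is plain Python rather than Spark, so it can be run without a cluster.

```python
import random
from collections import defaultdict

# An instance is a tuple (id, features, label). All names below are illustrative.

def nearest_label(inst, others):
    """Label of the instance in `others` closest to `inst` (squared Euclidean distance)."""
    best = min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(inst[1], o[1])))
    return best[2]

def toy_selector(partition):
    """Hypothetical base selector: flag an instance when its nearest neighbour
    inside the same partition carries a different label (a 1-NN noise filter)."""
    flagged = []
    for inst in partition:
        others = [o for o in partition if o[0] != inst[0]]
        if others and nearest_label(inst, others) != inst[2]:
            flagged.append(inst[0])
    return flagged

def map_phase(data, n_partitions, rng):
    """One round: split the data into random disjoint partitions and emit
    an (id, 1) pair for every instance flagged within its partition."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    partitions = [shuffled[i::n_partitions] for i in range(n_partitions)]
    return [(idx, 1) for part in partitions for idx in toy_selector(part)]

def reduce_phase(pairs):
    """Sum the votes per instance id; grouping by key stands in for the shuffle."""
    votes = defaultdict(int)
    for idx, vote in pairs:
        votes[idx] += vote
    return votes

def democratic_selection(data, rounds=10, n_partitions=2, threshold=5, seed=0):
    """Keep the instances that received fewer than `threshold` removal votes.
    A fixed threshold is an assumption made for this sketch."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(rounds):
        pairs.extend(map_phase(data, n_partitions, rng))
    votes = reduce_phase(pairs)
    return [inst for inst in data if votes.get(inst[0], 0) < threshold]
```

Because each partition is processed independently in the map step and the reduce step only adds up integers per key, the rounds and partitions could in principle be spread over many workers, which is the property the abstract describes as intuitively parallelizable.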


Notes

  1. The subset selected by the algorithm is referred to interchangeably as the filtered set or the selected set in the present paper.

  2. We recommend the work of S. García et al. [10] for readers interested in this field.

  3. In Spark, the process located between the map and the reduce phases is usually referred to as shuffle.

  4. Author: Alejandro González-Rogel, https://bitbucket.org/agr00095/tfg-alg.-seleccion-instancias-spark.

  5. In the Spark framework, each worker node has one or more executors, each of which completes a task. In the experimental work, one processor was assigned to each executor.

  6. In [12] the percentage of instances used for error estimation in massive data sets was \(0.1\%\), but the use of a parallel implementation of 1-NN permits an increase in this percentage, improving the precision of the estimation.

  7. Google Cloud Platform: https://cloud.google.com/dataproc/.
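Note 6 mentions estimating the error on a percentage of the instances with a 1-NN classifier. A minimal serial sketch of that idea follows. It assumes instances are `(features, label)` pairs and that the error is the fraction of sampled instances misclassified by 1-NN against the selected set. The function names and the `fraction` parameter are hypothetical, and the paper's parallel 1-NN implementation is not reproduced here.

```python
import random

def one_nn_predict(query, reference):
    """Label of the reference instance closest to `query` (squared Euclidean distance)."""
    nearest = min(reference, key=lambda r: sum((a - b) ** 2 for a, b in zip(query, r[0])))
    return nearest[1]

def sampled_error(full_set, selected_set, fraction, seed=0):
    """Estimate the 1-NN error of `selected_set` on a random `fraction` of `full_set`.
    At least one instance is always sampled."""
    rng = random.Random(seed)
    n = max(1, round(len(full_set) * fraction))
    sample = rng.sample(full_set, n)
    wrong = sum(1 for features, label in sample
                if one_nn_predict(features, selected_set) != label)
    return wrong / n
```

A larger `fraction` gives a more precise estimate at the cost of more nearest-neighbour searches, which is the trade-off the note attributes to having a parallel 1-NN available.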

References

  1. Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, AFIPS ’67 (Spring), pp. 483–485. ACM, New York (1967). doi:10.1145/1465482.1465560

  2. Angiulli, F., Folino, G.: Distributed nearest neighbor-based condensation of very large data sets. IEEE Trans. Knowl. Data Eng. 19(12), 1593–1606 (2007). doi:10.1109/TKDE.2007.190665


  3. Arnaiz-González, Á., Díez-Pastor, J.F., Rodríguez, J.J., García-Osorio, C.I.: Instance selection of linear complexity for big data. Knowl. Based Syst. 107, 83–95 (2016). doi:10.1016/j.knosys.2016.05.056


  4. Asimov, D.: The grand tour: a tool for viewing multidimensional data. SIAM J. Sci. Stat. Comput. 6(1), 128–143 (1985)


  5. Brighton, H., Mellish, C.: Advances in instance selection for instance-based learning algorithms. Data Min. Knowl. Discov. 6(2), 153–172 (2002). doi:10.1023/A:1014043630878


  6. Cano, J.R., Herrera, F., Lozano, M.: Stratification for scaling up evolutionary prototype selection. Pattern Recognit. Lett. 26(7), 953–963 (2005). doi:10.1016/j.patrec.2004.09.043


  7. Chen, M., Mao, S., Liu, Y.: Big data: a survey. Mob. Netw. Appl. 19(2), 171–209 (2014). doi:10.1007/s11036-013-0489-0


  8. de Haro-García, A., García-Pedrajas, N.: A divide-and-conquer recursive approach for scaling up instance selection algorithms. Data Min. Knowl. Discov. 18(3), 392–418 (2009). doi:10.1007/s10618-008-0121-2

  9. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008). doi:10.1145/1327452.1327492


  10. García, S., Derrac, J., Cano, J., Herrera, F.: Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 417–435 (2012). doi:10.1109/TPAMI.2011.142


  11. García, S., Luengo, J., Herrera, F.: Data Preprocessing in Data Mining. Springer, Berlin (2014)

  12. García-Osorio, C., de Haro-García, A., García-Pedrajas, N.: Democratic instance selection: a linear complexity instance selection algorithm based on classifier ensemble concepts. Artif. Intell. 174(5–6), 410–441 (2010). doi:10.1016/j.artint.2010.01.001


  13. Grama, A.Y., Gupta, A., Kumar, V.: Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Parallel Distrib. Technol. 1(3), 12–21 (1993). doi:10.1109/88.242438


  14. Hart, P.: The condensed nearest neighbor rule (corresp.). IEEE Trans. Inf. Theory 14(3), 515–516 (1968)


  15. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pp. 604–613. ACM, New York (1998). doi:10.1145/276698.276876

  16. Laney, D.: 3-D data management: controlling data volume, velocity and variety. Technical Report, META Group Research Note (2001)

  17. Leyva, E., González, A., Pérez, R.: Three new instance selection methods based on local sets: a comparative study with several approaches from a bi-objective perspective. Pattern Recognit. 48(4), 1523–1537 (2015). doi:10.1016/j.patcog.2014.10.001


  18. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml

  19. Maillo, J., Ramírez, S., Triguero, I., Herrera, F.: kNN-IS: an iterative Spark-based design of the k-nearest neighbors classifier for big data. Knowl. Based Syst. (2016). doi:10.1016/j.knosys.2016.06.012


  20. Minelli, M., Chambers, M., Dhiraj, A.: Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. Wiley, London (2012). doi:10.1002/9781118562260.fmatter


  21. Ramírez-Gallego, S., García, S., Mouriño Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Alonso-Betanzos, A., Benítez, J.M., Herrera, F.: Data discretization: taxonomy and big data challenge. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 6(1), 5–21 (2016). doi:10.1002/widm.1173


  22. Triguero, I., Peralta, D., Bacardit, J., García, S., Herrera, F.: MRPR: a MapReduce solution for prototype reduction in big data classification. Neurocomputing 150 Part A, 331–345 (2015). doi:10.1016/j.neucom.2014.04.078

  23. Tsai, C.F., Lin, W.C., Ke, S.W.: Big data mining with parallel computing: a comparison of distributed and MapReduce methodologies. J. Syst. Softw. 122, 83–92 (2016). doi:10.1016/j.jss.2016.09.007


  24. Wilson, D.R., Martinez, T.R.: Instance pruning techniques. In: Machine Learning: Proceedings of the Fourteenth International Conference (ICML97), pp. 404–411. Morgan Kaufmann (1997)

  25. Wu, X., Zhu, X., Wu, G.Q., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2014). doi:10.1109/TKDE.2013.109


  26. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10, 10–10 (2010)



Acknowledgements

This work was funded by the Ministry of Economy and Competitiveness, Project TIN 2015-67534-P.

Author information

Corresponding author

Correspondence to Álvar Arnaiz-González.

About this article

Cite this article

Arnaiz-González, Á., González-Rogel, A., Díez-Pastor, JF. et al. MR-DIS: democratic instance selection for big data by MapReduce. Prog Artif Intell 6, 211–219 (2017). https://doi.org/10.1007/s13748-017-0117-5

  • DOI: https://doi.org/10.1007/s13748-017-0117-5
