Abstract
Instance selection is a popular preprocessing task in knowledge discovery and data mining. Its purpose is to reduce the size of data sets maintaining their predictive capabilities. The usual emerging problem at this point is that these methods quite often suffer of high computational complexity, which becomes highly inconvenient for processing huge data sets. In this paper, a parallel implementation for the instance selection algorithm Democratic Instance Selection (DIS) is presented. The main advantages of the DIS algorithm turn out to be its computational complexity, linear in the number of instances, as well as its internal structure, intuitively parallelizable. The purpose of this paper is threefold: firstly, the design of the DIS algorithm by following the MapReduce model; secondly, its implementation in the popular big data framework Spark; and finally, its empirical comparison over large-scale data sets. The results show that the processing time is reduced in a linear manner as the number of Spark executors increases, what makes it suitable for big data applications. In addition, the algorithm is publicly accessible to the scientific community.
Similar content being viewed by others
Notes
The subset selected by the algorithm is indistinctly referred to as filtered or selected set in the present paper.
We recommend the work of S. García et al. [10] for readers interested in this field.
In Spark, the process located between the map and the reduce phases is usually referred to as shuffle.
Author: Alejandro González-Rogel, https://bitbucket.org/agr00095/tfg-alg.-seleccion-instancias-spark.
In the Spark framework, each worker node has one or more executors, each one of which completes a task. A processor was assigned to each executor in the experimental work.
In [12] the percentage of instances used for error estimation in massive data sets was \(0.1\%\), but the use of a parallel implementation of 1-NN permits an increase in this percentage, improving the precision of the estimation.
Google Cloud Platform: https://cloud.google.com/dataproc/.
References
Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, AFIPS ’67 (Spring), pp. 483–485. ACM, New York (1967). doi:10.1145/1465482.1465560
Angiulli, F., Folino, G.: Distributed nearest neighbor-based condensation of very large data sets. IEEE Trans. Knowl. Data Eng. 19(12), 1593–1606 (2007). doi:10.1109/TKDE.2007.190665
Arnaiz-González, Á., Díez-Pastor, J.F., Rodríguez, J.J., García-Osorio, C.I.: Instance selection of linear complexity for big data. Knowl. Based Syst. 107, 83–95 (2016). doi:10.1016/j.knosys.2016.05.056
Asimov, D.: The grand tour: a tool for viewing multidimensional data. SIAM J. Sci. Stat. Comput. 6(1), 128–143 (1985)
Brighton, H., Mellish, C.: Advances in instance selection for instance-based learning algorithms. Data Min. Knowl. Discov. 6(2), 153–172 (2002). doi:10.1023/A:1014043630878
Cano, J.R., Herrera, F., Lozano, M.: Stratification for scaling up evolutionary prototype selection. Pattern Recognit. Lett. 26(7), 953–963 (2005). doi:10.1016/j.patrec.2004.09.043
Chen, M., Mao, S., Liu, Y.: Big data: a survey. Mob. Netw. Appl. 19(2), 171–209 (2014). doi:10.1007/s11036-013-0489-0
de Haro-García, A., García-Pedrajas, N.: A divide-and-conquer recursive approach for scaling up instance selection algorithms. Data Min. Knowl. Discov. 18(3), 392–418 (2009). doi:10.1007/s10618-008-0121-2
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008). doi:10.1145/1327452.1327492
Garcia, S., Derrac, J., Cano, J., Herrera, F.: Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 417–435 (2012). doi:10.1109/TPAMI.2011.142
García, S., Luengo, J., Herrera, F.: Data Preprocessing in Data Mining. Springer, Berlin (2014)
García-Osorio, C., de Haro-García, A., García-Pedrajas, N.: Democratic instance selection: a linear complexity instance selection algorithm based on classifier ensemble concepts. Artif. Intell. 174(56), 410–441 (2010). doi:10.1016/j.artint.2010.01.001
Grama, A.Y., Gupta, A., Kumar, V.: Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Parallel Distrib. Technol. 1(3), 12–21 (1993). doi:10.1109/88.242438
Hart, P.: The condensed nearest neighbor rule (corresp.). IEEE Trans. Inf. Theory 14(3), 515–516 (1968)
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pp. 604–613. ACM, New York (1998). doi:10.1145/276698.276876
Laney, D.: 3-d data management: controlling data volume, velocity and variety, Technical Report META Group Research Note (2001)
Leyva, E., González, A., Pérez, R.: Three new instance selection methods based on local sets: a comparative study with several approaches from a bi-objective perspective. Pattern Recognit. 48(4), 1523–1537 (2015). doi:10.1016/j.patcog.2014.10.001
Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
Maillo, J., Ramírez, S., Triguero, I., Herrera, F.: kNN-IS: An iterative spark-based design of the k-nearest neighbors classifier for big data. Knowledge-Based Systems (2016). doi:10.1016/j.knosys.2016.06.012
Minelli, M., Chambers, M., Dhiraj, A.: Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. Wiley, London (2012). doi:10.1002/9781118562260.fmatter
Ramírez-Gallego, S., García, S., Mouriño Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Alonso-Betanzos, A., Benítez, J.M., Herrera, F.: Data discretization: taxonomy and big data challenge. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 6(1), 5–21 (2016). doi:10.1002/widm.1173
Triguero, I., Peralta, D., Bacardit, J., García, S., Herrera, F.: Mrpr: a mapreduce solution for prototype reduction in big data classification. Neurocomputing 150 Part A, 331–345 (2015). doi:10.1016/j.neucom.2014.04.078
Tsai, C.F., Lin, W.C., Ke, S.W.: Big data mining with parallel computing: a comparison of distributed and mapreduce methodologies. J. Syst. Softw. 122, 83–92 (2016). doi:10.1016/j.jss.2016.09.007
Wilson, D.R., Martinez, T.R.: Instance pruning techniques. In: Machine Learning: Proceedings of the Fourteenth International Conference (ICML97), pp. 404–411. Morgan Kaufmann (1997)
Wu, X., Zhu, X., Wu, G.Q., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2014). doi:10.1109/TKDE.2013.109
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10, 10–10 (2010)
Acknowledgements
This work was funded by the Ministry of Economy and Competitiveness, Project TIN 2015-67534-P.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Arnaiz-González, Á., González-Rogel, A., Díez-Pastor, JF. et al. MR-DIS: democratic instance selection for big data by MapReduce. Prog Artif Intell 6, 211–219 (2017). https://doi.org/10.1007/s13748-017-0117-5
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13748-017-0117-5