On the Use of Random Discretization and Dimensionality Reduction in Ensembles for Big Data

García-Gil, Diego; Ramírez-Gallego, Sergio; García, Salvador; Herrera, Francisco

doi:10.1007/978-3-319-92639-1_2

On the Use of Random Discretization and Dimensionality Reduction in Ensembles for Big Data

Diego García-Gil²⁰,
Sergio Ramírez-Gallego²⁰,
Salvador García²⁰ &
…
Francisco Herrera²⁰

Conference paper
First Online: 08 June 2018

2461 Accesses
1 Citations

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 10870))

Abstract

Massive data growth in recent years has made data reduction techniques to gain a special popularity because of their ability to reduce this enormous amount of data, also called Big Data. Random Projection Random Discretization is an innovative ensemble method. It uses two data reduction techniques to create more informative data, their proposed Random Discretization, and Random Projections (RP). However, RP has some shortcomings that can be solved by more powerful methods such as Principal Components Analysis (PCA). Aiming to tackle this problem, we propose a new ensemble method using the Apache Spark framework and PCA for dimensionality reduction, named Random Discretization Dimensionality Reduction Ensemble. In our experiments on five Big Data datasets, we show that our proposal achieves better prediction performance than the original algorithm and Random Forest.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

1.
IDC: The Digital Universe of Opportunities. 2018 [Online] Available: http://www.emc.com/infographics/digital-universe-2014.htm.
2.
https://spark-packages.org/package/djgarcia/RD2R.
3.
Apache Hadoop Project 2018 [Online] Available: https://hadoop.apache.org/.
4.
Apache Spark Project 2018 [Online] Available: https://spark.apache.org/.

References

Ahmad, A., Brown, G.: Random projection random discretization ensembles - ensembles of linear multivariate decision trees. IEEE Trans. Knowl. Data Eng. 26(5), 1225–1239 (2014)
Article Google Scholar
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Article Google Scholar
Dasgupta, S.: Experiments with random projection. In: Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, UAI 2000, pp. 143–151. Morgan Kaufmann Publishers Inc., San Francisco (2000)
Google Scholar
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Article Google Scholar
Dietterich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45014-9_1
Chapter Google Scholar
Fradkin, D., Madigan, D.: Experiments with random projections for machine learning. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2003, pp. 517–522. ACM, New York (2003)
Google Scholar
García, S., Luengo, J., Sáez, J., López, V., Herrera, F.: A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 25(4), 734–750 (2013)
Article Google Scholar
García, S., Luengo, J., Herrera, F.: Tutorial on practical tips of the most influential data preprocessing algorithms in data mining. Knowl. Syst. 98, 1–29 (2016)
Article Google Scholar
García, S., Luengo, J., Herrera, F.: Data Preprocessing in Data Mining. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-319-10247-4
Book Google Scholar
García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J.M., Herrera, F.: Big data preprocessing: methods and prospects. Big Data Anal. 1(1), 9 (2016)
Article Google Scholar
García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F.: A comparison on scalability for batch big data processing on Apache Spark and Apache Flink. Big Data Anal. 2(1), 11 (2017)
Article Google Scholar
Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26(189–206), 1 (1984)
MathSciNet MATH Google Scholar
Lin, J.: Mapreduce is good enough? If all you have is a hammer, throw away everything that’s not a nail!. Big Data 1(1), 28–37 (2013)
Article Google Scholar
Ramírez-Gallego, S., García, S., Benítez, J., Herrera, F.: A distributed evolutionary multivariate discretizer for big data processing on apache spark. Swarm Evolut. Comput. 38, 240–250 (2018)
Article Google Scholar
Ramírez-Gallego, S., Fernández, A., García, S., Chen, M., Herrera, F.: Big data: tutorial and guidelines on information and process fusion for analytics algorithms with mapreduce. Inf. Fusion 42, 51–61 (2018)
Article Google Scholar
del Río, S., López, V., Benítez, J.M., Herrera, F.: On the use of mapreduce for imbalanced big data using random forest. Inf. Sci. 285, 112–137 (2014)
Article Google Scholar
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI 2012, pp. 15–28. USENIX Association, Berkeley (2012)
Google Scholar

Download references

Acknowledgments

This contribution is supported by FEDER, the Spanish National Research Projects TIN2014-57251-P and TIN2017-89517-P, and the Project BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016.

Author information

Authors and Affiliations

Department of Computer Science and Artificial Intelligence, University of Granada, CITIC-UGR, 18071, Granada, Spain
Diego García-Gil, Sergio Ramírez-Gallego, Salvador García & Francisco Herrera

Authors

Diego García-Gil
View author publications
You can also search for this author in PubMed Google Scholar
Sergio Ramírez-Gallego
View author publications
You can also search for this author in PubMed Google Scholar
Salvador García
View author publications
You can also search for this author in PubMed Google Scholar
Francisco Herrera
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Diego García-Gil .

Editor information

Editors and Affiliations

Department of Mine Operating and Prospection, University of Oviedo, Oviedo, Spain
Francisco Javier de Cos Juez
Department of Computer Science, University of Oviedo, Oviedo, Spain
José Ramón Villar
Department of Computer Science, University of Oviedo, Oviedo, Spain
Enrique A. de la Cal
Department of Civil Engineering, University of Burgos, Burgos, Spain
Álvaro Herrero
University of A Coruña, A Coruña, Spain
Héctor Quintián
University of Salamanca, Salamanca, Spain
José António Sáez
University of Salamanca, Salamanca, Spain
Emilio Corchado

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2018). On the Use of Random Discretization and Dimensionality Reduction in Ensembles for Big Data. In: de Cos Juez, F., et al. Hybrid Artificial Intelligent Systems. HAIS 2018. Lecture Notes in Computer Science(), vol 10870. Springer, Cham. https://doi.org/10.1007/978-3-319-92639-1_2

Download citation

DOI: https://doi.org/10.1007/978-3-319-92639-1_2
Published: 08 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92638-4
Online ISBN: 978-3-319-92639-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics