Big Data Optimisation Among RDDs Persistence in Apache Spark

Aziz, Khadija; Zaidouni, Dounia; Bellafkih, Mostafa

doi:10.1007/978-3-319-96292-4_3


Part of the book series: Communications in Computer and Information Science (CCIS, volume 872)

Included in the following conference series:

International Conference on Big Data, Cloud and Applications


Abstract

Nowadays, many actors in digital technology produce vast amounts of data from sources such as social networks, connected objects, e-commerce, and radars. A range of technologies generate this data, whose volume grows quickly. To exploit this data efficiently and durably, it is important to respect the dynamics of its chronological evolution. For fast and reliable processing, powerful technologies have been designed to analyze large data sets. Apache Spark is designed for fast and sophisticated processing, but when it has to process a huge amount of data, Spark slows down until it no longer has enough memory to process the data, and it pays the cost of higher memory consumption. In this paper, we highlight the implementation of the Apache Spark framework. We then conduct experimental simulations to show the weakness of Apache Spark. Finally, to reinforce our contribution, we propose persisting RDDs (Resilient Distributed Datasets) in order to improve performance when computing data.
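The idea behind persisting RDDs can be illustrated with a small toy model in plain Python. This is only a sketch under stated assumptions, not the authors' experimental code and not Spark's API: the `ToyRDD` class and the `run_actions` helper are hypothetical names introduced here. The sketch mimics a lazily evaluated dataset whose lineage is recomputed on every action unless it has been marked as persisted.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD: lazy, recomputed on each action."""

    def __init__(self, compute: Callable[[], List[int]]) -> None:
        self._compute = compute              # lineage: how to rebuild the data
        self._persisted = False              # set by persist()
        self._cache: Optional[List[int]] = None
        self.evaluations = 0                 # times the lineage was recomputed

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Transformations are lazy: nothing runs until an action is called.
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def persist(self) -> "ToyRDD":
        # Mark the dataset so the first action keeps its result in memory.
        self._persisted = True
        return self

    def collect(self) -> List[int]:
        # Action: reuse the cached result if present, otherwise recompute.
        if self._cache is not None:
            return list(self._cache)
        self.evaluations += 1
        result = self._compute()
        if self._persisted:
            self._cache = result
        return list(result)

    def count(self) -> int:
        return len(self.collect())

    def sum(self) -> int:
        return sum(self.collect())


def run_actions(persist: bool) -> int:
    """Run three actions on one dataset and report how often it was computed."""
    base = ToyRDD(lambda: list(range(1, 6)))
    squared = base.map(lambda x: x * x)
    if persist:
        squared.persist()
    squared.count()
    squared.sum()
    squared.collect()
    return squared.evaluations
```

In this toy model, three actions trigger three recomputations without persistence and a single computation with it. In Spark itself the corresponding calls are `persist()` and `cache()` on an RDD; the memory held by the cached result is the trade-off against the saved recomputation.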



Author information

Authors and Affiliations

Networks, Informatics and Mathematics department, National Institute of Posts and Telecommunications, Rabat, Morocco
Khadija Aziz, Dounia Zaidouni & Mostafa Bellafkih


Corresponding author

Correspondence to Khadija Aziz.

Editor information

Editors and Affiliations

Abdelmalek Essaâdi University, Tétouan, Morocco
Youness Tabii
Abdelmalek Essaâdi University, Tétouan, Morocco
Mohamed Lazaar
Abdelmalek Essaâdi University, Tétouan, Morocco
Mohammed Al Achhab
Université Ibn-Tofail, Tétouan, Morocco
Nourddine Enneya


About this paper

Cite this paper

Aziz, K., Zaidouni, D., Bellafkih, M. (2018). Big Data Optimisation Among RDDs Persistence in Apache Spark. In: Tabii, Y., Lazaar, M., Al Achhab, M., Enneya, N. (eds) Big Data, Cloud and Applications. BDCA 2018. Communications in Computer and Information Science, vol 872. Springer, Cham. https://doi.org/10.1007/978-3-319-96292-4_3


DOI: https://doi.org/10.1007/978-3-319-96292-4_3
Published: 14 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96291-7
Online ISBN: 978-3-319-96292-4
eBook Packages: Computer Science, Computer Science (R0)
