Abstract
Today, many actors in digital technology produce massive volumes of data from sources such as social networks, connected objects, e-commerce, and radars. The technologies that generate this data cause it to grow quickly. To exploit this data efficiently and durably, it is important to respect the dynamics of its chronological evolution. Powerful technologies have been designed to analyze large data quickly and reliably. Apache Spark is designed for fast and sophisticated processing, but when it has to process a huge amount of data, it slows down until it no longer has enough memory to process the data and has to pay for additional memory consumption. In this paper, we highlight the implementation of the Apache Spark framework. We then conduct experimental simulations to show the weakness of Apache Spark. Finally, to reinforce our contribution, we propose persisting RDDs (Resilient Distributed Datasets) to improve the performance of data computation.
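To make the idea of persisting an RDD concrete, the following is a minimal sketch in plain Python, not Spark code and not the experimental setup of this paper. It models a lazily evaluated dataset whose lineage is recomputed on every action unless it has been persisted. The class `LazyDataset`, the helper `run_actions`, and the `evaluations` counter are hypothetical names introduced only for illustration; in Spark itself the corresponding calls are `persist()` and `cache()` on an RDD.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the data from its lineage
        self._persist = False     # whether the result should be kept after first use
        self._cached = None       # materialized data, once persisted
        self.evaluations = 0      # how many times the lineage was recomputed

    def map(self, fn):
        # A transformation is lazy: it only records how to derive the new data.
        parent = self
        return LazyDataset(lambda: [fn(x) for x in parent._materialize()])

    def persist(self):
        # Mark the dataset so the first materialization is kept in memory.
        self._persist = True
        return self

    def _materialize(self):
        if self._persist and self._cached is not None:
            return self._cached
        self.evaluations += 1
        data = self._compute()
        if self._persist:
            self._cached = data
        return data

    # Actions trigger the actual computation.
    def count(self):
        return len(self._materialize())

    def sum(self):
        return sum(self._materialize())


def run_actions(persist):
    """Run two actions on the same dataset and report how often it was computed."""
    base = LazyDataset(lambda: list(range(1000)))
    squared = base.map(lambda x: x * x)
    if persist:
        squared.persist()
    squared.count()
    squared.sum()
    return squared.evaluations
```

In this toy model, two actions on an unpersisted dataset recompute its lineage twice, while a persisted dataset is computed once and then reused. The model leaves out memory cost: keeping the materialized data in memory is exactly the trade-off that a real storage-level choice has to weigh.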
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Aziz, K., Zaidouni, D., Bellafkih, M. (2018). Big Data Optimisation Among RDDs Persistence in Apache Spark. In: Tabii, Y., Lazaar, M., Al Achhab, M., Enneya, N. (eds) Big Data, Cloud and Applications. BDCA 2018. Communications in Computer and Information Science, vol 872. Springer, Cham. https://doi.org/10.1007/978-3-319-96292-4_3
DOI: https://doi.org/10.1007/978-3-319-96292-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96291-7
Online ISBN: 978-3-319-96292-4
eBook Packages: Computer Science, Computer Science (R0)