Big Data Optimisation Among RDDs Persistence in Apache Spark

  • Conference paper
Big Data, Cloud and Applications (BDCA 2018)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 872)

Abstract

Nowadays, many actors in digital technology produce vast amounts of data from sources such as social networks, connected objects, e-commerce, and radars. The technologies that generate this data cause it to grow quickly. To exploit this data efficiently and durably, it is important to respect the dynamics of its chronological evolution. For fast and reliable processing, powerful technologies have been designed to analyze large data. Apache Spark is designed for fast and sophisticated processing, but when it has to process a huge amount of data, Spark becomes slower until it no longer has enough memory to process the data, and it has to pay for more memory consumption. In this paper, we highlight the implementation of the Apache Spark framework. Thereafter, we conduct experimental simulations to show the weakness of Apache Spark. Finally, to further reinforce our contribution, we propose to persist RDDs (Resilient Distributed Datasets) in order to improve performance when computing data.
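
As a rough illustration of why persisting a dataset can help, the following sketch models a lazily evaluated dataset that is recomputed from its lineage on every action unless it has been persisted. It is a single-process toy written for this summary, not the paper's experiment and not the Spark API: the names `ToyRDD`, `run_two_actions`, and `evaluations` are hypothetical. In Spark itself, the corresponding calls are `persist()` and `cache()` on an RDD.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: lazy, and recomputed on each action unless persisted."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute              # lineage: how to rebuild the data
        self._persist = False
        self._cached: Optional[List[int]] = None
        self.evaluations = 0                 # number of times the data was computed

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Transformation: only records the lineage, computes nothing yet.
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def persist(self) -> "ToyRDD":
        # Ask for the result of the first computation to be kept in memory.
        self._persist = True
        return self

    def collect(self) -> List[int]:
        # Action: reuse the stored copy if there is one, else recompute.
        if self._cached is not None:
            return list(self._cached)
        self.evaluations += 1
        data = self._compute()
        if self._persist:
            self._cached = data
        return list(data)


def run_two_actions(persist: bool) -> int:
    """Run two actions on the same dataset and report how often it was computed."""
    squared = ToyRDD(lambda: list(range(5))).map(lambda x: x * x)
    if persist:
        squared.persist()
    total = sum(squared.collect())     # first action
    largest = max(squared.collect())   # second action on the same dataset
    assert (total, largest) == (30, 16)
    return squared.evaluations
```

Calling `run_two_actions(persist=False)` computes the mapped dataset once per action, while `run_two_actions(persist=True)` computes it once and serves the second action from the stored copy. The toy ignores memory limits, storage levels, and distribution, which are the factors the paper's experiments concern.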

Author information

Corresponding author

Correspondence to Khadija Aziz.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Aziz, K., Zaidouni, D., Bellafkih, M. (2018). Big Data Optimisation Among RDDs Persistence in Apache Spark. In: Tabii, Y., Lazaar, M., Al Achhab, M., Enneya, N. (eds) Big Data, Cloud and Applications. BDCA 2018. Communications in Computer and Information Science, vol 872. Springer, Cham. https://doi.org/10.1007/978-3-319-96292-4_3

  • DOI: https://doi.org/10.1007/978-3-319-96292-4_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-96291-7

  • Online ISBN: 978-3-319-96292-4

  • eBook Packages: Computer Science, Computer Science (R0)
