Dynamic data replacement and adaptive scheduling policies in spark

Abstract

Improper data replacement and inappropriate selection of the job scheduling policy are important causes of slowdowns in Spark system operation, which directly degrade the performance of Spark parallel computing. In this paper, we analyze the existing caching mechanism of Spark and find that there is still room to optimize the existing caching policy. Through task structure analysis, the key information of Spark tasks is extracted to obtain the data and memory usage during task runtime. Based on this, an RDD weight calculation method is proposed that integrates the various factors affecting RDD usage and establishes an RDD weight model. On top of this model, a minimum-weight replacement algorithm based on RDD structure analysis is proposed. The algorithm ensures that the relatively more valuable data is cached in memory during data replacement. In addition, the default job scheduling algorithm of the Spark framework considers only a single factor, which cannot schedule jobs effectively and causes a waste of cluster resources. To solve this problem, an adaptive job scheduling policy based on job classification is proposed. The policy classifies job types and schedules resources more effectively for different types of jobs. The experimental results show that the proposed dynamic data replacement algorithm effectively improves Spark's memory utilization, and the proposed job-classification-based adaptive job scheduling algorithm effectively improves system resource utilization and shortens job completion time.
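
To make the idea of minimum-weight replacement concrete, the following sketch (in Scala, Spark's implementation language) shows one way such a policy could be structured. The CachedRdd fields, the weight formula, and the selectVictims helper are illustrative assumptions chosen for exposition, not the exact weight model or algorithm proposed in the paper.

```scala
// Minimal sketch of a minimum-weight RDD cache replacement policy.
// NOTE: the weight factors used here (recompute cost, size, remaining
// references, recency) and the way they are combined are illustrative
// assumptions, not the paper's exact weight model.

case class CachedRdd(
  id: Int,
  computeCostMs: Double, // estimated cost to recompute the RDD if evicted
  sizeBytes: Long,       // memory footprint of the cached partitions
  refCount: Int,         // downstream stages that still reference this RDD
  lastAccessMs: Long     // timestamp of the most recent access
)

object MinWeightReplacement {
  // Weight grows with recompute cost and remaining references and shrinks
  // with memory footprint and staleness, so cheap-to-rebuild, large,
  // rarely used RDDs are evicted first.
  def weight(r: CachedRdd, nowMs: Long): Double = {
    val stalenessSec = math.max(1.0, (nowMs - r.lastAccessMs) / 1000.0)
    (r.computeCostMs * (1 + r.refCount)) / (r.sizeBytes.toDouble * stalenessSec)
  }

  // Pick the lowest-weight cached RDDs until `requiredBytes` would be freed.
  def selectVictims(cache: Seq[CachedRdd], requiredBytes: Long, nowMs: Long): Seq[CachedRdd] = {
    val victims = scala.collection.mutable.ArrayBuffer.empty[CachedRdd]
    var freed = 0L
    for (r <- cache.sortBy(weight(_, nowMs)) if freed < requiredBytes) {
      victims += r
      freed += r.sizeBytes
    }
    victims.toSeq
  }
}
```

The design point the paper argues for is visible in this sketch: eviction decisions rank cached RDDs by a composite weight rather than by recency alone, so data that is expensive to recompute or still referenced by later stages tends to remain in memory.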

Acknowledgements

The work was supported by the Key Research and Development Plan of Hubei Province (No. 2020BAB102) and the Open Fund Project of the Chongqing Key Laboratory of Industrial and Information Technology of Electric Vehicle Safety Evaluation. Any opinions, findings, and conclusions are those of the authors and do not necessarily reflect the views of the above agencies.

Author information

Corresponding author

Correspondence to Chunlin Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, C., Cai, Q. & Luo, Y. Dynamic data replacement and adaptive scheduling policies in spark. Cluster Comput 25, 1421–1439 (2022). https://doi.org/10.1007/s10586-022-03541-2
