Abstract
Improper data replacement and poor selection of the job scheduling policy are major causes of slowdown in Spark, directly degrading the performance of Spark's parallel computing. In this paper, we analyze Spark's existing caching mechanism and find that its caching policy leaves considerable room for optimization. Through task-structure analysis, key information about Spark tasks is extracted to obtain data and memory usage during task runtime; on this basis, an RDD weight calculation method is proposed that integrates the various factors affecting RDD usage and establishes an RDD weight model. Based on this model, a minimum-weight replacement algorithm driven by RDD structure analysis is proposed. The algorithm ensures that the relatively more valuable data is kept in memory during data replacement. In addition, the default job scheduling algorithm of the Spark framework considers only a single factor, so it cannot schedule jobs effectively and wastes cluster resources. To address this, an adaptive job scheduling policy based on job classification is proposed: the policy classifies jobs by type and schedules resources more effectively for each type of job. Experimental results show that the proposed dynamic data replacement algorithm effectively improves Spark's memory utilization, and that the proposed job-classification-based adaptive scheduling algorithm improves system resource utilization and shortens job completion time.
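To make the minimum-weight replacement idea concrete, the following is a minimal, hypothetical sketch, not the paper's actual weight model. It assumes (as is typical for such policies, though the paper's exact formula may differ) that an RDD's weight grows with its recomputation cost and reuse count and shrinks with its memory footprint; when the cache is full, the lowest-weight entry is evicted first.

```python
from dataclasses import dataclass

@dataclass
class CachedRDD:
    name: str
    size_mb: float        # memory footprint of the cached RDD
    compute_cost: float   # assumed cost to recompute the RDD from its lineage
    use_count: int        # how often downstream stages reuse it

def weight(r: CachedRDD) -> float:
    # Hypothetical weight: data that is expensive to recompute, frequently
    # reused, and small is the most valuable to keep cached.
    return r.compute_cost * r.use_count / r.size_mb

def evict_until_fits(cache: list[CachedRDD], capacity_mb: float,
                     incoming_mb: float) -> list[str]:
    """Evict minimum-weight RDDs until the incoming block fits in memory."""
    evicted = []
    used = sum(r.size_mb for r in cache)
    while cache and used + incoming_mb > capacity_mb:
        victim = min(cache, key=weight)   # lowest-weight entry goes first
        cache.remove(victim)
        used -= victim.size_mb
        evicted.append(victim.name)
    return evicted
```

The design choice this illustrates is that eviction order is driven by a composite value score rather than recency alone (as in Spark's default LRU behavior), so a small, heavily reused, expensive-to-recompute RDD survives even if it was touched long ago.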
Acknowledgements
This work was supported by the Key Research and Development Plan of Hubei Province (No. 2020BAB102) and the Open Fund Project of the Chongqing Key Laboratory of Industrial and Information Technology of Electric Vehicle Safety Evaluation. Any opinions, findings, and conclusions are those of the authors and do not necessarily reflect the views of the above agencies.
Cite this article
Li, C., Cai, Q. & Luo, Y. Dynamic data replacement and adaptive scheduling policies in spark. Cluster Comput 25, 1421–1439 (2022). https://doi.org/10.1007/s10586-022-03541-2