Abstract
Improper data replacement and poor selection of the job scheduling policy are major causes of slowdown in Spark, directly degrading the performance of Spark's parallel computing. In this paper, we analyze Spark's existing caching mechanism and find that its caching policy leaves considerable room for optimization. Through task-structure analysis, key information about Spark tasks is extracted to obtain data and memory usage during task runtime; on this basis, an RDD weight calculation method is proposed that integrates the various factors affecting RDD usage and establishes an RDD weight model. Based on this model, a minimum-weight replacement algorithm driven by RDD structure analysis is proposed. The algorithm ensures that the relatively more valuable data is kept in memory during data replacement. In addition, the default job scheduling algorithm of the Spark framework considers only a single factor, so it cannot schedule jobs effectively and wastes cluster resources. To address this, an adaptive job scheduling policy based on job classification is proposed: the policy classifies jobs by type and schedules resources more effectively for each type of job. Experimental results show that the proposed dynamic data replacement algorithm effectively improves Spark's memory utilization, and that the proposed job-classification-based adaptive scheduling algorithm improves system resource utilization and shortens job completion time.
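To make the minimum-weight replacement idea concrete, the following is a minimal, hypothetical sketch, not the paper's actual weight model. It assumes (as is typical for such policies, though the paper's exact formula may differ) that an RDD's weight grows with its recomputation cost and reuse count and shrinks with its memory footprint; when the cache is full, the lowest-weight entry is evicted first.

```python
from dataclasses import dataclass

@dataclass
class CachedRDD:
    name: str
    size_mb: float        # memory footprint of the cached RDD
    compute_cost: float   # assumed cost to recompute the RDD from its lineage
    use_count: int        # how often downstream stages reuse it

def weight(r: CachedRDD) -> float:
    # Hypothetical weight: data that is expensive to recompute, frequently
    # reused, and small is the most valuable to keep cached.
    return r.compute_cost * r.use_count / r.size_mb

def evict_until_fits(cache: list[CachedRDD], capacity_mb: float,
                     incoming_mb: float) -> list[str]:
    """Evict minimum-weight RDDs until the incoming block fits in memory."""
    evicted = []
    used = sum(r.size_mb for r in cache)
    while cache and used + incoming_mb > capacity_mb:
        victim = min(cache, key=weight)   # lowest-weight entry goes first
        cache.remove(victim)
        used -= victim.size_mb
        evicted.append(victim.name)
    return evicted
```

The design choice this illustrates is that eviction order is driven by a composite value score rather than recency alone (as in Spark's default LRU behavior), so a small, heavily reused, expensive-to-recompute RDD survives even if it was touched long ago.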
Acknowledgements
This work was supported by the Key Research and Development Plan of Hubei Province (No. 2020BAB102) and the Open Fund Project of the Chongqing Key Laboratory of Industrial and Information Technology of Electric Vehicle Safety Evaluation. Any opinions, findings, and conclusions are those of the authors and do not necessarily reflect the views of the above agencies.
Cite this article
Li, C., Cai, Q. & Luo, Y. Dynamic data replacement and adaptive scheduling policies in spark. Cluster Comput 25, 1421–1439 (2022). https://doi.org/10.1007/s10586-022-03541-2