Memory management optimization strategy in Spark framework based on less contention

Song, Yixin; Yu, Junyang; Wang, JinJiang; He, Xin

doi:10.1007/s11227-022-04663-5

Published: 27 July 2022

Volume 79, pages 1504–1525, (2023)

The Journal of Supercomputing

Yixin Song^1,2, Junyang Yu^1,2 (ORCID: orcid.org/0000-0003-0151-580X), JinJiang Wang^1,2 & Xin He^1,2

Abstract

The parallel computing framework Spark 2.x adopts a unified memory management model. Under a memory bottleneck, the memory allocated to active tasks and the RDD (Resilient Distributed Dataset) cache contend for the same memory, which may reduce computing resource utilization and weaken the acceleration gained from persistence, thus lowering program execution efficiency. To this end, we propose a less contention management strategy, abbreviated as MCM, to reduce the negative impact of memory contention. MCM consists of two steps. First, the task minimum memory priority guarantee algorithm gives priority to satisfying the minimum resources that tasks need for execution, in order to optimize the number of active tasks. Second, taking contention costs into account, the persisted location selection algorithm dynamically selects the best storage location to improve the effect of persistence acceleration. The experimental results confirm that MCM has good adaptability and scalability. Under a serious memory bottleneck, MCM clearly reduces job execution time. Compared with similar works, such as only_memory, only_disk, memory_and_disk, DMAOM and SACM, MCM reduces the execution time by 28.3%.
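
As background to the contention described above, the following is a minimal Python sketch of how the stock unified memory model in Spark sizes its regions. It assumes the default settings (300 MB reserved, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5) and the default rule that each of N active tasks may hold between 1/(2N) and 1/N of the execution pool. The function names are hypothetical, borrowing between the two regions is ignored, and this is not the MCM algorithm itself, whose details are not given on this page.

```python
# Simplified model of Spark's default unified memory layout (not MCM).
# Function names are illustrative and do not exist in Spark.

RESERVED_MB = 300  # memory Spark sets aside before splitting the heap


def unified_memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Return the sizes (in MB) of the unified, storage and execution regions."""
    if heap_mb <= RESERVED_MB:
        raise ValueError("heap must be larger than the reserved memory")
    unified = (heap_mb - RESERVED_MB) * memory_fraction
    storage = unified * storage_fraction   # region used by cached RDD blocks
    execution = unified - storage          # region used by running tasks
    return {"unified": unified, "storage": storage, "execution": execution}


def per_task_execution_bounds(pool_mb, active_tasks):
    """Return (minimum, maximum) execution memory per task for N active tasks."""
    if active_tasks < 1:
        raise ValueError("at least one active task is required")
    return pool_mb / (2 * active_tasks), pool_mb / active_tasks


if __name__ == "__main__":
    regions = unified_memory_regions(4096)
    low, high = per_task_execution_bounds(regions["execution"], 4)
    print(regions)
    print(f"per-task execution memory: {low:.2f} MB to {high:.2f} MB")
```

With a 4 GB executor heap and four active tasks, the cache and the tasks each start with roughly half of the unified region, so a growing RDD cache and memory-hungry tasks press against the same boundary. That boundary is where the contention discussed in the abstract arises.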

Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author or the first author on reasonable request.

Acknowledgements

This work was supported by Henan Province Science and Technology R&D Project (Grant No: 212102210078) and Henan Province Major Science and Technology Project (Grant No: 201300210400).

Author information

Authors and Affiliations

School of Software, Henan University, Kaifeng, 475000, China
Yixin Song, Junyang Yu, JinJiang Wang & Xin He
Intelligent Data Processing Engineering Research Center of Henan Province, Kaifeng, China
Yixin Song, Junyang Yu, JinJiang Wang & Xin He

Corresponding author

Correspondence to Junyang Yu.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Song, Y., Yu, J., Wang, J. et al. Memory management optimization strategy in Spark framework based on less contention. J Supercomput 79, 1504–1525 (2023). https://doi.org/10.1007/s11227-022-04663-5

Accepted: 17 June 2022
Published: 27 July 2022
Issue Date: February 2023
DOI: https://doi.org/10.1007/s11227-022-04663-5
