LCS: An Efficient Data Eviction Strategy for Spark

Geng, Yuanzhen; Shi, Xuanhua; Pei, Cheng; Jin, Hai; Jiang, Wenbin

doi:10.1007/s10766-016-0470-1

Published: 02 November 2016

Volume 45, pages 1285–1297, (2017)

International Journal of Parallel Programming

Yuanzhen Geng¹, Xuanhua Shi¹ (ORCID: orcid.org/0000-0001-8451-8656), Cheng Pei¹, Hai Jin¹ & Wenbin Jiang¹

Abstract

As an in-memory distributed computing system, Spark is often used to speed up iterative applications. It caches intermediate data generated by previous iterations in memory, so the data need not be regenerated when they are reused later. This mechanism of sharing cached data in memory makes Spark much faster than other systems. When the memory used for caching data reaches its capacity limit, data are evicted to make space for new data, and the evicted data must be recovered when they are used again. However, classical strategies are not aware of the recovery cost, which can degrade system performance. This paper shows that recovery costs differ significantly in Spark, so a cost-aware eviction strategy can clearly reduce the total recovery cost. To this end, a strategy named LCS is proposed, which obtains dependency information between cached data by analyzing the application and calculates the recovery cost at run time. By predicting how many times cached data will be reused and using this prediction to weight the recovery cost, LCS always evicts the data that lead to the minimum future recovery cost. Experimental results show that this approach achieves better performance when memory space is insufficient, reducing the total execution time by 30–50%.
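The eviction idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use any Spark API: the `CachedBlock` class, its fields, the block names, and the choice to weight the cost as recovery time multiplied by the predicted number of reuses are all assumptions made for illustration, since the abstract does not give the exact formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedBlock:
    """Hypothetical description of one cached block (not a Spark class)."""
    block_id: str
    size_mb: float           # memory freed if this block is evicted
    recovery_cost_s: float   # assumed time to regenerate the block
    predicted_reuses: int    # assumed number of future reads


def weighted_cost(block: CachedBlock) -> float:
    # Assumption: future recovery cost = one recovery * predicted reuses.
    return block.recovery_cost_s * block.predicted_reuses


def choose_victims(blocks, needed_mb):
    """Pick blocks to evict, lowest weighted recovery cost first."""
    victims = []
    freed = 0.0
    for block in sorted(blocks, key=weighted_cost):
        if freed >= needed_mb:
            break  # enough space has been reclaimed
        victims.append(block)
        freed += block.size_mb
    return victims


# Illustrative values only; they are not measurements from the paper.
cache = [
    CachedBlock("rdd_1_0", size_mb=64, recovery_cost_s=12.0, predicted_reuses=3),
    CachedBlock("rdd_2_0", size_mb=64, recovery_cost_s=2.0, predicted_reuses=1),
    CachedBlock("rdd_3_0", size_mb=64, recovery_cost_s=30.0, predicted_reuses=0),
]
evicted_ids = [b.block_id for b in choose_victims(cache, needed_mb=100)]
print(evicted_ids)  # ['rdd_3_0', 'rdd_2_0']
```

In this sketch the block that is most expensive to rebuild (`rdd_3_0`) is evicted first because it is predicted never to be reused, which a strategy based on recency alone could not tell. The sketch ignores block size when ranking; a real policy might also normalize the weighted cost by the space each eviction frees.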

Notes

We have made LCS open source on GitHub and have also added the patch at the Apache Software Foundation. The web links are https://github.com/SCTS/Spark-LCS and https://issues.apache.org/jira/browse/SPARK-14289, respectively.

Acknowledgments

This paper is partly supported by the NSFC under Grant Nos. 61433019 and 61370104, the International Science and Technology Cooperation Program of China under Grant No. 2015DFE12860, and the National 863 Hi-Tech Research and Development Program under Grant No. 2014AA01A301.

Author information

Authors and Affiliations

Services Computing Technology and System Lab & Cluster and Grid Computing Lab & Big Data Technology and System Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
Yuanzhen Geng, Xuanhua Shi, Cheng Pei, Hai Jin & Wenbin Jiang

Corresponding author

Correspondence to Xuanhua Shi.

About this article

Cite this article

Geng, Y., Shi, X., Pei, C. et al. LCS: An Efficient Data Eviction Strategy for Spark. Int J Parallel Prog 45, 1285–1297 (2017). https://doi.org/10.1007/s10766-016-0470-1

Received: 18 October 2016
Accepted: 26 October 2016
Published: 02 November 2016
Issue Date: December 2017
DOI: https://doi.org/10.1007/s10766-016-0470-1
