LCS: An Efficient Data Eviction Strategy for Spark

International Journal of Parallel Programming

Abstract

As an in-memory distributed computing system, Spark is often used to speed up iterative applications. It caches intermediate data generated by earlier iterations in memory, so the data need not be regenerated when they are reused later. This mechanism of sharing cached data in memory makes Spark much faster than other systems. When the memory used for caching reaches its capacity limit, data are evicted to make room for new data, and the evicted data must be recovered when they are used again. However, classical eviction strategies are not aware of recovery cost, which can degrade system performance. This paper shows that recovery costs differ significantly in Spark, so a cost-aware eviction strategy can clearly reduce the total recovery cost. To this end, a strategy named LCS is proposed, which obtains dependency information between cached data by analyzing the application and calculates the recovery cost at runtime. By predicting how many times cached data will be reused and using this prediction to weight the recovery cost, LCS always evicts the data that lead to the minimum future recovery cost. Experimental results show that this approach achieves better performance when memory is insufficient, reducing the total execution time by 30–50%.
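
The selection rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `CachedBlock` type, the `choose_victims` helper, the field names, and the sample numbers are all hypothetical, and the weighting is assumed to be a simple product of recovery cost and predicted reuse count, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass
class CachedBlock:
    name: str
    size_mb: float          # memory occupied by the cached block
    recovery_cost: float    # assumed: seconds needed to recompute the block
    predicted_reuses: int   # assumed: how many more times the block will be read

    @property
    def weighted_cost(self) -> float:
        # Expected future recovery cost if this block is evicted now
        # (assumed to be cost multiplied by predicted reuse count).
        return self.recovery_cost * self.predicted_reuses


def choose_victims(blocks, needed_mb):
    """Pick blocks to evict, lowest weighted cost first, until needed_mb is freed."""
    victims, freed = [], 0.0
    for block in sorted(blocks, key=lambda b: b.weighted_cost):
        if freed >= needed_mb:
            break
        victims.append(block)
        freed += block.size_mb
    return victims


# Hypothetical cache contents, for illustration only.
cache = [
    CachedBlock("rdd_a", size_mb=100, recovery_cost=8.0, predicted_reuses=3),
    CachedBlock("rdd_b", size_mb=100, recovery_cost=20.0, predicted_reuses=0),
    CachedBlock("rdd_c", size_mb=100, recovery_cost=2.0, predicted_reuses=5),
]
victim_names = [b.name for b in choose_victims(cache, needed_mb=150)]
print(victim_names)  # ['rdd_b', 'rdd_c']
```

In this example `rdd_b` is the most expensive block to recompute, yet it is evicted first because it is not expected to be reused. A recency-based policy such as LRU has no access to that information.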

Notes

  1. We have made LCS open source on GitHub and have also submitted the patch to the Apache Software Foundation. The links are https://github.com/SCTS/Spark-LCS and https://issues.apache.org/jira/browse/SPARK-14289, respectively.

Acknowledgments

This paper is partly supported by the NSFC under Grant Nos. 61433019 and 61370104, the International Science and Technology Cooperation Program of China under Grant No. 2015DFE12860, and the National 863 Hi-Tech Research and Development Program under Grant No. 2014AA01A301.

Author information

Corresponding author

Correspondence to Xuanhua Shi.

About this article

Cite this article

Geng, Y., Shi, X., Pei, C. et al. LCS: An Efficient Data Eviction Strategy for Spark. Int J Parallel Prog 45, 1285–1297 (2017). https://doi.org/10.1007/s10766-016-0470-1

  • DOI: https://doi.org/10.1007/s10766-016-0470-1
