A comparative performance study of spark on kubernetes

The Journal of Supercomputing

Abstract

Kubernetes makes it easier to automate the deployment and scaling of containerized applications while achieving near-native performance. However, there is still a lack of systematic studies of how Spark applications perform on Kubernetes. In this paper, we first propose a model that captures the execution behavior of tasks, stages, and jobs, and present a prototype system based on the model. The system is used to collect and analyze performance and system metrics such as execution time and CPU utilization. Second, using a variety of Spark applications, we evaluate the performance of Spark on Kubernetes by comparing it with its baseline, i.e., Spark on bare metal. Based on this comparison and with the help of the system, we locate the stages of these applications that suffer performance loss on Kubernetes, and then reveal the root causes of the loss by analyzing their workflows, execution times, and system resource costs. Through extensive measurements, we find that the performance loss of Spark on Kubernetes relative to its baseline ranges from -2.9% to 83.9%, where a negative value means that Spark on Kubernetes outperforms the baseline. We identify several root causes of both the performance loss and the benefits of Spark on Kubernetes. First, the deterioration of data locality caused by pods is a crucial root cause of the loss. To address this problem, we propose an approach that schedules tasks by taking both data locality and executor utilization into account. Experiments show that this approach improves the performance of Spark on Kubernetes by up to 32.2%. Second, lower CPU usage by executors is another root cause of the performance loss, even when executors have equivalent CPU configurations on Kubernetes and bare metal. In contrast, with the same memory configuration, executors use more memory on Kubernetes than on bare metal, which contributes to the performance benefit of Spark on Kubernetes in some stages. Our findings can help developers and researchers make informed decisions when deploying Spark applications on Kubernetes for better performance.
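
To make two ideas from the abstract concrete, the following Python sketch shows one plausible way to compute a relative performance loss and one way a scheduler might weigh data locality against executor utilization. Both are assumptions for illustration only: the abstract does not give the loss formula or the scheduling rule, so `relative_loss`, `Executor`, `pick_executor`, and the `locality_weight` parameter are hypothetical names, and the weighted-sum score is a stand-in rather than the paper's method. The locality labels are a subset of Spark's locality levels.

```python
from dataclasses import dataclass

# A subset of Spark's locality levels, best first; a lower rank means closer data.
LOCALITY_RANK = {"PROCESS_LOCAL": 0, "NODE_LOCAL": 1, "RACK_LOCAL": 2, "ANY": 3}


def relative_loss(t_kubernetes, t_bare_metal):
    """Percentage by which the Kubernetes run is slower than the baseline.

    Assumed definition; a negative result means the Kubernetes run was faster.
    """
    return (t_kubernetes - t_bare_metal) / t_bare_metal * 100.0


@dataclass
class Executor:
    name: str
    locality: str       # locality level this executor would give the task
    utilization: float  # fraction of its task slots in use, 0.0 to 1.0


def pick_executor(executors, locality_weight=0.5):
    """Hypothetical rule: choose the executor with the lowest weighted cost.

    The cost mixes normalized locality rank with current utilization, so a
    nearly idle executor can win over a busy one that holds closer data.
    """
    max_rank = max(LOCALITY_RANK.values())

    def cost(executor):
        locality_cost = LOCALITY_RANK[executor.locality] / max_rank
        return (locality_weight * locality_cost
                + (1.0 - locality_weight) * executor.utilization)

    return min(executors, key=cost)


if __name__ == "__main__":
    candidates = [
        Executor("exec-1", "NODE_LOCAL", 0.95),
        Executor("exec-2", "RACK_LOCAL", 0.10),
    ]
    print(pick_executor(candidates).name)
    print(round(relative_loss(97.1, 100.0), 1))
```

With equal weights, the lightly loaded `exec-2` is preferred even though `exec-1` has better locality; setting `locality_weight=1.0` reduces the rule to a pure locality preference.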

Notes

  1. All experiment scripts have been uploaded to https://github.com/zcp/Evaluation_of_Spark_on_kubernetes.

  2. In our experiments, CPU requests and CPU limits are assigned with the same value.
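
As an illustration of note 2, the sketch below builds `spark-submit` arguments in which the CPU request and CPU limit share one value. The four property names are the ones documented for Spark on Kubernetes; whether the authors set exactly these properties is an assumption, and the helper `cpu_conf_args` is a hypothetical name introduced here.

```python
def cpu_conf_args(executor_cores, driver_cores):
    """Build spark-submit --conf arguments with CPU request equal to CPU limit."""
    settings = {
        "spark.kubernetes.executor.request.cores": executor_cores,
        "spark.kubernetes.executor.limit.cores": executor_cores,
        "spark.kubernetes.driver.request.cores": driver_cores,
        "spark.kubernetes.driver.limit.cores": driver_cores,
    }
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Example values only; the core counts used in the experiments are not stated here.
    print(" ".join(cpu_conf_args("2", "1")))
```

Setting the request equal to the limit means a pod is never scheduled with less CPU than it may consume, which keeps the CPU configuration comparable to a fixed allocation on bare metal.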

Acknowledgements

We thank Jingyi Xu and Miaoyuan Liu for proofreading the original manuscript. The research was supported by the NSFC under grant No. 61702063, the Science and Technology Research Project of Chongqing Education Commission under grant No. KJQN202001118, and the China Scholarship Council under grant No. 201708505099.

Author information

Corresponding author

Correspondence to Changpeng Zhu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhu, C., Han, B. & Zhao, Y. A comparative performance study of spark on kubernetes. J Supercomput 78, 13298–13322 (2022). https://doi.org/10.1007/s11227-022-04381-y

  • DOI: https://doi.org/10.1007/s11227-022-04381-y
