A comparative performance study of Spark on Kubernetes

Zhu, Changpeng; Han, Bo; Zhao, Yinliang

doi:10.1007/s11227-022-04381-y

Published: 16 March 2022

Volume 78, pages 13298–13322, (2022)

The Journal of Supercomputing

Abstract

Kubernetes makes it easier to automate the deployment and scaling of containerized applications while achieving near-native performance. However, systematic studies of how Spark applications perform on Kubernetes are still lacking. In this paper, we first propose a model that captures the execution behavior of tasks, stages, and jobs, and present a prototype system based on the model. The system is then used to collect and analyze performance and system metrics such as execution time and CPU utilization. Second, using various Spark applications, we evaluate the performance of Spark on Kubernetes by comparing it with its baseline, i.e., Spark on bare metal. Based on the comparison and using the system, we locate the stages of these applications that suffer performance loss on Kubernetes, and then reveal the root causes of the loss by analyzing their workflows, execution times, and system resource costs. Through extensive measurements, we find that Spark on Kubernetes falls behind its baseline by -2.9% to 83.9%. We identify several root causes of the performance loss, as well as of the performance benefits, of Spark on Kubernetes. First, the deterioration of data locality caused by pods is a crucial root cause of the loss. To address this problem, we propose an approach that schedules tasks by taking both data locality and executor utilization into account. Experiments show that this approach improves the performance of Spark on Kubernetes by up to 32.2%. Second, lower CPU usage by executors is another root cause of the performance loss, even when executors have an equivalent CPU configuration on Kubernetes and bare metal. In contrast, with the same memory configuration, executors use more memory on Kubernetes than on bare metal, which contributes to the performance benefit of Spark on Kubernetes in some stages.
Our research helps developers and researchers make informed decisions when deploying Spark applications on Kubernetes for better performance.
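
The abstract does not give the details of the proposed scheduling approach, so the following is only a minimal sketch of the general idea of weighing data locality against executor utilization when placing a task. It is not the authors' algorithm. The `Executor` class, the `LOCALITY_COST` table, the `weight` parameter, and the scoring formula are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set

# Hypothetical cost per locality level (lower is better). The level names
# loosely follow Spark's locality terminology; the numbers are assumptions.
LOCALITY_COST = {"NODE_LOCAL": 0.25, "ANY": 1.0}


@dataclass
class Executor:
    """Hypothetical view of an executor: where it runs and how busy it is."""
    name: str
    host: str
    busy_slots: int
    total_slots: int

    @property
    def utilization(self) -> float:
        # Fraction of task slots currently in use, between 0.0 and 1.0.
        return self.busy_slots / self.total_slots


def locality_of(preferred_hosts: Set[str], executor: Executor) -> str:
    # Simplification: only distinguish "data is on this host" from "anywhere".
    return "NODE_LOCAL" if executor.host in preferred_hosts else "ANY"


def pick_executor(preferred_hosts: Set[str],
                  executors: Iterable[Executor],
                  weight: float = 0.5) -> Optional[Executor]:
    """Pick the free executor with the lowest combined cost.

    `weight` (an assumed tuning knob) is the share given to locality;
    the remainder is given to utilization.
    """
    candidates = [e for e in executors if e.busy_slots < e.total_slots]
    if not candidates:
        return None  # every executor is full; the task has to wait

    def score(e: Executor) -> float:
        locality = LOCALITY_COST[locality_of(preferred_hosts, e)]
        return weight * locality + (1.0 - weight) * e.utilization

    return min(candidates, key=score)


# Example: the task's data lives on host1, but the executor there is busy.
busy_local = Executor("exec-a", "host1", busy_slots=3, total_slots=4)
idle_remote = Executor("exec-b", "host2", busy_slots=0, total_slots=4)
chosen = pick_executor({"host1"}, [busy_local, idle_remote], weight=0.8)
print(chosen.name)
```

With a high locality weight the busy but local executor is preferred; lowering the weight shifts the choice toward the idle remote executor. A scheduler that considers only locality or only load corresponds to the two extremes of this knob.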

Notes

All experiment scripts have been uploaded to https://github.com/zcp/Evaluation_of_Spark_on_kubernetes.
In our experiments, CPU requests and CPU limits are set to the same value.
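
As an illustration of the second note, the sketch below builds `spark-submit` arguments in which the executor CPU request and CPU limit carry the same value. The property names `spark.executor.cores`, `spark.kubernetes.executor.request.cores`, and `spark.kubernetes.executor.limit.cores` come from the Spark on Kubernetes documentation. The helper functions and the value `4` are assumptions for illustration and are not taken from the authors' scripts.

```python
def executor_cpu_conf(cores: int) -> dict:
    """Return Spark properties that give executors an equal CPU request and limit."""
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    value = str(cores)
    return {
        "spark.executor.cores": value,
        # Kubernetes CPU request for each executor pod.
        "spark.kubernetes.executor.request.cores": value,
        # Kubernetes CPU limit for each executor pod, kept equal to the request.
        "spark.kubernetes.executor.limit.cores": value,
    }


def to_submit_args(conf: dict) -> list:
    """Turn a property dict into a flat list of --conf arguments for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Hypothetical example: 4 CPU cores per executor.
print(" ".join(to_submit_args(executor_cpu_conf(4))))
```

Keeping the request equal to the limit means an executor pod is neither throttled below nor allowed to burst above the CPU amount it was scheduled with, which makes the comparison with a fixed bare-metal core count more direct.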

Acknowledgements

Thanks to Jingyi Xu and Miaoyuan Liu for proofreading the original manuscript. The research was supported by the NSFC under grant No. 61702063, the Science and Technology Research Project of Chongqing Education Commission under grant No. KJQN202001118, and the China Scholarship Council under grant No. 201708505099.

Author information

Authors and Affiliations

Department of Data Science and Big Data, Chongqing University of Technology, Pufu Avenue, 401135, Chongqing, China
Changpeng Zhu
School of Computer Science, Xi’an Jiaotong University, West Xianning Road, Xi’an, 710049, Shaanxi, China
Bo Han & Yinliang Zhao

Corresponding author

Correspondence to Changpeng Zhu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhu, C., Han, B. & Zhao, Y. A comparative performance study of spark on kubernetes. J Supercomput 78, 13298–13322 (2022). https://doi.org/10.1007/s11227-022-04381-y

Accepted: 18 January 2022
Published: 16 March 2022
Issue Date: July 2022
DOI: https://doi.org/10.1007/s11227-022-04381-y
