TEES: topology-aware execution environment service for fast and agile application deployment in HPC

Shao, Mingtian; Lu, Kai; Chi, Wanqing; Wang, Ruibo; Dai, Yiqin; Zhang, Wenzhe

doi:10.1631/FITEE.2100284

Published: 01 June 2022

Volume 23, pages 1631–1645 (2022)

Frontiers of Information Technology & Electronic Engineering

Abstract

High-performance computing (HPC) systems are about to reach a new height: exascale. At this scale, application deployment is becoming an increasingly prominent problem. Container technology solves the problems of encapsulating and migrating applications together with their execution environments. However, container images are large, and deploying an image to a large number of compute nodes is time-consuming. Although the peer-to-peer (P2P) approach improves transmission efficiency, it also increases network load. All of these issues lead to high application startup latency. To solve these problems, we propose the topology-aware execution environment service (TEES) for fast and agile application deployment on HPC systems. TEES creates a more lightweight execution environment for users and uses a more efficient topology-aware P2P approach to reduce deployment time. Combined with a split-step transport and launch-in-advance mechanism, TEES reduces application startup latency. On the Tianhe HPC system, TEES deploys and starts a typical application on 17 560 compute nodes within 3 s. Compared to container-based application deployment, the speed increases 12-fold and the network load is reduced by 85%.
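
The abstract does not describe how the topology-aware P2P transfer is organized, so the following is only a minimal sketch of the general idea, not the TEES algorithm. It assumes a hypothetical two-level topology in which each node hangs off one switch: the source sends the payload once to a single "seed" node per remote switch, and each switch group then fans the payload out locally in a binary tree. The names `plan_transfers`, `count_cross_switch`, and the node and switch labels are all illustrative assumptions.

```python
from collections import defaultdict


def plan_transfers(source, node_switch):
    """Return a list of (sender, receiver) pairs covering every node.

    node_switch maps node name -> switch name (hypothetical topology model).
    Traffic crosses a switch boundary at most once per remote switch.
    """
    groups = defaultdict(list)
    for node, switch in node_switch.items():
        if node != source:
            groups[switch].append(node)

    transfers = []
    for switch in sorted(groups):
        members = sorted(groups[switch])
        if switch == node_switch[source]:
            seed = source  # the source already sits on this switch
        else:
            seed = members.pop(0)
            transfers.append((source, seed))  # the only cross-switch hop
        # Local binary fan-out: each holder forwards to at most two peers.
        holders = [seed]
        for i, node in enumerate(members):
            transfers.append((holders[i // 2], node))
            holders.append(node)
    return transfers


def count_cross_switch(transfers, node_switch):
    """Count transfers whose sender and receiver are on different switches."""
    return sum(1 for s, r in transfers if node_switch[s] != node_switch[r])


# Six nodes on two switches; n0 holds the execution environment initially.
topology = {"n0": "s0", "n1": "s0", "n2": "s0",
            "n3": "s1", "n4": "s1", "n5": "s1"}
plan = plan_transfers("n0", topology)
print(plan)
print("cross-switch transfers:", count_cross_switch(plan, topology))
```

In this toy layout only one of the five transfers crosses a switch boundary, whereas a topology-unaware P2P scheme that picks peers at random would typically send several copies across it. A single source feeding every remote seed would itself become a bottleneck at scale; a real system would presumably spread that step over a tree as well.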

Acknowledgements

The authors wish to thank Yong DONG and Hao HAN for their help with system debugging, and Zhenwei WU for improving the paper.

Author information

Authors and Affiliations

College of Computer, National University of Defense Technology, Changsha, 410073, China
Mingtian Shao (邵明天), Kai Lu (卢凯), Wanqing Chi (迟万庆), Ruibo Wang (王睿伯), Yiqin Dai (戴屹钦) & Wenzhe Zhang (张文喆)

Contributions

Mingtian SHAO designed the research. Kai LU, Wanqing CHI, Ruibo WANG, Yiqin DAI, and Wenzhe ZHANG improved the design. Mingtian SHAO and Wenzhe ZHANG implemented the system. Mingtian SHAO drafted the paper. Yiqin DAI helped organize the paper. Mingtian SHAO revised and finalized the paper.

Corresponding authors

Correspondence to Kai Lu (卢凯) or Wenzhe Zhang (张文喆).

Additional information

Compliance with ethics guidelines

Mingtian SHAO, Kai LU, Wanqing CHI, Ruibo WANG, Yiqin DAI, and Wenzhe ZHANG declare that they have no conflict of interest.

Project supported by the National Natural Science Foundation of China (No. 61902405), the Tianhe Supercomputer Project of China (No. 2018YFB0204301), the PDL Research Fund of China (No. 6142110190404), and the National High-Level Personnel for Defense Technology Program, China (No. 2017-JCJQ-ZQ-013).

About this article

Cite this article

Shao, M., Lu, K., Chi, W. et al. TEES: topology-aware execution environment service for fast and agile application deployment in HPC. Front Inform Technol Electron Eng 23, 1631–1645 (2022). https://doi.org/10.1631/FITEE.2100284

Received: 16 June 2021
Accepted: 24 October 2021
Published: 01 June 2022
Issue Date: November 2022
DOI: https://doi.org/10.1631/FITEE.2100284

Key words

CLC number

TP315
