Scalability and efficiency challenges for the exascale supercomputing system: practice of a parallel supporting environment on the Sunway exascale prototype system

面对E级超算系统的可扩展性和效率挑战: 神威E级原型系统并行支撑环境的实践

Frontiers of Information Technology & Electronic Engineering

Abstract

With the continuous improvement of supercomputer performance and the integration of artificial intelligence with traditional scientific computing, the scale of applications is gradually increasing from millions to tens of millions of computing cores, which makes it very challenging to achieve high scalability and efficiency for parallel applications on super-large-scale systems. Taking the Sunway exascale prototype system as an example, in this paper we first analyze the challenges of high scalability and high efficiency for parallel applications in the exascale era. To overcome these challenges, we highlight the optimization technologies used in the parallel supporting environment software on the Sunway exascale prototype system, including the parallel operating system, input/output (I/O) optimization technology, ultra-large-scale parallel debugging technology, 10-million-core parallel algorithms, and the mixed-precision method. The parallel operating system and I/O optimization technology mainly support large-scale system scaling, while the ultra-large-scale parallel debugging technology, 10-million-core parallel algorithms, and mixed-precision method mainly enhance the efficiency of large-scale applications. Finally, we introduce the contributions to various applications running on the Sunway exascale prototype system, which verify the effectiveness of the parallel supporting environment design.

摘要

随着超级计算机性能不断提高,人工智能与传统科学计算的进一步融合,应用的并行规模逐渐增加,从数百万个计算核心到数千万个计算核心,这对超大规模系统上实现并行应用的高可扩展性和高效率提出巨大挑战。本文首先以神威E级原型系统为例,分析了E级时代并行应用的高可扩展性和高效率面临的挑战。为克服这些挑战,重点介绍了神威E级原型系统上并行支撑环境软件的优化技术,包括并行操作系统、I/O优化技术、超大规模并行调试技术、千万核心并行算法、混合精度方法等。并行操作系统和I/O优化技术主要支持大规模系统扩展,而超大规模并行调试技术、千万核心并行算法和混合精度方法主要提升大规模应用的效率。最后,介绍了运行在神威E级原型系统上的应用程序取得的重要成果,从而验证了并行支撑环境设计的有效性。

Data availability

The data that support the findings of this study are available from the corresponding authors upon reasonable request.

Author information

Contributions

Xin LIU and Dexun CHEN designed the research. Heng GUO, Yuling YANG, and Jie GAO performed the simulations. Yunlong FENG and Longde CHEN analyzed the results. Xiaobin HE and Xin CHEN drafted the paper. Xiaona DIAO and Zuoning CHEN helped organize the paper. Xiaobin HE and Xin CHEN revised and finalized the paper.

Corresponding authors

Correspondence to Xin Liu (刘鑫) or Dexun Chen (陈德训).

Additional information

Compliance with ethics guidelines

Xiaobin HE, Xin CHEN, Heng GUO, Xin LIU, Dexun CHEN, Yuling YANG, Jie GAO, Yunlong FENG, Longde CHEN, Xiaona DIAO, and Zuoning CHEN declare that they have no conflict of interest.

Xiaobin HE received his BE degree from the Harbin Institute of Technology, Harbin, China, in 2006, and his MS degree from Shanghai Jiao Tong University, Shanghai, China, in 2009. He is currently an associate researcher at the National Research Center of Parallel Computer Engineering and Technology, Beijing, China. His main research interests include high-performance computing and distributed storage systems.

Xin CHEN received his BE degree from the National Digital Switching System Engineering & Technological Research Center (NDSC), Zhengzhou, China, in 2016, and his MS degree from NDSC in 2018. He is a research assistant at the National Research Center of Parallel Computer Engineering and Technology, Beijing, China. His research activities focus on high-performance parallel computation and applications.

Xin LIU received her PhD degree from PLA Information Engineering University, Zhengzhou, China, in 2006. She is currently a research fellow at the National Research Center of Parallel Computer Engineering and Technology, Beijing, China. She is a designer of the scientific and engineering application platform of the Sunway TaihuLight System, responsible for the large-scale parallel algorithm research and application software development. Her research interests include parallel algorithms and parallel application software.

Dexun CHEN received his PhD degree from Tsinghua University, Beijing, China, in 2021. He is currently a research fellow at the National Research Center of Parallel Computer Engineering and Technology, Beijing, China. His research interests include high-performance computing and parallel application software.

Project supported by the Key R&D Program of Zhejiang Province, China (No. 2022C01250) and the National Key R&D Program of China (No. 2019YFA0709402).

About this article

Cite this article

He, X., Chen, X., Guo, H. et al. Scalability and efficiency challenges for the exascale supercomputing system: practice of a parallel supporting environment on the Sunway exascale prototype system. Front Inform Technol Electron Eng 24, 41–58 (2023). https://doi.org/10.1631/FITEE.2200412

  • DOI: https://doi.org/10.1631/FITEE.2200412
