Establishing high performance AI ecosystem on Sunway platform

  • Regular Paper
  • Published in: CCF Transactions on High Performance Computing

Abstract

To meet the demand for large-scale computing power when training complex deep neural networks (DNNs), we establish an AI ecosystem on the Sunway platform that utilizes the Sunway series of high-performance computers (HPC). We provide a specially optimized accelerating library for DNN operators on Sunway, named SWDNNv2, which supports both single precision and half precision. Based on this highly efficient library, we refactor the PyTorch framework to fit the Sunway platform by adopting hardware-specific acceleration and MPI backend support. A lightweight, Python-interfaced framework named SWMind is also developed from scratch to provide higher performance for certain domain models. Techniques for training large models are also discussed, including mixed-precision training and hybrid parallelism. The toolkits in the AI ecosystem have been applied to real projects, such as training a large-scale multi-modality model. We have trained a 1-billion-parameter model and achieved performance relatively close to that of the NVIDIA Tesla V100. The high efficiency of SWDNNv2 is demonstrated by the performance of the GEMM operator, which achieves 88.23% and 84.5% of the theoretical peak FP32 and FP16 FLOPS of the SW many-core CPU, respectively. The evaluation also shows the scalability of the AI framework by training a ResNet-50 model, with parallel efficiency reaching 91.51% when scaling to 1024 CPUs.
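As a rough illustration of the mixed-precision and data-parallel training techniques mentioned above, the sketch below shows a single SGD step with static loss scaling, FP32 master weights, and MPI-style gradient averaging through torch.distributed. It is a minimal sketch written against stock PyTorch, not the actual SWDNNv2/SWMind implementation; names such as mixed_precision_sgd_step, model_fp16, and master_params are illustrative placeholders.

```python
import torch
import torch.distributed as dist

def mixed_precision_sgd_step(master_params, model_fp16, batch, targets,
                             loss_fn, lr=0.1, loss_scale=1024.0):
    """One training step: FP16 forward/backward with loss scaling, FP32 master update."""
    # Forward and backward in half precision; scaling the loss keeps small
    # gradients from underflowing to zero in FP16.
    loss = loss_fn(model_fp16(batch.half()), targets)
    (loss * loss_scale).backward()

    for master, p in zip(master_params, model_fp16.parameters()):
        grad = p.grad.float() / loss_scale    # unscale the gradient in FP32
        if dist.is_initialized():             # data-parallel averaging (e.g. over an MPI backend)
            dist.all_reduce(grad, op=dist.ReduceOp.SUM)
            grad /= dist.get_world_size()
        master.data.add_(grad, alpha=-lr)     # plain SGD update on the FP32 master weight
        p.data.copy_(master.data.half())      # refresh the FP16 working copy
        p.grad = None
    return loss.item()
```

With static scaling, a factor such as 1024 is fixed up front; dynamic loss scaling instead adjusts the factor at run time when overflows are detected. Hybrid parallelism combines this data-parallel dimension with model or pipeline partitioning across processes.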



Acknowledgements

This work is partially supported by the PACMAN (Parallel Architecture and Compiler technology of Mobile, Accelerated, and Networked systems) Laboratory of Tsinghua University. The authors thank Ma Zixuan, Qiu Jiezhong, He Jiaao, and their team for their support and cooperation.

Author information

Corresponding author

Correspondence to Xin Liu.

About this article

Cite this article

Liu, S., Gao, J., Liu, X. et al. Establishing high performance AI ecosystem on Sunway platform. CCF Trans. HPC 3, 224–241 (2021). https://doi.org/10.1007/s42514-021-00072-x
