Establishing high performance AI ecosystem on Sunway platform

  • Regular Paper
  • Published in: CCF Transactions on High Performance Computing

Abstract

To meet the demand for large-scale computing power when training complex deep neural networks (DNNs), we establish an AI ecosystem on the Sunway platform that utilizes the Sunway series of high-performance computers (HPC). We provide a specially optimized accelerating library for DNN operators on Sunway, named SWDNNv2, which supports both single precision and half precision. Based on this highly efficient library, we refactor the PyTorch framework to fit the Sunway platform by adopting hardware-specific acceleration and MPI backend support. A lightweight, Python-interfaced framework named SWMind is also developed from scratch to provide higher performance for certain domain models. Techniques for training large models are also discussed, including mixed-precision training and hybrid parallelism. The toolkits in the AI ecosystem have been applied to real projects, such as training a large-scale multi-modality model. We have trained a 1-billion-parameter model and achieved performance relatively close to that of the NVIDIA Tesla V100. The high efficiency of SWDNNv2 is demonstrated by the performance of the GEMM operator, which achieves 88.23% and 84.5% of the theoretical peak FP32 and FP16 FLOPS of the SW many-core CPU, respectively. The evaluation also shows the scalability of the AI framework by training a ResNet-50 model, with parallel efficiency reaching 91.51% when scaling to 1024 CPUs.
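As a rough illustration of the mixed-precision and data-parallel training techniques mentioned above, the sketch below shows a single SGD step with static loss scaling, FP32 master weights, and MPI-style gradient averaging through torch.distributed. It is a minimal sketch written against stock PyTorch, not the actual SWDNNv2/SWMind implementation; names such as mixed_precision_sgd_step, model_fp16, and master_params are illustrative placeholders.

```python
import torch
import torch.distributed as dist

def mixed_precision_sgd_step(master_params, model_fp16, batch, targets,
                             loss_fn, lr=0.1, loss_scale=1024.0):
    """One training step: FP16 forward/backward with loss scaling, FP32 master update."""
    # Forward and backward in half precision; scaling the loss keeps small
    # gradients from underflowing to zero in FP16.
    loss = loss_fn(model_fp16(batch.half()), targets)
    (loss * loss_scale).backward()

    for master, p in zip(master_params, model_fp16.parameters()):
        grad = p.grad.float() / loss_scale    # unscale the gradient in FP32
        if dist.is_initialized():             # data-parallel averaging (e.g. over an MPI backend)
            dist.all_reduce(grad, op=dist.ReduceOp.SUM)
            grad /= dist.get_world_size()
        master.data.add_(grad, alpha=-lr)     # plain SGD update on the FP32 master weight
        p.data.copy_(master.data.half())      # refresh the FP16 working copy
        p.grad = None
    return loss.item()
```

With static scaling, a factor such as 1024 is fixed up front; dynamic loss scaling instead adjusts the factor at run time when overflows are detected. Hybrid parallelism combines this data-parallel dimension with model or pipeline partitioning across processes.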



Acknowledgements

This work is partially supported by the PACMAN (Parallel Architecture and Compiler technology of Mobile, Accelerated, and Networked systems) Laboratory of Tsinghua University. The authors thank Ma Zixuan, Qiu Jiezhong, He Jiaao, and their team for their support and cooperation.

Author information

Corresponding author

Correspondence to Xin Liu.

About this article

Cite this article

Liu, S., Gao, J., Liu, X. et al. Establishing high performance AI ecosystem on Sunway platform. CCF Trans. HPC 3, 224–241 (2021). https://doi.org/10.1007/s42514-021-00072-x
