
Towards optimized tensor code generation for deep learning on Sunway many-core processor

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

The flourishing of deep learning frameworks and hardware platforms has created demand for an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. Meanwhile, the Sunway many-core processor renders itself a competitive candidate due to its attractive computational power for both scientific computing and deep learning workloads. This paper combines the trends in these two directions. Specifically, we propose swTVM, which extends the original TVM to support ahead-of-time compilation for architectures that require cross-compilation, such as Sunway. In addition, we leverage architectural features during compilation, such as the core group for massive parallelism, DMA for high-bandwidth memory transfer, and local device memory for data locality, in order to generate efficient code for deep learning workloads on Sunway. Experimental results show that the code generated by swTVM achieves a 1.79× improvement in inference latency on average compared to the state-of-the-art deep learning framework on Sunway, across eight representative benchmarks. This work is the first attempt from the compiler perspective to bridge the gap between deep learning and the Sunway processor, with both productivity and efficiency in mind. We believe it will encourage more people to embrace the power of deep learning and the Sunway many-core processor.
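To make the abstract's compilation flow concrete, below is a minimal sketch using the open-source TVM tensor-expression API, assuming a TVM release that still ships the classic te scheduling interface the paper extends. It is an illustration, not swTVM itself: the matrix sizes, the 32×32 tile factors, and the stock "c" code-generation target are hypothetical stand-ins for swTVM's LDM-aware blocking and its ahead-of-time Sunway backend.

```python
# Illustrative sketch only: stock TVM te API, not the swTVM backend.
import tvm
from tvm import te

# A 1024x1024 matrix multiply expressed as a tensor computation.
M, K, N = 1024, 1024, 1024
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
               name="C")

s = te.create_schedule(C.op)
# Tile the output so each block's working set fits in a small fast buffer;
# on Sunway this role is played by the 64 KB local device memory (LDM) of
# each compute core, filled via DMA.
io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1],
                           x_factor=32, y_factor=32)
# Parallelize the outer tiles, standing in for distributing tiles across
# the 64 compute cores of a Sunway core group.
s[C].parallel(io)

# Emit C source rather than machine code for the host, so the result can
# be cross-compiled by the target platform's native toolchain -- the
# ahead-of-time route the paper takes for Sunway.
mod = tvm.build(s, [A, B, C], target="c")
print(mod.get_source()[:400])
```

Printing the generated source merely demonstrates the ahead-of-time artifact; on Sunway, this C code would then be compiled by the platform's native toolchain, which is why architectures requiring cross-compilation need an AOT path rather than TVM's usual on-host build-and-run flow.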



Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2020YFB1506703), the National Natural Science Foundation of China (Grant Nos. 62072018 and 61732002), the State Key Laboratory of Software Development Environment (No. SKLSDE-2021ZX-06), and the Fundamental Research Funds for the Central Universities.

Author information


Corresponding author

Correspondence to Hailong Yang.

Additional information

Mingzhen Li is a PhD student in the School of Computer Science and Engineering, Beihang University, China. He is currently working on identifying performance opportunities for scientific applications. His research interests include deep learning systems, performance optimization, and code generation.

Changxi Liu is a PhD student at the National University of Singapore, Singapore. He received his Master's and Bachelor's degrees in computer science from Beihang University, China. His research interests include simulation, compilers, high-performance computing, and computer architecture exploration.

Jianjin Liao is a master's student in the School of Computer Science and Engineering, Beihang University, China. He is currently working on performance optimization for scientific applications and deep learning. His research interests include performance optimization and deep learning compilation optimization.

Xuegui Zheng is a master's student in the School of Computer Science and Engineering, Beihang University, China. His research interests include compilers and performance optimization. He received his Bachelor's degree in computer science and technology from Fuzhou University, China.

Hailong Yang is an associate professor in the School of Computer Science and Engineering, Beihang University, China. He received his PhD from the School of Computer Science and Engineering, Beihang University, China in 2014. His research interests include parallel and distributed computing, HPC, performance optimization, and energy efficiency.

Rujun Sun is a PhD student in the State Key Laboratory of Mathematical Engineering and Advanced Computing, China. Her research interests include high performance computing, deep learning, and computation models.

Jun Xu is a senior engineer in the Beijing Simulation Center of the Second Institute of CASIC, China. She received her PhD in computer science and technology from Zhejiang University, China in 2011. Her research interest is the modeling and simulation of weapon equipment systems.

Lin Gan is an assistant researcher in the Department of Computer Science and Technology at Tsinghua University, China and the assistant director of the National Supercomputing Center in China. His research interests include high performance computing solutions based on hybrid platforms such as GPUs, FPGAs, and Sunway CPUs. He received his PhD in computer science from Tsinghua University, China. He is the recipient of the 2016 ACM Gordon Bell Prize, a 2017 ACM Gordon Bell Prize finalist nomination, the 2018 IEEE-CS TCHPC Early Career Researchers Award for Excellence in HPC, and the Most Significant Paper Award in 25 Years from FPL 2015. He is a member of IEEE.

Guangwen Yang is a professor in the Department of Computer Science and Technology at Tsinghua University, China and the director of the National Supercomputing Center in China. His research interests include parallel algorithms, cloud computing, and the earth system model. He received his PhD in computer science from Tsinghua University, China. He received the ACM Gordon Bell Prize in 2016 and 2017, and the Most Significant Paper Award in 25 Years from FPL 2015. He is a member of IEEE.

Zhongzhi Luan received his PhD from the School of Computer Science, Xi'an Jiaotong University, China. He is an associate professor of computer science and engineering and the assistant director of the Sino-German Joint Software Institute (JSI) Laboratory at Beihang University, China. Since 2003, his research interests have included distributed computing, parallel computing, grid computing, HPC, and the new generation of network technology.

Depei Qian is a professor at the Department of Computer Science and Engineering, Beihang University, China. He received his master's degree from the University of North Texas, USA in 1984. He is currently serving as the chief scientist of the China National High Technology Program (863 Program) on high productivity computer and service environment. He is also a fellow of the China Computer Federation (CCF). His research interests include innovative technologies in distributed computing, high performance computing, and computer architecture.



About this article


Cite this article

Li, M., Liu, C., Liao, J. et al. Towards optimized tensor code generation for deep learning on Sunway many-core processor. Front. Comput. Sci. 18, 182101 (2024). https://doi.org/10.1007/s11704-022-2440-7
