
Towards optimized tensor code generation for deep learning on Sunway many-core processor

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

The flourishing of deep learning frameworks and hardware platforms has created demand for an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. Meanwhile, the Sunway many-core processor renders itself a competitive candidate due to its attractive computational power for both scientific computing and deep learning workloads. This paper combines the trends in these two directions. Specifically, we propose swTVM, which extends the original TVM to support ahead-of-time compilation for architectures that require cross-compilation, such as Sunway. In addition, we leverage architectural features during compilation, such as the core group for massive parallelism, DMA for high-bandwidth memory transfer, and local device memory for data locality, in order to generate efficient code for deep learning workloads on Sunway. Experimental results show that the code generated by swTVM achieves a 1.79× improvement in inference latency on average compared to the state-of-the-art deep learning framework on Sunway, across eight representative benchmarks. This work is the first attempt from the compiler perspective to bridge the gap between deep learning and the Sunway processor, with both productivity and efficiency in mind. We believe it will encourage more people to embrace the power of deep learning and the Sunway many-core processor.
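To make the abstract's compilation flow concrete, below is a minimal sketch using the open-source TVM tensor-expression API, assuming a TVM release that still ships the classic te scheduling interface the paper extends. It is an illustration, not swTVM itself: the matrix sizes, the 32×32 tile factors, and the stock "c" code-generation target are hypothetical stand-ins for swTVM's LDM-aware blocking and its ahead-of-time Sunway backend.

```python
# Illustrative sketch only: stock TVM te API, not the swTVM backend.
import tvm
from tvm import te

# A 1024x1024 matrix multiply expressed as a tensor computation.
M, K, N = 1024, 1024, 1024
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
               name="C")

s = te.create_schedule(C.op)
# Tile the output so each block's working set fits in a small fast buffer;
# on Sunway this role is played by the 64 KB local device memory (LDM) of
# each compute core, filled via DMA.
io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1],
                           x_factor=32, y_factor=32)
# Parallelize the outer tiles, standing in for distributing tiles across
# the 64 compute cores of a Sunway core group.
s[C].parallel(io)

# Emit C source rather than machine code for the host, so the result can
# be cross-compiled by the target platform's native toolchain -- the
# ahead-of-time route the paper takes for Sunway.
mod = tvm.build(s, [A, B, C], target="c")
print(mod.get_source()[:400])
```

Printing the generated source merely demonstrates the ahead-of-time artifact; on Sunway, this C code would then be compiled by the platform's native toolchain, which is why architectures requiring cross-compilation need an AOT path rather than TVM's usual on-host build-and-run flow.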



Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2020YFB1506703), the National Natural Science Foundation of China (Grant Nos. 62072018 and 61732002), the State Key Laboratory of Software Development Environment (No. SKLSDE-2021ZX-06), and the Fundamental Research Funds for the Central Universities.

Author information


Corresponding author

Correspondence to Hailong Yang.

Additional information

Mingzhen Li is a PhD student in the School of Computer Science and Engineering, Beihang University, China. He is currently working on identifying performance opportunities for scientific applications. His research interests include deep learning systems, performance optimization, and code generation.

Changxi Liu is a PhD student at the National University of Singapore, Singapore. He received his Master's and Bachelor's degrees in computer science from Beihang University, China. His research interests include simulation, compilers, high-performance computing, and computer architecture exploration.

Jianjin Liao is a master's student in the School of Computer Science and Engineering, Beihang University, China. He is currently working on performance optimization for scientific applications and deep learning. His research interests include performance optimization and deep learning compilation optimization.

Xuegui Zheng is a master's student in the School of Computer Science and Engineering, Beihang University, China. His research interests include compilers and performance optimization. He received his Bachelor's degree in computer science and technology from Fuzhou University, China.

Hailong Yang is an associate professor in the School of Computer Science and Engineering, Beihang University, China. He received his PhD from the School of Computer Science and Engineering, Beihang University, China in 2014. His research interests include parallel and distributed computing, HPC, performance optimization, and energy efficiency.

Rujun Sun is a PhD student in the State Key Laboratory of Mathematical Engineering and Advanced Computing, China. Her research interests include high performance computing, deep learning, and computation models.

Jun Xu is a senior engineer in the Beijing Simulation Center of the Second Institute of CASIC, China. She received her PhD in computer science and technology from Zhejiang University, China in 2011. Her research interest is the modeling and simulation of weapon equipment systems.

Lin Gan is an assistant researcher in the Department of Computer Science and Technology at Tsinghua University, China and the assistant director of the National Supercomputing Center in China. His research interests include high performance computing solutions based on hybrid platforms such as GPUs, FPGAs, and Sunway CPUs. He received his PhD in computer science from Tsinghua University, China. He is the recipient of the 2016 ACM Gordon Bell Prize, a 2017 ACM Gordon Bell Prize finalist nomination, the 2018 IEEE-CS TCHPC Early Career Researchers Award for Excellence in HPC, and the Most Significant Paper Award in 25 Years from FPL 2015. He is a member of IEEE.

Guangwen Yang is a professor in the Department of Computer Science and Technology at Tsinghua University, China and the director of the National Supercomputing Center in China. His research interests include parallel algorithms, cloud computing, and the earth system model. He received his PhD in computer science from Tsinghua University, China. He received the ACM Gordon Bell Prize in 2016 and 2017, and the Most Significant Paper Award in 25 Years from FPL 2015. He is a member of IEEE.

Zhongzhi Luan received his PhD from the School of Computer Science, Xi'an Jiaotong University, China. He is an associate professor of computer science and engineering and the assistant director of the Sino-German Joint Software Institute (JSI) Laboratory at Beihang University, China. Since 2003, his research interests have included distributed computing, parallel computing, grid computing, HPC, and the new generation of network technology.

Depei Qian is a professor at the Department of Computer Science and Engineering, Beihang University, China. He received his master's degree from the University of North Texas, USA in 1984. He is currently serving as the chief scientist of the China National High Technology Program (863 Program) on high productivity computer and service environment. He is also a fellow of the China Computer Federation (CCF). His research interests include innovative technologies in distributed computing, high performance computing, and computer architecture.



About this article


Cite this article

Li, M., Liu, C., Liao, J. et al. Towards optimized tensor code generation for deep learning on Sunway many-core processor. Front. Comput. Sci. 18, 182101 (2024). https://doi.org/10.1007/s11704-022-2440-7
