Abstract
High-performance computing and deep learning workloads have motivated the design of domain-specific processors. Although these processors offer promising computational capability, they are notorious for their exotic programming paradigms. To improve programming productivity and fully exploit their performance potential, domain-specific compilers (DSCs) have been proposed. However, building a DSC for an emerging processor traditionally requires tremendous engineering effort because the commonly used compilation stack is difficult to reuse. With the advent of multi-level intermediate representation (MLIR), DSC developers can instead extend reusable infrastructure with customized functionality rather than rebuilding the entire compilation stack. In this paper, we further demonstrate the effectiveness of MLIR by extending its reusable infrastructure to a heterogeneous many-core processor, the Sunway processor. In particular, we design a new Sunway dialect and a corresponding backend that fully exploit the processor's architectural advantages while hiding its programming complexity. To show how easily a DSC can then be built, we combine the Sunway dialect with existing MLIR dialects to construct a stencil compiler for the Sunway processor. Experimental results show that this stencil compiler, built with a reusable approach, can even outperform state-of-the-art stencil compilers.
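To make the dialect-based design concrete, the sketch below shows how a simple stencil kernel might look once lowered onto a Sunway-style dialect, written in MLIR's generic textual form. This is a minimal illustration, not the paper's actual dialect: every sunway.* op name (sunway.launch, sunway.dma_get, sunway.alloc_ldm, sunway.dma_put, sunway.terminator) and its operand layout is an assumption of ours, while func, affine, arith, and memref are real upstream MLIR dialects reused unchanged.

// A hypothetical 1D three-point stencil mapped onto a Sunway-style dialect.
// Every sunway.* op is an illustrative assumption written in MLIR's generic
// op form; func, affine, arith, and memref are real upstream dialects.
func.func @jacobi_1d(%in: memref<1024xf64>, %out: memref<1024xf64>) {
  // Launch a kernel region to be executed by the compute processing
  // elements (CPEs) of the Sunway processor.
  "sunway.launch"() ({
    // Stage this CPE's tile (with one halo point on each side) from main
    // memory into local device memory (LDM) via DMA.
    %ldm_in  = "sunway.dma_get"(%in) : (memref<1024xf64>) -> memref<66xf64>
    %ldm_out = "sunway.alloc_ldm"()  : () -> memref<64xf64>
    // Compute entirely out of LDM, reusing the standard affine and arith
    // dialects for the loop nest and arithmetic.
    affine.for %i = 1 to 65 {
      %l  = affine.load %ldm_in[%i - 1] : memref<66xf64>
      %c  = affine.load %ldm_in[%i] : memref<66xf64>
      %r  = affine.load %ldm_in[%i + 1] : memref<66xf64>
      %s0 = arith.addf %l, %c : f64
      %s1 = arith.addf %s0, %r : f64
      %k  = arith.constant 0.333333 : f64
      %v  = arith.mulf %s1, %k : f64
      affine.store %v, %ldm_out[%i - 1] : memref<64xf64>
    }
    // Write the finished tile back to main memory.
    "sunway.dma_put"(%ldm_out, %out) : (memref<64xf64>, memref<1024xf64>) -> ()
    "sunway.terminator"() : () -> ()
  }) : () -> ()
  return
}

Because MLIR parses unregistered ops in generic form (for example, under mlir-opt --allow-unregistered-dialect), even a hypothetical sketch like this round-trips through the infrastructure; a real Sunway dialect would additionally register these ops with verification rules and lowering passes toward the Sunway toolchain.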
Acknowledgements
This work was supported by the National Key Research and Development Program of China (Grant No. 2020YFB1506703), the National Natural Science Foundation of China (Grant Nos. 62072018, 61732002, U22A2028), the State Key Laboratory of Software Development Environment (Grant No. SKLSDE-2021ZX-06), and the Fundamental Research Funds for the Central Universities (Grant No. YWF-22-L-1127).
Cite this article
Li, M., Liu, Y., Chen, B. et al. Building a domain-specific compiler for emerging processors with a reusable approach. Sci. China Inf. Sci. 67, 112101 (2024). https://doi.org/10.1007/s11432-022-3727-6