Optimal Kernel Orchestration for Tensor Programs with Korch

Authors:
Muyan Hu, University of Illinois at Urbana-Champaign, Urbana Champaign, USA, https://orcid.org/0009-0001-4096-0511
Ashwin Venkatram, Advanced Micro Devices, San Jose, USA, https://orcid.org/0009-0005-4661-0060
Shreyashri Biswas, Carnegie Mellon University, Pittsburgh, USA, https://orcid.org/0009-0002-6656-1030
Balamurugan Marimuthu, Sambanova Systems, Palo Alto, USA, https://orcid.org/0000-0001-6292-5066
Bohan Hou, Carnegie Mellon University, Pittsburgh, USA, https://orcid.org/0000-0001-5718-3387
Gabriele Oliaro, Carnegie Mellon University, Pittsburgh, USA, https://orcid.org/0000-0001-5406-0736
Haojie Wang, Tsinghua University, Beijing, China, https://orcid.org/0000-0003-4605-148X
Liyan Zheng, Tsinghua University, Beijing, China, https://orcid.org/0000-0001-7327-748X
Xupeng Miao, Carnegie Mellon University, Pittsburgh, USA, https://orcid.org/0000-0002-9371-8358
Jidong Zhai, Tsinghua University, Beijing, China, https://orcid.org/0000-0002-7656-6428
Zhihao Jia, Carnegie Mellon University, Pittsburgh, United States of America, https://orcid.org/0000-0002-1270-5185

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, April 2024, Pages 755–769, https://doi.org/10.1145/3620666.3651383

Published: 27 April 2024

ABSTRACT

Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of optimization opportunities in kernel orchestration.

This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs. Instead of directly fusing operators, Korch first applies operator fission to decompose tensor operators into a small set of basic tensor algebra primitives. This decomposition enables a diversity of fine-grained, inter-operator optimizations. Next, Korch optimizes kernel orchestration by formalizing it as a constrained optimization problem, leveraging an off-the-shelf binary linear programming solver to discover an optimal orchestration strategy, and generating an executable that can be directly deployed on modern GPU platforms. Evaluation on a variety of DNNs shows that Korch outperforms existing tensor program optimizers by up to 1.7× on V100 GPUs and up to 1.6× on A100 GPUs. Korch is publicly available at https://github.com/humuyan/Korch.
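To make the constrained-optimization step concrete, the following toy sketch selects a minimum-cost set of kernels that together cover every primitive. It is an illustration only, not Korch's formulation: the primitive names, the candidate kernels, their latencies, and the simple cover constraint are all assumptions made for this example, and an exhaustive search over 0/1 assignments stands in for the off-the-shelf binary linear programming solver mentioned in the abstract.

```python
from itertools import product

# Hypothetical primitives, as if a softmax-like operator had been split
# into basic tensor algebra primitives (names are assumptions).
PRIMITIVES = ["exp", "reduce_sum", "div"]

# Hypothetical candidate kernels: name -> (primitives covered, latency).
# The latencies are made-up numbers, not measurements.
CANDIDATES = {
    "k_exp": ({"exp"}, 4.0),
    "k_sum": ({"reduce_sum"}, 5.0),
    "k_div": ({"div"}, 4.0),
    "k_exp_sum": ({"exp", "reduce_sum"}, 6.0),
    "k_sum_div": ({"reduce_sum", "div"}, 7.0),
    "k_all": ({"exp", "reduce_sum", "div"}, 11.0),
}


def solve(primitives, candidates):
    """Minimize total latency subject to every primitive being covered.

    Each candidate kernel gets one binary decision variable (0 = skip,
    1 = launch). All 2^n assignments are enumerated, which is only
    feasible for tiny inputs; a real solver would replace this loop.
    """
    names = sorted(candidates)
    required = set(primitives)
    best_choice, best_cost = None, float("inf")
    for bits in product((0, 1), repeat=len(names)):
        chosen = [name for name, bit in zip(names, bits) if bit]
        # Constraint: the union of the chosen kernels covers all primitives.
        covered = set().union(*(candidates[name][0] for name in chosen))
        if not required <= covered:
            continue
        # Objective: sum of the latencies of the chosen kernels.
        cost = sum(candidates[name][1] for name in chosen)
        if cost < best_cost:
            best_choice, best_cost = chosen, cost
    return best_choice, best_cost


if __name__ == "__main__":
    choice, cost = solve(PRIMITIVES, CANDIDATES)
    print(choice, cost)
```

With these made-up latencies the search picks `k_exp_sum` and `k_div` (total 10.0), which beats both the single fully fused kernel (11.0) and the fully unfused plan (13.0). This is the kind of choice a greedy fusion rule can miss.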

Published in
ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3
April 2024, 1106 pages
ISBN: 9798400703867
DOI: 10.1145/3620666
General Chairs: Nael Abu-Ghazaleh, Rajiv Gupta
Program Chairs: Madan Musuvathi, Dan Tsafrir
Copyright © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
tensor program
kernel orchestration
machine learning compiler
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 535 of 2,713 submissions, 20%