ABSTRACT
In recent years, memory-intensive operations have become a dominant factor in the efficiency of novel neural networks. Just-in-time operator fusion on accelerators such as GPUs has proven an effective optimization for memory-intensive operations, and it adapts well to the wide variety of model structures. In particular, we find that memory-intensive operations on tensor views are ubiquitous in neural network implementations. Tensors are the de facto representation of numerical data in deep learning, and tensor views provide a rich set of syntax that allows the underlying tensor data to be interpreted in various ways without memory copies. Supporting views in deep learning compilers can greatly enlarge the scope of operator fusion and is attractive for optimizing novel neural networks. Nevertheless, mainstream solutions in state-of-the-art deep learning compilers fall short either in representing view syntax or in fusing operators across views. In this article, we propose EasyView, which enables and schedules tensor views in an end-to-end workflow from neural networks down to devices. Aiming to maximize memory utilization and reduce data movement, we categorize the view contexts that arise in high-level languages and lower views according to each scenario. The reference semantics of views are preserved when lowering from native high-level language features to intermediate representations. Based on these preserved reference semantics, memory activities related to read and write dependences are tracked for further compute and memory optimization. In addition, extensive operator fusion is applied to memory-intensive operations involving views. In our tests, the proposed work achieves average speedups of 5.63X, 2.44X, and 4.67X over XLA, JAX, and TorchScript, respectively, on hotspot Python functions. Furthermore, operator fusion with views brings an 8.02% performance improvement on end-to-end neural networks.
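The view behavior described above can be illustrated with NumPy, whose ndarray views exhibit the same reference semantics the abstract discusses (a minimal sketch for illustration, not EasyView's implementation):

```python
import numpy as np

# A view reinterprets the same underlying buffer without copying it.
base = np.arange(12, dtype=np.float32)

# reshape, transpose, and basic slicing all return views that
# share base's memory rather than allocating new storage.
m = base.reshape(3, 4)   # view: new shape, same data
t = m.T                  # view: transposed strides, same data
col = m[:, 1]            # view: strided slice, same data

# A write through one view is visible through every other view of
# the same buffer -- this is why a compiler must track read/write
# dependences across views before fusing operators.
col[:] = -1.0
assert base[1] == -1.0 and base[5] == -1.0 and base[9] == -1.0

# np.shares_memory confirms that no copy was made.
assert np.shares_memory(base, m) and np.shares_memory(base, t)
```

A compiler that lowers such views as opaque copies would both waste memory and break this aliasing behavior, which is why preserving reference semantics in the intermediate representation matters for correctness as well as for fusion scope.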
- Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
- Akshay Agrawal, Akshay Naresh Modi, Alexandre Passos, Allen Lavoie, Ashish Agarwal, Asim Shankar, Igor Ganichev, Josh Levenberg, Mingsheng Hong, Rajat Monga, 2019. TensorFlow Eager: A multi-stage, Python-embedded DSL for machine learning. arXiv preprint arXiv:1903.01855 (2019).
- Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. 2019. YOLACT: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9157–9166.
- Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In European Conference on Computer Vision. Springer, 213–229.
- Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578–594.
- Roy Frostig, Matthew James Johnson, and Chris Leary. 2018. Compiling machine learning programs via high-level tracing. Systems for Machine Learning (2018).
- Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 2961–2969.
- JAX authors. 2020. JAX Quickstart. https://jax.readthedocs.io/en/latest/notebooks/quickstart.html.
- Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. 2015. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. 1–6.
- Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2021. MLIR: Scaling compiler infrastructure for domain specific computation. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2–14.
- Zhihao Li, Haipeng Jia, Yunquan Zhang, Tun Chen, Liang Yuan, Luning Cao, and Xiao Wang. 2019. AutoFFT: A template-based FFT codes auto-generation framework for ARM and X86 CPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–15.
- Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision. Springer, 21–37.
- mmdetection contributors. 2022. MMDetection Quickstart. https://github.com/open-mmlab/mmdetection.
- NVIDIA Corporation. 2022. Quickstart. https://developer.nvidia.com/cudnn.
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019), 8026–8037.
- Bo Qiao, Oliver Reiche, Frank Hannig, and Jürgen Teich. 2018. Automatic kernel fusion for image processing DSLs. In Proceedings of the 21st International Workshop on Software and Compilers for Embedded Systems. 76–85.
- Bo Qiao, Oliver Reiche, Frank Hannig, and Jürgen Teich. 2019. From loop fusion to kernel fusion: A domain-specific approach to locality optimization. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 242–253.
- Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices 48, 6 (2013), 519–530.
- Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28 (2015), 91–99.
- Barry K Rosen, Mark N Wegman, and F Kenneth Zadeck. 1988. Global value numbers and redundant computations. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. 12–27.
- Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Garret Catron, Summer Deng, Roman Dzhabarov, Nick Gibson, James Hegeman, Meghan Lele, Roman Levenstein, 2018. Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907 (2018).
- The CUTLASS Contributors. 2021. Quickstart. https://github.com/NVIDIA/cutlass/blob/master/media/docs/quickstart.md.
- The NumPy Community. 2021. Indexing on ndarrays. https://numpy.org/doc/stable/user/basics.indexing.html#basics-indexing.
- The Torch Contributors. 2019. TorchScript. https://pytorch.org/docs/stable/jit.html.
- The XLA Community. 2021. XLA: Optimizing Compiler for Machine Learning. https://www.tensorflow.org/xla.
- Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. 2011. The NumPy array: A structure for efficient numerical computation. Computing in Science & Engineering 13, 2 (2011), 22–30.
- Aravind Vasudevan, Andrew Anderson, and David Gregg. 2017. Parallel multi channel convolution using general matrix multiplication. In 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 19–24.
- Mohamed Wahib and Naoya Maruyama. 2014. Scalable kernel fusion for memory-bound GPU applications. In SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 191–202.
- Wolfram Language contributors. 2022. Wolfram Language Guide: Functional Programming. https://reference.wolfram.com/language/guide/FunctionalProgramming.html.
- Haicheng Wu, Gregory Diamos, Srihari Cadambi, and Sudhakar Yalamanchili. 2012. Kernel weaver: Automatically fusing database primitives for efficient GPU computation. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 107–118.
- Da Yan, Wei Wang, and Xiaowen Chu. 2020. Optimizing batched Winograd convolution on GPUs. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 32–44.
- Jie Zhao and Peng Di. 2020. Optimizing the memory hierarchy by compositing automatic transformations on computations and data. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 427–441.
- Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, 2020. Ansor: Generating high-performance tensor programs for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 863–879.
- Zhen Zheng, Xuanda Yang, Pengzhan Zhao, Guoping Long, Kai Zhu, Feiwen Zhu, Wenyi Zhao, Xiaoyong Liu, Jun Yang, Jidong Zhai, 2022. AStitch: Enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 359–373.
Index Terms
- EasyView: Enabling and Scheduling Tensor Views in Deep Learning Compilers