DOI: 10.1145/3620666.3651333

Boost Linear Algebra Computation Performance via Efficient VNNI Utilization

Published: 27 April 2024

Abstract

Intel's Vector Neural Network Instruction (VNNI) provides higher efficiency for dense linear algebra (DLA) computations than conventional SIMD instructions. However, existing auto-vectorizers frequently deliver suboptimal utilization of VNNI, either failing to recognize VNNI's unique computation pattern in the innermost loops/basic blocks or producing inferior code through constrained, rudimentary peephole optimizations and pattern-matching techniques. Auto-tuning frameworks may generate proficient code but are hampered by the need for sophisticated pattern templates and extensive search processes.
This paper introduces a novel compilation methodology that generates high-performance VNNI-enabled code. Leveraging DLA's salient characteristics to identify opportunities for VNNI utilization, it pinpoints the most effective strategies for feeding VNNI's inputs and hiding VNNI's execution latency through efficient memory access, register/cache reuse, and exploited instruction-level parallelism. A tailored static cost analysis model guides this exploration, determining the critical parameters for effective code transformation and generation. The evaluation on DLA and DNN workloads shows that our framework outperforms state-of-the-art industrial compilers and research works.
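To make the targeted instruction concrete: VNNI's VPDPBUSD multiplies four adjacent pairs of unsigned and signed 8-bit values, sums the four products, and accumulates the result into a 32-bit lane. The following hand-written C sketch (not the paper's generated code; the function name, the u8/s8 operand roles, and the assumption that n is a multiple of 64 are illustrative) shows the int8 dot-product pattern via the AVX512-VNNI intrinsic _mm512_dpbusd_epi32:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch only: assumes AVX512F + AVX512-VNNI support
   (compile with e.g. -mavx512f -mavx512vnni) and n % 64 == 0. */
int32_t dot_u8s8(const uint8_t *a, const int8_t *b, size_t n) {
    __m512i acc = _mm512_setzero_si512();        /* 16 x int32 accumulators */
    for (size_t i = 0; i < n; i += 64) {
        __m512i va = _mm512_loadu_si512(a + i);  /* 64 unsigned bytes */
        __m512i vb = _mm512_loadu_si512(b + i);  /* 64 signed bytes */
        /* VPDPBUSD: per 32-bit lane, multiply 4 u8/s8 pairs, sum the
           4 products, and add the sum into the accumulator. */
        acc = _mm512_dpbusd_epi32(acc, va, vb);
    }
    return _mm512_reduce_add_epi32(acc);         /* horizontal sum of 16 lanes */
}

A production kernel of the kind the paper generates would additionally unroll with several independent accumulators to hide VPDPBUSD's execution latency and to exploit instruction-level parallelism and register/cache reuse, which this single-accumulator sketch omits.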



Published In

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3
April 2024
1106 pages
ISBN:9798400703867
DOI:10.1145/3620666
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article

Funding Sources

  • National Key Research & Development Program of China
  • NSFC
  • STCSM
  • Shanghai Science & Technology Development Funds
  • Shanghai Pujiang Program

Conference

ASPLOS '24

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%


Article Metrics

  • Total Citations: 0
  • Total Downloads: 265
  • Downloads (Last 12 months): 265
  • Downloads (Last 6 weeks): 10
Reflects downloads up to 17 Feb 2025
