DOI: 10.1145/3620666.3651333

Boost Linear Algebra Computation Performance via Efficient VNNI Utilization

Published: 27 April 2024

Abstract

Intel's Vector Neural Network Instruction (VNNI) provides higher efficiency for dense linear algebra (DLA) computations than conventional SIMD instructions. However, existing auto-vectorizers frequently deliver suboptimal utilization of VNNI, either failing to recognize VNNI's unique computation pattern in the innermost loops/basic blocks or producing inferior code through constrained, rudimentary peephole optimizations and pattern-matching techniques. Auto-tuning frameworks may generate proficient code but are hampered by the need for sophisticated pattern templates and extensive search processes.
This paper introduces a novel compilation methodology that generates high-performance VNNI-enabled code. Leveraging DLA's salient characteristics to identify opportunities for VNNI utilization, it pinpoints the most effective strategies for feeding VNNI's inputs and hiding VNNI's execution latency through efficient memory access, register/cache reuse, and exploited instruction-level parallelism. A tailored static cost analysis model guides this exploration, determining the critical parameters for effective code transformation and generation. The evaluation on DLA and DNN workloads shows that our framework outperforms state-of-the-art industrial compilers and research works.
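To make the targeted instruction concrete: VNNI's VPDPBUSD multiplies four adjacent pairs of unsigned and signed 8-bit values, sums the four products, and accumulates the result into a 32-bit lane. The following hand-written C sketch (not the paper's generated code; the function name, the u8/s8 operand roles, and the assumption that n is a multiple of 64 are illustrative) shows the int8 dot-product pattern via the AVX512-VNNI intrinsic _mm512_dpbusd_epi32:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch only: assumes AVX512F + AVX512-VNNI support
   (compile with e.g. -mavx512f -mavx512vnni) and n % 64 == 0. */
int32_t dot_u8s8(const uint8_t *a, const int8_t *b, size_t n) {
    __m512i acc = _mm512_setzero_si512();        /* 16 x int32 accumulators */
    for (size_t i = 0; i < n; i += 64) {
        __m512i va = _mm512_loadu_si512(a + i);  /* 64 unsigned bytes */
        __m512i vb = _mm512_loadu_si512(b + i);  /* 64 signed bytes */
        /* VPDPBUSD: per 32-bit lane, multiply 4 u8/s8 pairs, sum the
           4 products, and add the sum into the accumulator. */
        acc = _mm512_dpbusd_epi32(acc, va, vb);
    }
    return _mm512_reduce_add_epi32(acc);         /* horizontal sum of 16 lanes */
}

A production kernel of the kind the paper generates would additionally unroll with several independent accumulators to hide VPDPBUSD's execution latency and to exploit instruction-level parallelism and register/cache reuse, which this single-accumulator sketch omits.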



Published In

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3
April 2024
1106 pages
ISBN:9798400703867
DOI:10.1145/3620666
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article

Funding Sources

  • National Key Research & Development Program of China
  • NSFC
  • STCSM
  • Shanghai Science & Technology Development Funds
  • Shanghai Pujiang Program

Conference

ASPLOS '24

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%


Article Metrics

  • Total Citations: 0
  • Total Downloads: 265
  • Downloads (Last 12 months): 265
  • Downloads (Last 6 weeks): 10
Reflects downloads up to 17 Feb 2025
