ABSTRACT
As hardware architectures evolve to support increasingly diverse and complex instruction sets to meet the performance demands of modern workloads such as image processing and deep learning, it has become ever more crucial for compilers to evolve their internal abstractions and to provide retargetable code generation that keeps pace with emerging instruction sets. We propose Hydride, a novel approach to compiling for complex, emerging hardware architectures. Hydride uses vendor-defined pseudocode specifications of multiple hardware ISAs to automatically design retargetable instructions for AutoLLVM IR, an extensible compiler IR that consists of formally defined, language-independent and target-independent LLVM IR instructions for compiling to those ISAs, and to automatically generate instruction selection passes that lower AutoLLVM IR to each of the specified hardware ISAs. Hydride also includes a code synthesizer that automatically generates code generation support for schedule-based languages, such as Halide, to optimally generate AutoLLVM IR. Our results show that Hydride can represent 3,557 instructions across the x86, Hexagon, and ARM architectures, covering the (Intel) SSE2, SSE4, AVX, AVX2, and AVX512, (Qualcomm) Hexagon HVX, and (ARM) NEON vector ISAs, using only 397 AutoLLVM IR instructions. We created a new Halide compiler with Hydride using only a formal semantics of Halide IR, leveraging the auto-generated AutoLLVM IR and back ends for the three hardware architectures. Across kernels from deep learning and image processing, this compiler performs just as well as the mature, production Halide compiler on Hexagon, and outperforms it on x86 by 8% and on ARM by 3%. Hydride also outperforms the production Halide's LLVM back end by 12% on x86, 100% on HVX, and 26% on ARM across the same kernels.
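The idea of covering many target instructions with a few parameterized IR instructions can be illustrated with a minimal Python sketch. It is not Hydride's implementation or AutoLLVM IR's actual definition. The intrinsic names are real vendor intrinsics, but their grouping under one operation, and the names `TargetInstruction`, `parametric_add`, and `select_instruction`, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetInstruction:
    """One concrete vendor instruction, described by its shape parameters."""
    name: str         # vendor intrinsic name
    vector_bits: int  # total register width in bits
    lane_bits: int    # width of each element in bits


# Illustrative grouping (an assumption): three lane-wise wrapping adds from
# different ISAs that differ only in register width and lane width.
TARGET_ADDS = [
    TargetInstruction("_mm_add_epi8", 128, 8),       # Intel SSE2
    TargetInstruction("_mm256_add_epi16", 256, 16),  # Intel AVX2
    TargetInstruction("vaddq_s32", 128, 32),         # ARM NEON
]


def parametric_add(a, b, lane_bits):
    """Hypothetical parameterized semantics: lane-wise add with wraparound.

    Lanes are modeled as unsigned integers; a wrapping add yields the same
    bit pattern whether the lanes are read as signed or unsigned.
    """
    mask = (1 << lane_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


def select_instruction(target_table, vector_bits, lane_bits):
    """Pick the concrete instruction matching the requested parameters."""
    for inst in target_table:
        if inst.vector_bits == vector_bits and inst.lane_bits == lane_bits:
            return inst.name
    return None  # no instruction with this shape in the table
```

In this sketch a single semantic definition stands in for all three concrete instructions, and lowering reduces to a lookup on the parameter values. The abstract reports the same kind of reduction at scale: 3,557 target instructions represented by 397 AutoLLVM IR instructions.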