ABSTRACT
Compiling sequential C programs using LLVM for the wide Connex vector accelerator, a competitive, customizable architecture for embedded applications with 32 to 4,096 16-bit integer lanes, is challenging.
Our compiler targets Opincaa, a JIT assembler and coordination C++ library for Connex, which can run programs that are portable with respect to the vector width. For this to work, our back end must handle symbolic C/C++ expressions, represented as adjacent inline assembly strings, which are used as scalar immediate operands in the vector code.
In addition, our Connex back end must lower code to efficiently emulate arithmetic operations on non-native types such as 32-bit integers and 16-bit floating point. To simplify the compiler writer's work, we devise a method that generates the code lowering these operations inside LLVM's instruction selection pass.
We report speedup factors of up to 12.24 when running on a Connex processor with 128 lanes compared to a dual-core ARM Cortex-A9 clocked at a 6.67 times higher frequency, and an average energy-efficiency improvement of 1.07 times. Note, however, that a Connex IC can achieve an order of magnitude better energy efficiency than our FPGA implementation.
Compiling Efficiently with Arithmetic Emulation for the Custom-Width Connex Vector Processor