ABSTRACT
Compiling sequential C programs using LLVM for the wide Connex vector accelerator, a competitive, customizable architecture for embedded applications with 32 to 4,096 16-bit integer lanes, is challenging.
Our compiler targets Opincaa, a JIT assembler and coordination C++ library for Connex, which can run programs that are portable with respect to the vector width. For this to work, our back end must handle symbolic C/C++ expressions, represented as adjacent inline assembly strings, which are used as scalar immediate operands in the vector code.
In addition, our Connex back end must lower code to efficiently emulate arithmetic operations on non-native types such as 32-bit integers and 16-bit floating point. To simplify the compiler writer's work, we devise a method that generates the code lowering these operations inside LLVM's instruction selection pass.
We report speedup factors of up to 12.24 when running on a Connex processor with 128 lanes compared to a dual-core ARM Cortex-A9 clocked at a 6.67 times higher frequency, and an average energy-efficiency improvement of 1.07 times. Note, however, that a Connex IC can achieve an order of magnitude better energy efficiency than our FPGA implementation.
Compiling Efficiently with Arithmetic Emulation for the Custom-Width Connex Vector Processor