research-article

Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping

Authors: Carl-Johannes Johnsen, Tiziano De Matteis, Johannes de Fine Licht, Torsten Hoefler

ICCAD '22: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design

Article No.: 85, Pages 1 - 9

https://doi.org/10.1145/3508352.3549374

Published: 22 December 2022

Abstract

The multi-pumping resource sharing technique can overcome limitations commonly found in single-clocked FPGA designs by allowing hardware components to operate at a higher clock frequency than the surrounding system. However, this optimization cannot be expressed at high levels of abstraction, such as high-level synthesis (HLS), and instead requires hand-optimized register-transfer level (RTL) code. In this paper we show how to leverage multiple clock domains for computational subdomains on reconfigurable devices through data movement analysis on high-level programs. We offer a novel view of multi-pumping as a compiler optimization, a superclass of traditional vectorization: as multiple data elements are fed and consumed, the computations are packed temporally rather than spatially. The optimization is applied automatically using an intermediate representation that maps high-level code to HLS. Internally, the optimization injects modules into the generated designs, incorporating RTL for fine-grained control over the clock domains. We obtain a reduction in resource consumption of up to 50% on critical components and 23% on average. For scalable designs, this can enable further parallelism, increasing overall performance.
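
The contrast between spatial and temporal packing described in the abstract can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the authors' compiler pass or generated hardware: the function names, the `pump_factor` parameter, and the cycle and unit counting are hypothetical and chosen for illustration. The model assumes the fast clock domain runs `pump_factor` times faster than the surrounding slow domain, and that each wide input word is split into narrower pieces that are fed through fewer compute units on successive fast-clock ticks.

```python
from typing import Callable, List, Tuple

Result = Tuple[List[float], int, int]  # (outputs, slow cycles, compute units)


def spatial_vectorized(data: List[float], op: Callable[[float], float],
                       width: int) -> Result:
    """Traditional vectorization: `width` compute units work side by side."""
    out: List[float] = []
    slow_cycles = 0
    for i in range(0, len(data), width):
        # One slow-clock cycle: every lane has its own copy of the unit.
        out.extend(op(x) for x in data[i:i + width])
        slow_cycles += 1
    return out, slow_cycles, width


def temporal_vectorized(data: List[float], op: Callable[[float], float],
                        width: int, pump_factor: int) -> Result:
    """Multi-pumping model: fewer units, reused on several fast-clock ticks."""
    if width % pump_factor != 0:
        raise ValueError("width must be divisible by pump_factor")
    units = width // pump_factor  # compute units actually instantiated
    out: List[float] = []
    slow_cycles = 0
    for i in range(0, len(data), width):
        wide_word = data[i:i + width]
        # Assumption: the fast domain ticks `pump_factor` times per slow
        # cycle, and each tick processes one narrow slice of the wide word.
        for tick in range(pump_factor):
            narrow = wide_word[tick * units:(tick + 1) * units]
            out.extend(op(x) for x in narrow)
        slow_cycles += 1
    return out, slow_cycles, units


if __name__ == "__main__":
    data = [float(x) for x in range(8)]
    print(spatial_vectorized(data, lambda x: 2 * x + 1, width=4))
    print(temporal_vectorized(data, lambda x: 2 * x + 1, width=4, pump_factor=2))
```

In this model, a vector width of 4 with a pump factor of 2 yields the same outputs in the same number of slow-clock cycles as the spatial version, while instantiating 2 compute units instead of 4. Real designs also need clock-domain crossing and data-width conversion logic, which the model does not account for.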

Index Terms

Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping
1. Hardware
  1. Electronic design automation
    1. High-level and register-transfer level synthesis
    2. Logic synthesis
      1. Circuit optimization
  2. Integrated circuits
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Retargetable compilers
      2. Source code generation

Published In

ICCAD '22: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design

October 2022

1467 pages

ISBN: 9781450392174

DOI: 10.1145/3508352

Conference Chair: Tulika Mitra (National University of Singapore)
Program Chairs: Evangeline Young (The Chinese University of Hong Kong), Jinjun Xiong (University at Buffalo (UB))

Copyright © 2022 ACM.

© 2022 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGDA: ACM Special Interest Group on Design Automation

In-Cooperation

IEEE-EDS: Electronic Devices Society
IEEE CAS
IEEE CEDA

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Swiss National Science Foundation (Ambizione Project)
European Research Council grant PSAP
Horizon Europe DEEP-SEA Programme
Innovation Fund Denmark

Conference

ICCAD '22: IEEE/ACM International Conference on Computer-Aided Design

Sponsor: SIGDA

October 30 - November 3, 2022

San Diego, California

Acceptance Rates

Overall Acceptance Rate 457 of 1,762 submissions, 26%
