research-article

Throughput-oriented kernel porting onto FPGAs

Authors:
Alexandros Papakonstantinou, University of Illinois, Urbana-Champaign, IL
Deming Chen, University of Illinois, Urbana-Champaign, IL
Wen-Mei Hwu, University of Illinois, Urbana-Champaign, IL
Jason Cong, University of California, Los Angeles, California
Yun Liang, Peking University, Beijing, China

DAC '13: Proceedings of the 50th Annual Design Automation Conference
May 2013, Article No.: 11, Pages 1–10
https://doi.org/10.1145/2463209.2488747

Published: 29 May 2013

ABSTRACT

Reconfigurable devices are often employed in heterogeneous systems due to their low power and parallel processing advantages. An important usability requirement is the support of a homogeneous programming interface. Nevertheless, homogeneous programming interfaces do not eliminate the need for code tweaking to enable efficient mapping of the computation across heterogeneous architectures. In this work we propose a code optimization framework which analyzes and restructures CUDA kernels that are optimized for GPU devices in order to facilitate synthesis of high-throughput custom accelerators on FPGAs. The proposed framework enables efficient performance porting without manual code tweaking or annotation by the user. A hierarchical region graph in tandem with code motions and graph coloring of array variables is employed to restructure the kernel for high throughput execution on FPGAs.
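
The abstract mentions graph coloring of array variables but does not spell out the procedure. As a rough illustration only, the sketch below applies a generic greedy interference-graph coloring to arrays with known live ranges. The array names, the live ranges, the function `color_arrays`, and the reading that arrays with the same color could share one on-chip memory are all assumptions made for this example; none of them is taken from the paper.

```python
def color_arrays(live_ranges):
    """Greedy coloring of an interference graph built from live ranges.

    live_ranges maps a (hypothetical) array name to an inclusive
    (start, end) pair of program points. Two arrays interfere when
    their ranges overlap, and interfering arrays get different colors.
    """
    # Visit arrays in order of increasing start point (ties by end point).
    names = sorted(live_ranges, key=lambda n: live_ranges[n])

    # Build the interference graph: an edge for every overlapping pair.
    interferes = {n: set() for n in names}
    for i, a in enumerate(names):
        a_start, a_end = live_ranges[a]
        for b in names[i + 1:]:
            b_start, b_end = live_ranges[b]
            if a_start <= b_end and b_start <= a_end:
                interferes[a].add(b)
                interferes[b].add(a)

    # Give each array the smallest color not used by a colored neighbor.
    colors = {}
    for n in names:
        used = {colors[m] for m in interferes[n] if m in colors}
        c = 0
        while c in used:
            c += 1
        colors[n] = c
    return colors


# Hypothetical kernel arrays and live ranges, invented for illustration.
live_ranges = {
    "tile_a": (0, 3),
    "tile_b": (2, 5),
    "partial": (4, 7),
    "out_buf": (6, 9),
}
assignment = color_arrays(live_ranges)
```

In this toy input, `tile_a` and `partial` never overlap in time, so they receive the same color, as do `tile_b` and `out_buf`. Under the assumption stated above, four arrays would then need only two storage buffers.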


Index Terms

Throughput-oriented kernel porting onto FPGAs
1. Hardware
  1. Hardware validation
  2. Very large scale integration design
    1. Application-specific VLSI designs
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages


Published in

DAC '13: Proceedings of the 50th Annual Design Automation Conference
May 2013
1285 pages
ISBN: 9781450320719
DOI: 10.1145/2463209

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 29 May 2013
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,770 of 5,499 submissions, 32%