ABSTRACT
Reconfigurable devices are often employed in heterogeneous systems for their low power consumption and parallel processing capabilities. An important usability requirement for such systems is support for a homogeneous programming interface. However, a homogeneous programming interface does not eliminate the need for code tweaking to map computation efficiently across heterogeneous architectures. In this work we propose a code optimization framework that analyzes and restructures CUDA kernels optimized for GPU devices in order to facilitate synthesis of high-throughput custom accelerators on FPGAs. The proposed framework enables efficient performance porting without manual code tweaking or annotation by the user. It restructures the kernel for high-throughput execution on FPGAs using a hierarchical region graph in tandem with code motions and graph coloring of array variables.
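The abstract does not spell out how graph coloring is applied to array variables. One plausible reading, borrowed from graph-coloring register allocation, is that arrays whose live ranges never overlap can share the same on-chip buffer. The sketch below illustrates that idea only; it is not the framework's algorithm. The array names, the region-index live ranges, and the greedy heuristic are all assumptions made for illustration.

```python
from itertools import combinations


def build_interference(live_ranges):
    """Build an interference graph from inclusive (first, last) live ranges.

    Two arrays interfere when their live ranges overlap, meaning they
    cannot occupy the same physical buffer.
    """
    graph = {name: set() for name in live_ranges}
    for a, b in combinations(live_ranges, 2):
        a_first, a_last = live_ranges[a]
        b_first, b_last = live_ranges[b]
        if a_first <= b_last and b_first <= a_last:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_color(graph):
    """Assign each array the lowest color not used by any colored neighbor.

    Nodes are visited in order of decreasing degree (ties broken by name),
    a common heuristic; this is not an optimal coloring in general.
    """
    colors = {}
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        used = {colors[nbr] for nbr in graph[node] if nbr in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


# Hypothetical kernel arrays with live ranges expressed as region indices.
live_ranges = {
    "tile_a": (0, 2),
    "tile_b": (1, 3),
    "partial": (3, 5),
    "result": (4, 6),
}

graph = build_interference(live_ranges)
colors = greedy_color(graph)

# Arrays that received the same color can be mapped to one shared buffer.
buffers = {}
for name, color in colors.items():
    buffers.setdefault(color, []).append(name)

for color in sorted(buffers):
    print(f"buffer {color}: {sorted(buffers[color])}")
```

In this toy input, four arrays fit into two buffers because `tile_a` is dead before `partial` becomes live, and likewise `tile_b` before `result`. A real flow would derive live ranges from the kernel's control flow rather than from hand-written intervals.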
Index Terms
- Throughput-oriented kernel porting onto FPGAs