research-article

Exploiting Task- and Data-Level Parallelism in Streaming Applications Implemented in FPGAs

Authors:

Zvonko Vranesic,

Stephen Brown

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 6, Issue 4

Article No.: 16, Pages 1 - 37

https://doi.org/10.1145/2535932

Published: 01 December 2013

Abstract

This article describes the design and implementation of a novel compilation flow that implements circuits in FPGAs from a streaming programming language. The supported streaming language is called FPGA Brook and is based on the existing Brook language. It allows system designers to express applications in a way that exposes parallelism, which can be exploited through hardware implementation. FPGA Brook supports replication, allowing parts of an application to be implemented as multiple hardware units operating in parallel. Hardware units are interconnected through FIFO buffers, which use the small memory modules available in FPGAs. The FPGA Brook automated design flow uses a source-to-source compiler, developed as a part of this work, and combines it with a commercial behavioral synthesis tool to generate the hardware implementation. A suite of benchmark applications was developed in FPGA Brook and implemented using our design flow. Experimental results indicate that the performance of many applications scales well with replication. Our benchmark applications also achieve significantly better results than corresponding implementations using a commercial behavioral synthesis tool. We conclude that using an automated design flow for implementation of streaming applications in FPGAs is a promising methodology.
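The replication idea in the abstract can be illustrated with a small software model. The sketch below is not FPGA Brook syntax and is not taken from the article; it only assumes that a kernel processes stream elements independently, that elements are dealt round-robin to the replicas, and that FIFOs sit between the units. The names `scale_kernel` and `run_replicated` are hypothetical.

```python
from collections import deque


def scale_kernel(x):
    """Hypothetical per-element kernel: each output depends on one input only."""
    return 2 * x + 1


def run_replicated(stream, kernel, replicas):
    """Software model of one kernel replicated `replicas` times (replicas >= 1).

    Elements are dealt round-robin into one input FIFO per replica, each
    replica drains its own FIFO, and a collector reads the output FIFOs in
    the same round-robin order so the result keeps the original stream order.
    """
    in_fifos = [deque() for _ in range(replicas)]
    out_fifos = [deque() for _ in range(replicas)]

    # Distributor: element i goes to replica i mod replicas.
    for i, item in enumerate(stream):
        in_fifos[i % replicas].append(item)

    # Replicas: hardware units would run concurrently; this model runs them in turn.
    for fin, fout in zip(in_fifos, out_fifos):
        while fin:
            fout.append(kernel(fin.popleft()))

    # Collector: pop in round-robin order to restore the stream order.
    result = []
    i = 0
    while out_fifos[i % replicas]:
        result.append(out_fifos[i % replicas].popleft())
        i += 1
    return result
```

Calling `run_replicated(range(10), scale_kernel, 4)` returns the same list as applying `scale_kernel` to each element in order. The model shows why replication is safe for kernels with independent elements, but it says nothing about timing or resource use on a real device.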

Cited By

Pervan B, Knezovic J (2020). A Survey on Parallel Architectures and Programming Models. 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO), 999-1005. Online publication date: 28-Sep-2020. https://doi.org/10.23919/MIPRO48935.2020.9245341
Rouhani B, Mirhoseini A, Songhori E, Koushanfar F (2016). Automated Real-Time Analysis of Streaming Big and Dense Data on Reconfigurable Platforms. ACM Transactions on Reconfigurable Technology and Systems 10:1, 1-22. Online publication date: 19-Dec-2016. https://dl.acm.org/doi/10.1145/2974023
Schmid M, Reiche O, Hannig F, Teich J (2015). Loop coarsening in C-based High-Level Synthesis. 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 166-173. Online publication date: Jul-2015. https://doi.org/10.1109/ASAP.2015.7245730

Published In

ACM Transactions on Reconfigurable Technology and Systems Volume 6, Issue 4

December 2013

89 pages

ISSN:1936-7406

EISSN:1936-7414

DOI:10.1145/2558905

Editor:
Steve Wilton
Department of Electrical and Computer Engineering / University of British Columbia / Kaiser 4112, 5500-2332 Main Mall / Vancouver, BC V6T 1Z4 Canada

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 2013

Accepted: 01 March 2013

Revised: 01 December 2012

Received: 01 August 2012

Published in TRETS Volume 6, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

Canadian Microelectronics Corporation
Altera Corporation
Natural Sciences and Engineering Research Council of Canada
University of Toronto
Edward S. Rogers Sr. Graduate Scholarship
