Exploiting Task- and Data-Level Parallelism in Streaming Applications Implemented in FPGAs

Published: 01 December 2013

Abstract

This article describes the design and implementation of a novel compilation flow that implements circuits in FPGAs from a streaming programming language. The streaming language supported is called FPGA Brook and is based on the existing Brook language. It allows system designers to express applications in a way that exposes parallelism, which can be exploited through hardware implementation. FPGA Brook supports replication, allowing parts of an application to be implemented as multiple hardware units operating in parallel. Hardware units are interconnected through FIFO buffers which use the small memory modules available in FPGAs. The FPGA Brook automated design flow uses a source-to-source compiler, developed as a part of this work, and combines it with a commercial behavioral synthesis tool to generate the hardware implementation. A suite of benchmark applications was developed in FPGA Brook and implemented using our design flow. Experimental results indicate that performance of many applications scales well with replication. Our benchmark applications also achieve significantly better results than corresponding implementations using a commercial behavioral synthesis tool. We conclude that using an automated design flow for implementation of streaming applications in FPGAs is a promising methodology.
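
The replication idea in the abstract can be illustrated with a small software model. The sketch below is not FPGA Brook syntax and does not reproduce the compiler's output. It is a plain Python analogy in which a hypothetical per-element kernel (`scale_kernel`) is copied into several replicas, each fed through its own FIFO, with elements distributed and collected in round-robin order. The function names, the round-robin policy, and the use of `collections.deque` as a stand-in for on-chip FIFO buffers are all assumptions made for illustration.

```python
from collections import deque


def scale_kernel(x):
    # Hypothetical kernel: any pure, per-element function could be used here.
    return 2 * x + 1


def run_replicated(kernel, stream, replicas):
    """Software model of kernel replication with FIFO interconnect.

    Elements are dealt round-robin to `replicas` input FIFOs, each replica
    applies `kernel` to its own FIFO, and results are read back round-robin
    so the output keeps the original stream order.
    """
    in_fifos = [deque() for _ in range(replicas)]
    out_fifos = [deque() for _ in range(replicas)]

    # Distributor: deal stream elements to the replicas in round-robin order.
    count = 0
    for element in stream:
        in_fifos[count % replicas].append(element)
        count += 1

    # Replicas: in hardware these would run concurrently; here they run in turn.
    for fifo_in, fifo_out in zip(in_fifos, out_fifos):
        while fifo_in:
            fifo_out.append(kernel(fifo_in.popleft()))

    # Collector: read results back in the same round-robin order.
    return [out_fifos[i % replicas].popleft() for i in range(count)]


if __name__ == "__main__":
    print(run_replicated(scale_kernel, range(10), replicas=4))
```

Because the kernel has no state shared between elements, the replicated result matches the unreplicated one for any number of replicas. That independence is the property that lets a data-parallel kernel be duplicated into parallel units.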


Cited By

  • (2020) A Survey on Parallel Architectures and Programming Models. 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO), 999-1005. DOI: 10.23919/MIPRO48935.2020.9245341. Online publication date: 28-Sep-2020
  • (2016) Automated Real-Time Analysis of Streaming Big and Dense Data on Reconfigurable Platforms. ACM Transactions on Reconfigurable Technology and Systems 10:1, 1-22. DOI: 10.1145/2974023. Online publication date: 19-Dec-2016
  • (2015) Loop coarsening in C-based High-Level Synthesis. 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 166-173. DOI: 10.1109/ASAP.2015.7245730. Online publication date: Jul-2015

Published In

ACM Transactions on Reconfigurable Technology and Systems  Volume 6, Issue 4
December 2013
89 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/2558905
Editor: Steve Wilton
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 2013
Accepted: 01 March 2013
Revised: 01 December 2012
Received: 01 August 2012
Published in TRETS Volume 6, Issue 4

Author Tags

  1. Data parallelism
  2. behavioral synthesis
  3. field-programmable gate arrays
  4. high-level synthesis
  5. parallel reduction
  6. replication
  7. scalability
  8. streaming
  9. task parallelism
  10. throughput

Qualifiers

  • Research-article
  • Research
  • Refereed
