research-article

Exploiting Stable Data Dependency in Stream Processing Acceleration on FPGAs

Authors:

Yu Wang

ACM Transactions on Embedded Computing Systems (TECS), Volume 16, Issue 4

Article No.: 116, Pages 1 - 26

https://doi.org/10.1145/3092950

Published: 13 July 2017

Abstract

With the unique feature of fine-grained parallelism, field-programmable gate arrays (FPGAs) show great potential for streaming algorithm acceleration. However, the lack of a design framework, restrictions on FPGAs, and ineffective tools impede the utilization of FPGAs in practice. In this study, we provide a design paradigm to support streaming algorithm acceleration on FPGAs. We first propose an abstract model to describe streaming algorithms with homogeneous sub-functions (HSF) and stable data dependency (SDD), which we call the HSF-SDD model. Using this model, we then develop an FPGA framework, PE-Ring, that has the advantages of (1) fully exploiting algorithm parallelism to achieve high performance, (2) leveraging block RAM to serve large scale parameters, and (3) enabling flexible parameter adjustments. Based on the proposed model and framework, we finally implement a specific converter to generate the register-transfer level representation of the PE-Ring. Experimental results show that our method outperforms ordinary FPGA design tools by one to two orders of magnitude. Experiments also demonstrate the scalability of the PE-Ring.
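
The abstract does not define the HSF-SDD model in detail, so the following is only an illustrative software sketch of the two properties its name suggests, not the authors' model or the PE-Ring design. It assumes dynamic time warping (DTW) as the example streaming algorithm: every pattern position runs the same sub-function (homogeneous), and each one always reads the same three neighbouring values (a stable dependency pattern). The function name `dtw_stream` and the choice of DTW are assumptions made for illustration.

```python
from typing import List


def dtw_stream(pattern: List[float], stream: List[float]) -> List[float]:
    """After each stream element, return the DTW cost of aligning the
    stream prefix seen so far with the whole pattern.

    Hypothetical illustration: slot j plays the role of one processing
    element holding pattern[j] as its local parameter.
    """
    inf = float("inf")
    m = len(pattern)
    prev = [inf] * m  # outputs of all slots for the previous stream element
    results = []
    for i, x in enumerate(stream):
        cur = [inf] * m
        for j in range(m):
            # Fixed dependency set, identical for every slot and every step:
            left = cur[j - 1] if j > 0 else inf   # same step, previous slot
            up = prev[j]                          # previous step, same slot
            diag = prev[j - 1] if j > 0 else inf  # previous step, previous slot
            best = 0.0 if (i == 0 and j == 0) else min(left, up, diag)
            # The same sub-function is applied in every slot.
            cur[j] = abs(x - pattern[j]) + best
        prev = cur
        results.append(cur[-1])
    return results


if __name__ == "__main__":
    print(dtw_stream([1, 2, 3], [1, 2, 3]))
```

Because the dependency pattern never changes, a hardware version could in principle wire each unit to fixed neighbours and keep the pattern values in on-chip memory; how PE-Ring actually does this is described in the full article, not here.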


Index Terms

Exploiting Stable Data Dependency in Stream Processing Acceleration on FPGAs
1. Hardware
  1. Integrated circuits
    1. Reconfigurable logic and FPGAs
      1. Hardware accelerators
      2. Programmable logic elements

Published In

ACM Transactions on Embedded Computing Systems Volume 16, Issue 4

Special Issue on Secure and Fault-Tolerant Embedded Computing and Regular Papers

November 2017

614 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/3092956

Editor:
Sandeep K. Shukla
Indian Institute of Technology, India

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 13 July 2017

Accepted: 01 April 2017

Revised: 01 February 2017

Received: 01 October 2016

Published in TECS Volume 16, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

973 project
Huawei Technologies Co. Ltd
National Natural Science Foundation of China
Joint fund of Equipment pre-Research and Ministry of Education
