Exploiting HBM on FPGAs for Data Processing

Authors:

Christoph Hagleitner,

Dionysios Diamantopoulos,

Dimitris Syrivelis,

Gustavo Alonso

ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 4

Article No.: 36, Pages 1 - 27

https://doi.org/10.1145/3491238

Published: 09 December 2022

Abstract

Field Programmable Gate Arrays (FPGAs) are increasingly used in data centers and the cloud, both for their potential to accelerate certain workloads and for their architectural flexibility: they can serve as accelerators, smart-NICs, or stand-alone processors. To meet the challenges posed by these new use cases, FPGAs are quickly evolving in terms of their capabilities and organization. The integration of High Bandwidth Memory (HBM) into FPGA devices is one recent example of this trend. In this article, we study the potential of FPGAs equipped with HBM from a data analytics perspective. We implement three workloads common in analytics-oriented databases on an FPGA and show in which cases they benefit from HBM: range selection, hash join, and stochastic gradient descent (SGD) for linear model training. We integrate our designs into a columnar database (MonetDB) and show the trade-offs in data movement and partitioning that arise from the integration. We consider two possible configurations of the HBM: a single-clock design and a dual-clock design. With the right design, FPGA+HBM-based solutions surpass the highest performance provided by either a two-socket POWER9 system or a 14-core Xeon E5 by up to 5.9× (range selection), 18.3× (hash join), and 6.1× (SGD).
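For readers unfamiliar with the three workloads named in the abstract, the following is a minimal software sketch of what each one computes. It is a plain-Python reference of the operators' semantics only, not the FPGA designs described in the article; the function names, the inclusive range bounds, the squared-loss objective, and the learning-rate and epoch values are all assumptions made for illustration.

```python
from collections import defaultdict


def range_select(column, lo, hi):
    """Return the positions of all values v in the column with lo <= v <= hi."""
    return [i for i, v in enumerate(column) if lo <= v <= hi]


def hash_join(build_keys, probe_keys):
    """Equi-join two key columns; returns (build_index, probe_index) pairs."""
    # Build phase: map each key to the positions where it occurs.
    table = defaultdict(list)
    for i, k in enumerate(build_keys):
        table[k].append(i)
    # Probe phase: look up every probe key and emit one pair per match.
    return [(i, j) for j, k in enumerate(probe_keys) for i in table.get(k, [])]


def sgd_linear(samples, labels, lr=0.1, epochs=200):
    """Fit a linear model with per-sample SGD on the squared loss."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            # Prediction error for this sample, then one gradient step.
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


if __name__ == "__main__":
    print(range_select([5, 1, 9, 3, 7], 3, 7))
    print(hash_join([1, 2, 2], [2, 3, 1]))
    print(sgd_linear([[1, 0], [0, 1], [1, 1]], [2, 3, 5]))
```

All three operators make little computation per byte read, which is one plausible reason memory bandwidth matters for them; the article itself should be consulted for the actual analysis.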

Published In

ACM Transactions on Reconfigurable Technology and Systems Volume 15, Issue 4

December 2022

476 pages

ISSN:1936-7406

EISSN:1936-7414

DOI:10.1145/3540252

Editor:
Deming Chen
University of Illinois, Urbana-Champaign Urbana, USA

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 December 2022

Online AM: 09 February 2022

Accepted: 12 October 2021

Revised: 14 September 2021

Received: 30 June 2021

Published in TRETS Volume 15, Issue 4

Qualifiers

Research-article
Refereed
