Exploiting HBM on FPGAs for Data Processing

Published: 09 December 2022

Abstract

Field Programmable Gate Arrays (FPGAs) are increasingly used in data centers and the cloud, both because they can accelerate certain workloads and because of their architectural flexibility: they can serve as accelerators, smart NICs, or stand-alone processors. To meet the challenges posed by these new use cases, FPGAs are quickly evolving in their capabilities and organization. The use of High Bandwidth Memory (HBM) in FPGA devices is one recent example of this trend. In this article, we study the potential of FPGAs equipped with HBM from a data analytics perspective. We consider three workloads common in analytics-oriented databases: range selection, hash join, and stochastic gradient descent (SGD) for linear model training. We implement them on an FPGA and show in which cases they benefit from HBM. We integrate our designs into a columnar database (MonetDB) and show the trade-offs that arise from the integration with respect to data movement and partitioning. We consider two possible configurations of the HBM, a single-clock design and a dual-clock design. With the right design, FPGA+HBM-based solutions surpass the highest performance provided by either a two-socket POWER9 system or a 14-core Xeon E5 by up to 5.9× (range selection), 18.3× (hash join), and 6.1× (SGD).
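
For readers unfamiliar with the three workloads named in the abstract, the following is a minimal software sketch of each. It is not the article's FPGA design and says nothing about HBM usage. The function names, the inclusive range bounds, the squared-error loss, and the default learning rate and epoch count are all assumptions made for illustration.

```python
from collections import defaultdict


def range_select(column, low, high):
    """Return positions of values v with low <= v <= high (inclusive bounds assumed)."""
    return [i for i, v in enumerate(column) if low <= v <= high]


def hash_join(build_keys, probe_keys):
    """Equi-join: return (build_pos, probe_pos) pairs whose keys are equal."""
    table = defaultdict(list)
    for i, key in enumerate(build_keys):  # build phase: key -> positions
        table[key].append(i)
    # Probe phase: .get avoids inserting missing keys into the defaultdict.
    return [(i, j) for j, key in enumerate(probe_keys) for i in table.get(key, ())]


def sgd_linear(samples, labels, lr=0.1, epochs=200):
    """Fit weights w so that dot(w, x) approximates y, using per-sample SGD
    on the squared error (hypothetical hyper-parameters)."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w
```

All three functions stream over their inputs; the hash join additionally performs random accesses into its table during the probe phase.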

Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 4
December 2022
476 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3540252
Editor: Deming Chen

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 December 2022
Online AM: 09 February 2022
Accepted: 12 October 2021
Revised: 14 September 2021
Received: 30 June 2021
Published in TRETS Volume 15, Issue 4

Author Tags

  1. High bandwidth memory (HBM)
  2. FPGA
  3. database
  4. advanced analytics

Qualifiers

  • Research-article
  • Refereed

Cited By

  • (2024) A High-Performance Non-Indexed Text Search System. Electronics 13:11 (2125). DOI: 10.3390/electronics13112125. Online publication date: 29-May-2024
  • (2024) Development and Implementation of an FPGA-Embedded Multimedia Remote Monitoring System for Information Technology Server Room Management. International Journal of Digital Multimedia Broadcasting 2024. DOI: 10.1155/2024/4420578. Online publication date: 7-Mar-2024
  • (2024) Memory Sandbox: A Versatile Tool for Analyzing and Optimizing HBM Performance in FPGA. 2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 206-217. DOI: 10.1109/SBAC-PAD63648.2024.00026. Online publication date: 13-Nov-2024
  • (2024) SoGraph: A State-Aware Architecture for Out-of-Memory Graph Processing on HBM-Equipped FPGAs. 2024 34th International Conference on Field-Programmable Logic and Applications (FPL), 87-91. DOI: 10.1109/FPL64840.2024.00021. Online publication date: 2-Sep-2024
