Abstract
When field-programmable gate arrays (FPGAs) are used as HPC accelerators, their memory bandwidth presents a significant challenge because it is not comparable to that of other HPC accelerators. In this paper, we propose a memory system for HBM2-equipped FPGAs and HPC applications that uses block RAMs (BRAMs) as an addressable cache placed between HBM2 and an application. This architecture enables bulk data transfer between HBM2 and the cache and allows an application to exploit fast random access to the BRAMs. We present the implementation and performance evaluation of this memory system on an FPGA, and we describe the API used to control it from the host. We implement RISC-V cores in the FPGA as controllers to realize fine-grained data transfer control and to avoid overheads incurred by the PCI Express bus. The proposed system is implemented on eight memory channels and achieves a bandwidth of 102.7 GB/s. This exceeds the memory bandwidth of conventional FPGA boards with four channels of DDR4 memory, despite using only 8 of the 32 HBM2 channels.
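The addressable-cache idea can be illustrated with a small software model. The sketch below is not the authors' API or hardware design: the class name `AddressableCache`, its `load` and `store` methods, and the use of Python lists to stand in for HBM2 and BRAM are all assumptions made for illustration. It shows only the access pattern the abstract describes: data moves between the large memory and the cache in explicit bulk transfers, and the application then performs random accesses on the cache alone.

```python
class AddressableCache:
    """Toy model of a software-managed (addressable) cache.

    `backing` stands in for HBM2 and `data` for BRAM. Unlike a
    transparent cache, nothing is fetched implicitly: the caller
    decides what to load, where to place it, and when to write back.
    """

    def __init__(self, backing, capacity):
        self.backing = backing          # large, bulk-transfer memory
        self.data = [0] * capacity      # small, random-access memory

    def _check(self, mem_off, cache_off, count):
        # Reject transfers that fall outside either memory.
        if count < 0 or mem_off < 0 or cache_off < 0:
            raise IndexError("negative offset or count")
        if mem_off + count > len(self.backing):
            raise IndexError("transfer exceeds backing memory")
        if cache_off + count > len(self.data):
            raise IndexError("transfer exceeds cache capacity")

    def load(self, mem_off, cache_off, count):
        """Bulk copy backing[mem_off:...] into data[cache_off:...]."""
        self._check(mem_off, cache_off, count)
        self.data[cache_off:cache_off + count] = \
            self.backing[mem_off:mem_off + count]

    def store(self, cache_off, mem_off, count):
        """Bulk copy data[cache_off:...] back to backing[mem_off:...]."""
        self._check(mem_off, cache_off, count)
        self.backing[mem_off:mem_off + count] = \
            self.data[cache_off:cache_off + count]


# Usage: stage a block, access it randomly, then write it back in bulk.
hbm = [float(i) for i in range(16)]
cache = AddressableCache(hbm, capacity=8)
cache.load(mem_off=4, cache_off=0, count=8)      # cache holds 4.0 .. 11.0
cache.data[0], cache.data[7] = cache.data[7], cache.data[0]  # random access
cache.store(cache_off=0, mem_off=4, count=8)     # hbm[4] and hbm[11] swapped
```

In this model the application never touches `backing` directly, which mirrors the division of labor in the abstract: bulk transfers on the HBM2 side, fine-grained random access on the BRAM side.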
Acknowledgments
This work was supported by JSPS KAKENHI Grant Number 21H04869. We also thank the Intel University Program for providing hardware and software.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fujita, N., Kobayashi, R., Yamaguchi, Y., Boku, T. (2023). Implementation and Performance Evaluation of Memory System Using Addressable Cache for HPC Applications on HBM2 Equipped FPGAs. In: Singer, J., Elkhatib, Y., Blanco Heras, D., Diehl, P., Brown, N., Ilic, A. (eds) Euro-Par 2022: Parallel Processing Workshops. Euro-Par 2022. Lecture Notes in Computer Science, vol 13835. Springer, Cham. https://doi.org/10.1007/978-3-031-31209-0_9
DOI: https://doi.org/10.1007/978-3-031-31209-0_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-31208-3
Online ISBN: 978-3-031-31209-0
eBook Packages: Computer Science (R0)