Abstract:
A modern FPGA card can be equipped with high-bandwidth memory, such as HBM2. Since the memory capacity of a single FPGA is limited, highly parallel data processing across multiple FPGAs becomes crucial; such FPGAs can be tightly coupled through high-density optical integration, e.g., onboard Si-Photonics transceivers. This study presents a scalable distributed radix sorter and implements it on an eight-FPGA cluster. Each custom Stratix10 MX2100 FPGA card has 819-Gbps memory bandwidth with two HBM2 memories and 800-Gbps network bandwidth with eight custom embedded optical modules. Existing FPGA sorters typically rely on merge sort. However, merge sort has a severe performance bottleneck at the final stage of data merging, so it cannot make the best use of the high memory-to-memory bandwidth of the FPGA cluster. Instead, we implement a radix sort for a 32-bit key range, consisting of eight 4-bit counting sorts optimized for the memory-network structure. Each counting sort needs to read and write memory only once, through global and local pipelines. We demonstrated a sorting throughput of 37.2 GB/s.
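To illustrate how a 32-bit radix sort decomposes into eight 4-bit counting sorts, the following is a minimal single-machine software sketch, not the FPGA design described above. It assumes a least-significant-digit-first pass order, which the abstract does not state, and the function names `counting_sort_by_digit` and `radix_sort_u32` are hypothetical. It does not model the global and local pipelines, the HBM2 access pattern, or the distribution of keys across FPGAs; in particular, this version scans the keys twice per pass (histogram, then scatter).

```python
def counting_sort_by_digit(keys, shift, bits=4):
    """Stable counting sort of `keys` on the `bits`-wide digit starting at bit `shift`."""
    radix = 1 << bits          # 16 buckets for a 4-bit digit
    mask = radix - 1

    # Pass 1: histogram of digit values.
    counts = [0] * radix
    for k in keys:
        counts[(k >> shift) & mask] += 1

    # Exclusive prefix sum gives the first output index of each bucket.
    offsets = [0] * radix
    total = 0
    for d in range(radix):
        offsets[d] = total
        total += counts[d]

    # Pass 2: scatter keys to their buckets, preserving input order (stability).
    out = [0] * len(keys)
    for k in keys:
        d = (k >> shift) & mask
        out[offsets[d]] = k
        offsets[d] += 1
    return out


def radix_sort_u32(keys, bits=4, key_bits=32):
    """Sort unsigned 32-bit keys with key_bits // bits counting-sort passes.

    Assumption: least-significant digit first; with the defaults this is
    eight 4-bit passes.
    """
    keys = list(keys)
    for shift in range(0, key_bits, bits):
        keys = counting_sort_by_digit(keys, shift, bits)
    return keys


if __name__ == "__main__":
    sample = [0xDEADBEEF, 0x00000001, 0xFFFFFFFF, 0x12345678, 0x00000000]
    print([hex(k) for k in radix_sort_u32(sample)])
```

Because each counting-sort pass is stable, the ordering established by lower digits is preserved when higher digits are processed, which is what makes the sequence of eight passes yield a fully sorted result.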
Published in: 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
Date of Conference: 15-18 May 2022
Date Added to IEEE Xplore: 03 June 2022