Abstract
With the increasing amount of data, there is an urgent need for efficient sorting algorithms to process large data sets. Hardware sorting algorithms have attracted much attention because they can take advantage of different hardware’s parallelism. But the traditional hardware sort accelerators suffer “memory wall” problems since their multiple rounds of data transmission between the memory and the processor. In this paper, we utilize the in-situ processing ability of the ReRAM crossbar to design a new ReCAM array that can process the matrix-vector multiplication operation and the vector-scalar comparison in the same array simultaneously. Using this designed ReCAM array, we present ReCSA, which is the first dedicated ReCAM-based sort accelerator. Besides hardware designs, we also develop algorithms to maximize memory utilization and minimize memory exchanges to improve sorting performance. The sorting algorithm in ReCSA can process various data types, such as integer, float, double, and strings.
We also present experiments to evaluate the performance and energy efficiency against the state-of-the-art sort accelerators. The experimental results show that ReCSA has 90.92×, 46.13×, 27.38×, 84.57×, and 3.36× speedups against CPU-, GPU-, FPGA-, NDP-, and PIM-based platforms when processing numeric data sets. ReCSA also has 24.82×, 32.94×, and 18.22× performance improvement when processing string data sets compared with CPU-, GPU-, and FPGA-based platforms.
Similar content being viewed by others
References
Carlson D, Carin L. Continuing progress of spike sorting in the era of big data. Current Opinion in Neurobiology, 2019, 55: 90–96
Kuritzin A, Kischka T, Schmitz J, Churakov G. Incomplete lineage sorting and hybridization statistics for large-scale retroposon insertion data. PLoS Computational Biology, 2016, 12(3): e1004812
Heath L S, Vergara J P C. Sorting by short swaps. Current Opinion in Neurobiology, 2003, 10(5): 775–789
Tsuda N, Satoh T, Kawada T. A piepline sorting chip. In: Proceedings of IEEE International Solid-State Circuits Conference. 1987, 270–271
Casper J, Olukotun K. Hardware acceleration of database operations. In: Proceedings of 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2014, 151–160
Cole R. Parallel merge sort. SIAM Journal on Computing, 1988, 17(4): 770–785
Mueller R, Teubner J, Alonso G. Sorting networks on FPGAs. The VLDB Journal, 2012, 21(1): 1–23
Bentley J L, Sedgewick R. Fast algorithms for sorting and searching strings. In: Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms. 1997, 360–369
Arisland K Ø, Aasbø A C, Nundal A. VLSI parallel shift sort algorithm and design. Integration, 1984, 2(4): 331–347
Farmahini-Farahani A, Duwe III H J, Schulte M J, Compton K. Modular design of high-throughput, low-latency sorting units. IEEE Transactions on Computers, 2013, 62(7): 1389–1402
Govindaraju N, Gray J, Kumar R, Manocha D. GPUTeraSort: high performance graphics co-processor sorting for large database management. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. 2006, 325–336
Singh D P, Joshi I, Choudhary J. Survey of GPU based sorting algorithms. International Journal of Parallel Programming, 2018, 46(6): 1017–1034
Satish N, Harris M, Garland M. Designing efficient sorting algorithms for manycore GPUs. In: Proceedings of IEEE International Symposium on Parallel & Distributed Processing. 2009, 1–10
Satish N, Kim C, Chhugani J, Nguyen A D, Lee V W, Kim D, Dubey P. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. 2010, 351–362
Koch D, Torresen J. FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting. In: Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 2011, 45–54
Saitoh M, Elsayed E A, Van Chu T, Mashimo S, Kise K. A highperformance and cost-effective hardware merge sorter without feedback datapath. In: Proceedings of the 26th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. 2018, 197–204
Cho M, Brand D, Bordawekar R, Finkler U, Kulandaisamy V, Puri R. PARADIS: an efficient parallel algorithm for in-place radix sort. Proceedings of the VLDB Endowment, 2015, 8(2): 1518–1529
Minutoli M, Kuntz S K, Tumeo A, Kogge P. Implementing radix sort on emu 1. In: Proceedings of the 3rd Workshop NearData Process. 2015
Balasubramonian R, Chang J, Manning T, Moreno J H, Murphy R, Nair R, Swanson S. Near-data processing: insights from a MICRO-46 workshop. IEEE Micro, 2014, 34(4): 36–42
Zhu Q, Akin B, Sumbul H E, Sadi F, Hoe J C, Pileggi L, Franchetti F. A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In: Proceedings of IEEE International 3D Systems Integration Conference. 2013, 1–7
Akinaga H, Shima H. Resistive random access memory (ReRAM) based on metal oxides. Proceedings of the IEEE, 2010, 98(12): 2237–2251
Yavits L, Kvatinsky S, Morad A, Ginosar R. Resistive associative processor. IEEE Computer Architecture Letters, 2015, 14(2): 148–151
Imani M, Gupta S, Sharma S, Rosing T S. NVQuery: efficient query processing in nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019, 38(4): 628–639
Li H, Jin H, Zheng L, Liao X. ReSQM: accelerating database operations using ReRAM-based content addressable memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(11): 4030–4041
Leischner N, Osipov V, Sanders P. GPU sample sort. In: Proceedings of the IEEE International Symposium on Parallel & Distributed Processing. 2010, 1–10
Sintorn E, Assarsson U. Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, 2008, 68(10): 1381–1388
Cederman D, Tsigas P. GPU-Quicksort: a practical quicksort algorithm for graphics processors. ACM Journal of Experimental Algorithmics, 2009, 14: 4
Song W, Koch D, Luján M, Garside J. Parallel hardware merge sorter. In: Proceedings of Annual International Symposium on Field-Programmable Custom Computing Machines. 2016, 95–102
Samardzic N, Qiao W, Aggarwal V, Chang M C F, Cong J. Bonsai: high-performance adaptive merge tree sorting. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 282–294
Siegl P, Buchty R, Berekovic M. Data-centric computing frontiers: a survey on processing-in-memory. In: Proceedings of the 2nd International Symposium on Memory Systems. 2016, 295–308
Li Z, Challapalle N, Ramanathan A K, Narayanan V. IMC-Sort: inmemory parallel sorting architecture using hybrid memory cube. In: Proceedings of the 30th Great Lakes Symposium on VLSI. 2020, 45–50
Prasad A K, Rezaalipour M, Dehyadegari M, Bojnordi M N. Memristive data ranking. In: Proceedings of International Symposium on High-Performance Computer Architecture. 2021, 440–452
Wong H S P, Raoux S, Kim S, Liang J, Reifenberg J P, Rajendran B, Asheghi M, Goodson K E. Phase change memory. Proceedings of the IEEE, 2010, 98(12): 2201–2227
Wang K L, Alzate J G, Amiri P K. Low-power non-volatile spintronic memory: STT-RAM and beyond. Journal of Physics D: Applied Physics, 2013, 46(7): 074003
Li J, Montoye R K, Ishii M, Chang L. 1 Mb 0. 41 µm2 2T-2R cell nonvolatile TCAM with two-bit encoding and clocked self-referenced sensing. IEEE Journal of Solid-State Circuits, 2014, 49(4): 896–907
Chang M F, Huang L Y, Lin W Z, Chiang Y N, Kuo C C, Chuang C H, Yang K H, Tsai H J, Chen T F, Sheu S S. A ReRAM-based 4T2R nonvolatile TCAM using RC-filtered stress-decoupled scheme for frequent-OFF instant-ON search engines used in IoT and big-data processing. IEEE Journal of Solid-State Circuits, 2016, 51(11): 2786–2798
Matsunaga S, Katsumata A, Natsui M, Fukami S, Endoh T, Ohno H, Hanyu T. Fully parallel 6T-2MTJ nonvolatile TCAM with single-transistor-based self match-line discharge control. In: Proceedings of Symposium on VLSI Circuits — Digest of Technical Papers. 2011, 298–299
Zhao L, Deng Q, Zhang Y, Yang J. RFAcc: a 3D ReRAM Associative array based random forest accelerator. In: Proceedings of the ACM International Conference on Supercomputing. 2019, 473–483
Huangfu W, Li S, Hu X, Xie Y. RADAR: a 3D-ReRAM based DNA alignment accelerator architecture. In: Proceedings of the 55th ACM/ESDA/IEEE Design Automation Conference (DAC). 2018, 1–6
Neelima B, Shamsundar B, Narayan A, Prabhu R, Gomes C. Kepler GPU accelerated recursive sorting using dynamic parallelism. Concurrency and Computation: Practice and Experience, 2017, 29(4): e3865
Asiatici M, Maiorano D, Ienne P. FPGAs in the datacenters: the case of parallel hybrid super scalar string sample sort. In: Proceedings of the 31st IEEE International Conference on Application-specific Systems, Architectures and Processors. 2020, 133–140
Pawlowski J T. Hybrid memory cube (HMC). In: Proceedings of IEEE Hot Chips 23 Symposium. 2011, 1–24
Kim J, Kim Y. HBM: Memory solution for bandwidth-hungry processors. In: Proceedings of IEEE Hot Chips 26 Symposium. 2014, 1–24
Sinha R, Zobel J, Ring D. Cache-efficient string sorting using copying. ACM Journal of Experimental Algorithmics, 2006, 11(1.2): 1–32
Yavits L, Morad A, Ginosar R. Computer architecture with associative processor replacing last-level cache and SIMD accelerator. IEEE Transactions on Computers, 2015, 64(2): 368–381
Lien Y C. A 4.5-mW 8-b 750-MS/s 2-b/step asynchronous subranged SAR ADC in 28-nm CMOS technology. In: Proceedings of Symposium on VLSI Circuits. 2012, 88–89
Niu D, Xu C, Muralimanohar N, Jouppi N P, Xie Y. Design of cross-point metal-oxide ReRAM emphasizing reliability and cost. In: Proceedings of IEEE/ACM International Conference on Computer-Aided Design. 2013, 17–23
Stehle E, Jacobsen H A. A memory bandwidth-efficient hybrid radix sort on GPUs. In: Proceedings of the 2017 ACM International Conference on Management of Data. 2017, 417–432
David H, Gorbatov E, Hanebutte U R, Khanna R, Le C. RAPL: memory power estimation and capping. In: Proceedings of ACM/IEEE International Symposium on Low-Power Electronics and Design. 2010, 189–194
Deshpande A, Narayanan P J. Can GPUs sort strings efficiently? In: Proceedings of the 20th Annual International Conference on High Performance Computing. 2013, 305–313
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61832006, 62072195, and 61825202).
Author information
Authors and Affiliations
Corresponding author
Additional information
Huize Li is currently a PhD student in the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), China. His current research interests include computer architecture and emerging non-volatile memory.
Hai Jin is a Chair Professor of computer science and engineering at Huazhong University of Science and Technology (HUST), China. Jin received his PhD in computer engineering from HUST in 1994. In 1996, he was awarded a German Academic Exchange Service fellowship to visit the Technical University of Chemnitz, Germany. Jin worked at The University of Hong Kong, China between 1998 and 2000, and as a visiting scholar at the University of Southern California, USA between 1999 and 2000. He was awarded Excellent Youth Award from the National Science Foundation of China in 2001. Jin is a Fellow of IEEE, Fellow of CCF, and a life member of the ACM. He has co-authored more than 20 books and published over 900 research papers. His research interests include computer architecture, parallel and distributed computing, big data processing, data storage, and system security.
Long Zheng is now an associate professor in the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), China. He received his PhD degree in computer engineering from HUST, China in 2016. His current research interests include program analysis, runtime systems, and configurable computer architecture with a particular focus on graph processing.
Yu Huang received the BS degree from Huazhong University of Science and Technology (HUST), China in 2016. He is now working toward the PhD degree in the School of Computer Science and Technology, HUST, China. His research interests focus on processing-in-memory architecture and graph processing.
Xiaofei Liao received his PhD degree in computer science and engineering from Huazhong University of Science and Technology (HUST), China in 2005. He is now a Professor in the School of Computer Science and Technology at HUST, China. He has served as a reviewer for many conferences and journal papers. His research interests are in the areas of system software, P2P system, cluster computing and streaming services. He is a member of IEEE and the IEEE Computer Society.
Electronic Supplementary Material
Rights and permissions
About this article
Cite this article
Li, H., Jin, H., Zheng, L. et al. ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory. Front. Comput. Sci. 17, 172103 (2023). https://doi.org/10.1007/s11704-022-1322-3
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11704-022-1322-3