ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory

Li, Huize; Jin, Hai; Zheng, Long; Huang, Yu; Liao, Xiaofei

doi:10.1007/s11704-022-1322-3

Research Article
Published: 08 August 2022

Frontiers of Computer Science, Volume 17, article number 172103 (2023)

Abstract

As data volumes grow, efficient sorting of large data sets becomes increasingly important. Hardware sorters have attracted much attention because they can exploit the parallelism of the underlying hardware. However, traditional hardware sort accelerators suffer from the “memory wall” problem because they require multiple rounds of data transfer between the memory and the processor. In this paper, we use the in-situ processing ability of the ReRAM crossbar to design a new ReCAM array that can perform matrix-vector multiplication and vector-scalar comparison in the same array simultaneously. Using this ReCAM array, we present ReCSA, the first dedicated ReCAM-based sort accelerator. Besides the hardware design, we also develop algorithms that maximize memory utilization and minimize memory exchanges to improve sorting performance. The sorting algorithm in ReCSA can process various data types, including integers, floats, doubles, and strings.

We also evaluate the performance and energy efficiency of ReCSA against state-of-the-art sort accelerators. The experimental results show that ReCSA achieves speedups of 90.92×, 46.13×, 27.38×, 84.57×, and 3.36× over CPU-, GPU-, FPGA-, NDP-, and PIM-based platforms, respectively, when processing numeric data sets. When processing string data sets, ReCSA achieves performance improvements of 24.82×, 32.94×, and 18.22× over CPU-, GPU-, and FPGA-based platforms, respectively.
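The abstract does not spell out the sorting algorithm itself, so the following is only a rough software illustration of how a vector-scalar comparison primitive can drive a sort. It is a plain rank sort and is not claimed to be the ReCSA algorithm. The names `cam_compare_less` and `rank_sort` are hypothetical, and the sequential loop merely stands in for a comparison that a content addressable memory would evaluate across all stored rows at once.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def cam_compare_less(stored: Sequence[T], key: T) -> List[bool]:
    """Model one vector-scalar comparison (hypothetical helper).

    Returns one match bit per stored row: True where row < key.
    A CAM would produce all bits in a single search; here we loop.
    """
    return [row < key for row in stored]


def rank_sort(keys: Sequence[T]) -> List[T]:
    """Sort by computing each key's rank from comparison results.

    This is an illustrative rank sort, assumed for the example only.
    """
    out: List[T] = list(keys)  # placeholder contents, fully overwritten
    for i, key in enumerate(keys):
        # Rank = number of stored keys strictly smaller than this key.
        rank = sum(cam_compare_less(keys, key))
        # Tie-break duplicates by input position so ranks stay unique.
        rank += sum(1 for j in range(i) if keys[j] == key)
        out[rank] = key
    return out


if __name__ == "__main__":
    print(rank_sort([3, 1, 2, 3, 0]))
    print(rank_sort(["pear", "apple", "fig"]))
```

Because the sketch relies only on the `<` and `==` operators, the same function handles integers, floating-point values, and strings, which mirrors the range of data types the abstract mentions. It performs one comparison pass per key, so its value here is conceptual rather than as a performance model.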

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61832006, 62072195, and 61825202).

Author information

Authors and Affiliations

National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Clusters and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
Huize Li, Hai Jin, Long Zheng, Yu Huang & Xiaofei Liao

Corresponding author

Correspondence to Long Zheng.

Additional information

Huize Li is currently a PhD student in the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), China. His current research interests include computer architecture and emerging non-volatile memory.

Hai Jin is a Chair Professor of computer science and engineering at Huazhong University of Science and Technology (HUST), China. Jin received his PhD in computer engineering from HUST in 1994. In 1996, he was awarded a German Academic Exchange Service fellowship to visit the Technical University of Chemnitz, Germany. Jin worked at The University of Hong Kong, China between 1998 and 2000, and was a visiting scholar at the University of Southern California, USA between 1999 and 2000. He was awarded the Excellent Youth Award from the National Science Foundation of China in 2001. Jin is a Fellow of IEEE, a Fellow of CCF, and a life member of the ACM. He has co-authored more than 20 books and published over 900 research papers. His research interests include computer architecture, parallel and distributed computing, big data processing, data storage, and system security.

Long Zheng is now an associate professor in the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), China. He received his PhD degree in computer engineering from HUST, China in 2016. His current research interests include program analysis, runtime systems, and configurable computer architecture with a particular focus on graph processing.

Yu Huang received the BS degree from Huazhong University of Science and Technology (HUST), China in 2016. He is now working toward the PhD degree in the School of Computer Science and Technology, HUST, China. His research interests focus on processing-in-memory architecture and graph processing.

Xiaofei Liao received his PhD degree in computer science and engineering from Huazhong University of Science and Technology (HUST), China in 2005. He is now a Professor in the School of Computer Science and Technology at HUST, China. He has served as a reviewer for many conferences and journals. His research interests are in the areas of system software, P2P systems, cluster computing, and streaming services. He is a member of IEEE and the IEEE Computer Society.
