Skip to main content
Log in

ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory

  • Research Article
  • Published:
Frontiers of Computer Science Aims and scope Submit manuscript

Abstract

With the increasing amount of data, there is an urgent need for efficient sorting algorithms to process large data sets. Hardware sorting algorithms have attracted much attention because they can take advantage of different hardware’s parallelism. But the traditional hardware sort accelerators suffer “memory wall” problems since their multiple rounds of data transmission between the memory and the processor. In this paper, we utilize the in-situ processing ability of the ReRAM crossbar to design a new ReCAM array that can process the matrix-vector multiplication operation and the vector-scalar comparison in the same array simultaneously. Using this designed ReCAM array, we present ReCSA, which is the first dedicated ReCAM-based sort accelerator. Besides hardware designs, we also develop algorithms to maximize memory utilization and minimize memory exchanges to improve sorting performance. The sorting algorithm in ReCSA can process various data types, such as integer, float, double, and strings.

We also present experiments to evaluate the performance and energy efficiency against the state-of-the-art sort accelerators. The experimental results show that ReCSA has 90.92×, 46.13×, 27.38×, 84.57×, and 3.36× speedups against CPU-, GPU-, FPGA-, NDP-, and PIM-based platforms when processing numeric data sets. ReCSA also has 24.82×, 32.94×, and 18.22× performance improvement when processing string data sets compared with CPU-, GPU-, and FPGA-based platforms.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Carlson D, Carin L. Continuing progress of spike sorting in the era of big data. Current Opinion in Neurobiology, 2019, 55: 90–96

    Article  Google Scholar 

  2. Kuritzin A, Kischka T, Schmitz J, Churakov G. Incomplete lineage sorting and hybridization statistics for large-scale retroposon insertion data. PLoS Computational Biology, 2016, 12(3): e1004812

    Article  Google Scholar 

  3. Heath L S, Vergara J P C. Sorting by short swaps. Current Opinion in Neurobiology, 2003, 10(5): 775–789

    Google Scholar 

  4. Tsuda N, Satoh T, Kawada T. A piepline sorting chip. In: Proceedings of IEEE International Solid-State Circuits Conference. 1987, 270–271

  5. Casper J, Olukotun K. Hardware acceleration of database operations. In: Proceedings of 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2014, 151–160

  6. Cole R. Parallel merge sort. SIAM Journal on Computing, 1988, 17(4): 770–785

    Article  MathSciNet  Google Scholar 

  7. Mueller R, Teubner J, Alonso G. Sorting networks on FPGAs. The VLDB Journal, 2012, 21(1): 1–23

    Article  Google Scholar 

  8. Bentley J L, Sedgewick R. Fast algorithms for sorting and searching strings. In: Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms. 1997, 360–369

  9. Arisland K Ø, Aasbø A C, Nundal A. VLSI parallel shift sort algorithm and design. Integration, 1984, 2(4): 331–347

    Article  Google Scholar 

  10. Farmahini-Farahani A, Duwe III H J, Schulte M J, Compton K. Modular design of high-throughput, low-latency sorting units. IEEE Transactions on Computers, 2013, 62(7): 1389–1402

    Article  MathSciNet  Google Scholar 

  11. Govindaraju N, Gray J, Kumar R, Manocha D. GPUTeraSort: high performance graphics co-processor sorting for large database management. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. 2006, 325–336

  12. Singh D P, Joshi I, Choudhary J. Survey of GPU based sorting algorithms. International Journal of Parallel Programming, 2018, 46(6): 1017–1034

    Article  Google Scholar 

  13. Satish N, Harris M, Garland M. Designing efficient sorting algorithms for manycore GPUs. In: Proceedings of IEEE International Symposium on Parallel & Distributed Processing. 2009, 1–10

  14. Satish N, Kim C, Chhugani J, Nguyen A D, Lee V W, Kim D, Dubey P. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. 2010, 351–362

  15. Koch D, Torresen J. FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting. In: Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 2011, 45–54

  16. Saitoh M, Elsayed E A, Van Chu T, Mashimo S, Kise K. A highperformance and cost-effective hardware merge sorter without feedback datapath. In: Proceedings of the 26th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. 2018, 197–204

  17. Cho M, Brand D, Bordawekar R, Finkler U, Kulandaisamy V, Puri R. PARADIS: an efficient parallel algorithm for in-place radix sort. Proceedings of the VLDB Endowment, 2015, 8(2): 1518–1529

    Article  Google Scholar 

  18. Minutoli M, Kuntz S K, Tumeo A, Kogge P. Implementing radix sort on emu 1. In: Proceedings of the 3rd Workshop NearData Process. 2015

  19. Balasubramonian R, Chang J, Manning T, Moreno J H, Murphy R, Nair R, Swanson S. Near-data processing: insights from a MICRO-46 workshop. IEEE Micro, 2014, 34(4): 36–42

    Article  Google Scholar 

  20. Zhu Q, Akin B, Sumbul H E, Sadi F, Hoe J C, Pileggi L, Franchetti F. A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In: Proceedings of IEEE International 3D Systems Integration Conference. 2013, 1–7

  21. Akinaga H, Shima H. Resistive random access memory (ReRAM) based on metal oxides. Proceedings of the IEEE, 2010, 98(12): 2237–2251

    Article  Google Scholar 

  22. Yavits L, Kvatinsky S, Morad A, Ginosar R. Resistive associative processor. IEEE Computer Architecture Letters, 2015, 14(2): 148–151

    Article  Google Scholar 

  23. Imani M, Gupta S, Sharma S, Rosing T S. NVQuery: efficient query processing in nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019, 38(4): 628–639

    Article  Google Scholar 

  24. Li H, Jin H, Zheng L, Liao X. ReSQM: accelerating database operations using ReRAM-based content addressable memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(11): 4030–4041

    Article  Google Scholar 

  25. Leischner N, Osipov V, Sanders P. GPU sample sort. In: Proceedings of the IEEE International Symposium on Parallel & Distributed Processing. 2010, 1–10

  26. Sintorn E, Assarsson U. Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, 2008, 68(10): 1381–1388

    Article  Google Scholar 

  27. Cederman D, Tsigas P. GPU-Quicksort: a practical quicksort algorithm for graphics processors. ACM Journal of Experimental Algorithmics, 2009, 14: 4

    Article  MathSciNet  Google Scholar 

  28. Song W, Koch D, Luján M, Garside J. Parallel hardware merge sorter. In: Proceedings of Annual International Symposium on Field-Programmable Custom Computing Machines. 2016, 95–102

  29. Samardzic N, Qiao W, Aggarwal V, Chang M C F, Cong J. Bonsai: high-performance adaptive merge tree sorting. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 282–294

  30. Siegl P, Buchty R, Berekovic M. Data-centric computing frontiers: a survey on processing-in-memory. In: Proceedings of the 2nd International Symposium on Memory Systems. 2016, 295–308

  31. Li Z, Challapalle N, Ramanathan A K, Narayanan V. IMC-Sort: inmemory parallel sorting architecture using hybrid memory cube. In: Proceedings of the 30th Great Lakes Symposium on VLSI. 2020, 45–50

  32. Prasad A K, Rezaalipour M, Dehyadegari M, Bojnordi M N. Memristive data ranking. In: Proceedings of International Symposium on High-Performance Computer Architecture. 2021, 440–452

  33. Wong H S P, Raoux S, Kim S, Liang J, Reifenberg J P, Rajendran B, Asheghi M, Goodson K E. Phase change memory. Proceedings of the IEEE, 2010, 98(12): 2201–2227

    Article  Google Scholar 

  34. Wang K L, Alzate J G, Amiri P K. Low-power non-volatile spintronic memory: STT-RAM and beyond. Journal of Physics D: Applied Physics, 2013, 46(7): 074003

    Article  Google Scholar 

  35. Li J, Montoye R K, Ishii M, Chang L. 1 Mb 0. 41 µm2 2T-2R cell nonvolatile TCAM with two-bit encoding and clocked self-referenced sensing. IEEE Journal of Solid-State Circuits, 2014, 49(4): 896–907

    Article  Google Scholar 

  36. Chang M F, Huang L Y, Lin W Z, Chiang Y N, Kuo C C, Chuang C H, Yang K H, Tsai H J, Chen T F, Sheu S S. A ReRAM-based 4T2R nonvolatile TCAM using RC-filtered stress-decoupled scheme for frequent-OFF instant-ON search engines used in IoT and big-data processing. IEEE Journal of Solid-State Circuits, 2016, 51(11): 2786–2798

    Article  Google Scholar 

  37. Matsunaga S, Katsumata A, Natsui M, Fukami S, Endoh T, Ohno H, Hanyu T. Fully parallel 6T-2MTJ nonvolatile TCAM with single-transistor-based self match-line discharge control. In: Proceedings of Symposium on VLSI Circuits — Digest of Technical Papers. 2011, 298–299

  38. Zhao L, Deng Q, Zhang Y, Yang J. RFAcc: a 3D ReRAM Associative array based random forest accelerator. In: Proceedings of the ACM International Conference on Supercomputing. 2019, 473–483

  39. Huangfu W, Li S, Hu X, Xie Y. RADAR: a 3D-ReRAM based DNA alignment accelerator architecture. In: Proceedings of the 55th ACM/ESDA/IEEE Design Automation Conference (DAC). 2018, 1–6

  40. Neelima B, Shamsundar B, Narayan A, Prabhu R, Gomes C. Kepler GPU accelerated recursive sorting using dynamic parallelism. Concurrency and Computation: Practice and Experience, 2017, 29(4): e3865

    Article  Google Scholar 

  41. Asiatici M, Maiorano D, Ienne P. FPGAs in the datacenters: the case of parallel hybrid super scalar string sample sort. In: Proceedings of the 31st IEEE International Conference on Application-specific Systems, Architectures and Processors. 2020, 133–140

  42. Pawlowski J T. Hybrid memory cube (HMC). In: Proceedings of IEEE Hot Chips 23 Symposium. 2011, 1–24

  43. Kim J, Kim Y. HBM: Memory solution for bandwidth-hungry processors. In: Proceedings of IEEE Hot Chips 26 Symposium. 2014, 1–24

  44. Sinha R, Zobel J, Ring D. Cache-efficient string sorting using copying. ACM Journal of Experimental Algorithmics, 2006, 11(1.2): 1–32

    MathSciNet  MATH  Google Scholar 

  45. Yavits L, Morad A, Ginosar R. Computer architecture with associative processor replacing last-level cache and SIMD accelerator. IEEE Transactions on Computers, 2015, 64(2): 368–381

    Article  MathSciNet  Google Scholar 

  46. Lien Y C. A 4.5-mW 8-b 750-MS/s 2-b/step asynchronous subranged SAR ADC in 28-nm CMOS technology. In: Proceedings of Symposium on VLSI Circuits. 2012, 88–89

  47. Niu D, Xu C, Muralimanohar N, Jouppi N P, Xie Y. Design of cross-point metal-oxide ReRAM emphasizing reliability and cost. In: Proceedings of IEEE/ACM International Conference on Computer-Aided Design. 2013, 17–23

  48. Stehle E, Jacobsen H A. A memory bandwidth-efficient hybrid radix sort on GPUs. In: Proceedings of the 2017 ACM International Conference on Management of Data. 2017, 417–432

  49. David H, Gorbatov E, Hanebutte U R, Khanna R, Le C. RAPL: memory power estimation and capping. In: Proceedings of ACM/IEEE International Symposium on Low-Power Electronics and Design. 2010, 189–194

  50. Deshpande A, Narayanan P J. Can GPUs sort strings efficiently? In: Proceedings of the 20th Annual International Conference on High Performance Computing. 2013, 305–313

Download references

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61832006, 62072195, and 61825202).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Long Zheng.

Additional information

Huize Li is currently a PhD student in the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), China. His current research interests include computer architecture and emerging non-volatile memory.

Hai Jin is a Chair Professor of computer science and engineering at Huazhong University of Science and Technology (HUST), China. Jin received his PhD in computer engineering from HUST in 1994. In 1996, he was awarded a German Academic Exchange Service fellowship to visit the Technical University of Chemnitz, Germany. Jin worked at The University of Hong Kong, China between 1998 and 2000, and as a visiting scholar at the University of Southern California, USA between 1999 and 2000. He was awarded Excellent Youth Award from the National Science Foundation of China in 2001. Jin is a Fellow of IEEE, Fellow of CCF, and a life member of the ACM. He has co-authored more than 20 books and published over 900 research papers. His research interests include computer architecture, parallel and distributed computing, big data processing, data storage, and system security.

Long Zheng is now an associate professor in the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), China. He received his PhD degree in computer engineering from HUST, China in 2016. His current research interests include program analysis, runtime systems, and configurable computer architecture with a particular focus on graph processing.

Yu Huang received the BS degree from Huazhong University of Science and Technology (HUST), China in 2016. He is now working toward the PhD degree in the School of Computer Science and Technology, HUST, China. His research interests focus on processing-in-memory architecture and graph processing.

Xiaofei Liao received his PhD degree in computer science and engineering from Huazhong University of Science and Technology (HUST), China in 2005. He is now a Professor in the School of Computer Science and Technology at HUST, China. He has served as a reviewer for many conferences and journal papers. His research interests are in the areas of system software, P2P system, cluster computing and streaming services. He is a member of IEEE and the IEEE Computer Society.

Electronic Supplementary Material

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Li, H., Jin, H., Zheng, L. et al. ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory. Front. Comput. Sci. 17, 172103 (2023). https://doi.org/10.1007/s11704-022-1322-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11704-022-1322-3

Keywords

Navigation