ARCHER: a ReRAM-based accelerator for compressed recommendation systems

  • Research Article
  • Published in: Frontiers of Computer Science

Abstract

Recommendation systems are ubiquitous in modern data centers. Random, sparse embedding lookups are the main performance bottleneck when these systems run on traditional platforms, because the lookups induce abundant data movement between computing units and memory. ReRAM-based processing-in-memory (PIM) can resolve this problem by processing embedding vectors where they are stored. However, the embedding table can easily exceed the capacity of a monolithic ReRAM-based PIM chip, and the resulting off-chip accesses may offset the benefits of PIM. Compressing the embedding table through tensor decomposition shrinks it enough to fit on-chip, at the price of extra computation to reconstruct each embedding vector; we therefore deploy the decomposed model on-chip and leverage the high computing efficiency of ReRAM to compensate for this decompression overhead. In this paper, we propose ARCHER, a ReRAM-based PIM architecture that implements fully on-chip recommendation under resource constraints. First, we analyze the computation and access patterns of the decomposed table. Based on the computation pattern, we unify the operations of every layer of the decomposed model as multiply-and-accumulate (MAC) operations. Based on the access pattern, we propose a hierarchical mapping scheme and a specialized hardware design that maximize resource utilization. Under this unified computation and mapping strategy, we coordinate the pipeline across processing elements. Our evaluation shows that ARCHER outperforms the state-of-the-art GPU-based DLRM system, the state-of-the-art near-memory processing recommendation system RecNMP, and the ReRAM-based recommendation accelerator REREC by 15.79×, 2.21×, and 1.21× in performance, and by 56.06×, 6.45×, and 1.71× in energy savings, respectively.
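
To make the decompression step concrete: the "decomposed table" refers to a tensor-train (TT) compressed embedding table in the style of TT-Rec [21, 31]. The NumPy sketch below is illustrative only, not ARCHER's implementation; all sizes and names are hypothetical. It shows how one embedding row is reconstructed from small TT cores through a chain of matrix products, i.e., exactly the multiply-and-accumulate work that maps naturally onto ReRAM crossbars.

    import numpy as np

    def tt_embedding_lookup(cores, idx, n_factors):
        """Reconstruct one embedding row from its tensor-train (TT) cores.

        cores[k] has shape (r_{k-1}, n_k, d_k, r_k) with r_0 = r_K = 1, so a
        table with N = prod(n_k) rows and D = prod(d_k) columns is stored in
        sum_k r_{k-1}*n_k*d_k*r_k parameters instead of N*D.
        """
        # Mixed-radix digits of the flat row index, most-significant first.
        digits = []
        for n_k in reversed(n_factors):
            digits.append(idx % n_k)
            idx //= n_k
        digits.reverse()

        # Rebuild the row by a chain of small matrix products: pure
        # multiply-and-accumulate work.
        row = np.ones((1, 1))                      # shape (1, r_0), r_0 = 1
        for core, i_k in zip(cores, digits):
            slab = core[:, i_k, :, :]              # (r_{k-1}, d_k, r_k)
            r_prev, d_k, r_k = slab.shape
            row = (row @ slab.reshape(r_prev, d_k * r_k)).reshape(-1, r_k)
        return row.ravel()                         # length D = prod(d_k)

    # Hypothetical sizes: a 1,000,000 x 64 table held in three rank-16 cores,
    # roughly 0.5 MB of fp32 parameters instead of roughly 256 MB uncompressed.
    n_factors, d_factors, rank = (100, 100, 100), (4, 4, 4), 16
    ranks = (1, rank, rank, 1)
    cores = [np.random.randn(ranks[k], n_factors[k], d_factors[k], ranks[k + 1])
             for k in range(3)]
    vec = tt_embedding_lookup(cores, idx=123_456, n_factors=n_factors)
    assert vec.shape == (64,)

Each per-core product above is a small dense matrix multiplication, which is why unifying every layer of the decomposed model around MAC operations lets ReRAM crossbars execute the whole lookup in place.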


References

  1. Ke L, Gupta U, Cho B Y, Brooks D, Chandra V, Diril U, Firoozshahian A, Hazelwood K, Jia B, Lee H H S, Li M, Maher B, Mudigere D, Naumov M, Schatz M, Smelyanskiy M, Wang X, Reagen B, Wu C J, Hempstead M, Zhang X. RecNMP: accelerating personalized recommendation with near-memory processing. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 790–803

  2. Naumov M, Mudigere D, Shi H J M, Huang J, Sundaraman N, Park J, Wang X, Gupta U, Wu C J, Azzolini A G, Dzhulgakov D, Mallevich A, Cherniavskii I, Lu Y, Krishnamoorthi R, Yu A, Kondratenko V, Pereira S, Chen X, Chen W, Rao V, Jia B, Xiong L, Smelyanskiy M. Deep learning recommendation model for personalization and recommendation systems. 2019, arXiv preprint arXiv: 1906.00091

  3. Gupta U, Wu C J, Wang X, Naumov M, Reagen B, Brooks D, Cottel B, Hazelwood K, Hempstead M, Jia B, Lee H H S, Malevich A, Mudigere D, Smelyanskiy M, Xiong L, Zhang X. The architectural implications of Facebook’s DNN-based personalized recommendation. In: Proceedings of 2020 IEEE International Symposium on High Performance Computer Architecture. 2020, 488–501

  4. Wu J, He X, Wang X, Wang Q, Chen W, Lian J, Xie X. Graph convolution machine for context-aware recommender system. Frontiers of Computer Science, 2022, 16(6): 166614

  5. Guo H, Tang R, Ye Y, Li Z, He X, Dong Z. DeepFM: an end-to-end wide & deep learning framework for CTR prediction. 2018, arXiv preprint arXiv: 1804.04950

  6. Zhou G, Mou N, Fan Y, Pi Q, Bian W, Zhou C, Zhu X, Gai K. Deep interest evolution network for click-through rate prediction. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence. 2019, 5941–5948

  7. Hwang R, Kim T, Kwon Y, Rhu M. Centaur: a chiplet-based, hybrid sparse-dense accelerator for personalized recommendations. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 968–981

  8. Kal H, Lee S, Ko G, Ro W W. SPACE: locality-aware processing in heterogeneous memory for personalized recommendations. In: Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture. 2021, 679–691

  9. Shafiee A, Nag A, Muralimanohar N, Balasubramonian R, Strachan J P, Hu M, Williams R S, Srikumar V. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News, 2016, 44(3): 14–26

  10. Chi P, Li S, Xu C, Zhang T, Zhao J, Liu Y, Wang Y, Xie Y. PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. ACM SIGARCH Computer Architecture News, 2016, 44(3): 27–39

  11. Imani M, Gupta S, Kim Y, Rosing T. FloatPIM: in-memory acceleration of deep neural network training with high precision. In: Proceedings of the 46th ACM/IEEE Annual International Symposium on Computer Architecture. 2019, 802–815

  12. Song L, Zhuo Y, Qian X, Li H, Chen Y. GraphR: accelerating graph processing using ReRAM. In: Proceedings of 2018 IEEE International Symposium on High Performance Computer Architecture. 2018, 531–543

  13. Huang Y, Zheng L, Yao P, Zhao J, Liao X, Jin H, Xue J. A heterogeneous PIM hardware-software co-design for energy-efficient graph processing. In: Proceedings of 2020 IEEE International Parallel and Distributed Processing Symposium. 2020, 684–695

  14. Zheng L, Zhao J, Huang Y, Wang Q, Zeng Z, Xue J, Liao X, Jin H. Spara: an energy-efficient ReRAM-based accelerator for sparse graph analytics applications. In: Proceedings of 2020 IEEE International Parallel and Distributed Processing Symposium. 2020, 696–707

  15. Arka A I, Doppa J R, Pande P P, Joardar B K, Chakrabarty K. ReGraphX: NoC-enabled 3D heterogeneous ReRAM architecture for training graph neural networks. In: Proceedings of 2021 Design, Automation & Test in Europe Conference & Exhibition. 2021, 1667–1672

  16. Zha Y, Li J. Hyper-AP: enhancing associative processing through a full-stack optimization. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 846–859

  17. Imani M, Pampana S, Gupta S, Zhou M, Kim Y, Rosing T. DUAL: acceleration of clustering algorithms using digital-based processing in-memory. In: Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture. 2020, 356–371

  18. Niu D, Xu C, Muralimanohar N, Jouppi N P, Xie Y. Design of cross-point metal-oxide ReRAM emphasizing reliability and cost. In: Proceedings of 2013 IEEE/ACM International Conference on Computer-Aided Design. 2013, 17–23

  19. Wong H S P, Lee H Y, Yu S, Chen Y S, Wu Y, Chen P S, Lee B, Chen F T, Tsai M J. Metal–oxide RRAM. Proceedings of the IEEE, 2012, 100(6): 1951–1970

  20. Li H, Jin H, Zheng L, Huang Y, Liao X. ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory. Frontiers of Computer Science, 2023, 17(2): 172103

  21. Yin C, Acun B, Wu C J, Liu X. TT-Rec: tensor train compression for deep learning recommendation models. 2021, arXiv preprint arXiv: 2101.11714

  22. Hu M, Strachan J P, Li Z, Grafals E M, Davila N, Graves C, Lam S, Ge N, Yang J J, Williams R S. Dot-product engine for neuromorphic computing: programming 1T1M crossbar to accelerate matrix-vector multiplication. In: Proceedings of the 53rd ACM/EDAC/IEEE Design Automation Conference. 2016, 1–6

  23. Xu C, Niu D, Muralimanohar N, Balasubramonian R, Zhang T, Yu S, Xie Y. Overcoming the challenges of crossbar resistive memory architectures. In: Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture. 2015, 476–488

  24. Song L, Qian X, Li H, Chen Y. PipeLayer: a pipelined ReRAM-based accelerator for deep learning. In: Proceedings of 2017 IEEE International Symposium on High Performance Computer Architecture. 2017, 541–552

  25. Cai H, Liu B, Chen J, Naviner L, Zhou Y, Wang Z, Yang J. A survey of in-spin transfer torque MRAM computing. Science China Information Sciences, 2021, 64(6): 160402

  26. Luo Y, Wang P, Peng X, Sun X, Yu S. Benchmark of ferroelectric transistor-based hybrid precision synapse for neural network accelerator. IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, 2019, 5(2): 142–150

  27. Xia F, Jiang D J, Xiong J, Sun N H. A survey of phase change memory systems. Journal of Computer Science and Technology, 2015, 30(1): 121–144

  28. Gong N. Multi level cell (MLC) in 3D crosspoint phase change memory array. Science China Information Sciences, 2021, 64(6): 166401

  29. Weinberger K, Dasgupta A, Langford J, Smola A, Attenberg J. Feature hashing for large scale multitask learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. 2009, 1113–1120

  30. Guan H, Malevich A, Yang J, Park J, Yuen H. Post-training 4-bit quantization on embedding tables. 2019, arXiv preprint arXiv: 1911.02079

  31. Oseledets I V. Tensor-train decomposition. SIAM Journal on Scientific Computing, 2011, 33(5): 2295–2317

  32. Han T, Wang P, Niu S, Li C. Modality matches modality: pretraining modality-disentangled item representations for recommendation. In: Proceedings of the ACM Web Conference 2022. 2022, 2058–2066

  33. Long Y, She X, Mukhopadhyay S. Design of reliable DNN accelerator with un-reliable ReRAM. In: Proceedings of 2019 Design, Automation & Test in Europe Conference & Exhibition. 2019, 1769–1774

  34. Dong X, Xu C, Xie Y, Jouppi N P. NVSim: a circuit-level performance, energy, and area model for emerging nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2012, 31(7): 994–1007

  35. Wang Y, Zhu Z, Chen F, Ma M, Dai G, Wang Y, Li H, Chen Y. REREC: in-ReRAM acceleration with access-aware mapping for personalized recommendation. In: Proceedings of 2021 IEEE/ACM International Conference on Computer Aided Design. 2021, 1–9

  36. Muralimanohar N, Balasubramonian R, Jouppi N. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. 2007, 3–14

  37. Jiang N, Becker D U, Michelogiannakis G, Balfour J, Towles B, Shaw D E, Kim J, Dally W J. A detailed and flexible cycle-accurate network-on-chip simulator. In: Proceedings of 2013 IEEE International Symposium on Performance Analysis of Systems and Software. 2013, 86–96

  38. Huang Y, Zheng L, Yao P, Wang Q, Liao X, Jin H, Xue J. Accelerating graph convolutional networks using crossbar-based processing-in-memory architectures. In: Proceedings of 2022 IEEE International Symposium on High-Performance Computer Architecture. 2022, 1029–1042

  39. Qu Y, Cai H, Ren K, Zhang W, Yu Y, Wen Y, Wang J. Product-based neural networks for user response prediction. In: Proceedings of the 16th IEEE International Conference on Data Mining. 2016, 1149–1154

  40. Qu Y, Fang B, Zhang W, Tang R, Niu M, Guo H, Yu Y, He X. Product-based neural networks for user response prediction over multi-field categorical data. ACM Transactions on Information Systems, 2019, 37(1): 5

  41. Ko H, Lee S, Park Y, Choi A. A survey of recommendation systems: recommendation models, techniques, and application fields. Electronics, 2022, 11(1): 141

  42. Chen D, Jin H, Zheng L, Huang Y, Yao P, Gui C, Wang Q, Liu H, He H, Liao X, Zheng R. A general offloading approach for near-DRAM processing-in-memory architectures. In: Proceedings of 2022 IEEE International Parallel and Distributed Processing Symposium. 2022, 246–257

  43. Chen D, He H, Jin H, Zheng L, Huang Y, Shen X, Liao X. MetaNMP: leveraging Cartesian-like product to accelerate HGNNs with near-memory processing. In: Proceedings of the 50th Annual International Symposium on Computer Architecture. 2023, 56

  44. Kwon Y, Lee Y, Rhu M. Tensor casting: co-designing algorithm-architecture for personalized recommendation training. In: Proceedings of 2021 IEEE International Symposium on High-Performance Computer Architecture. 2021, 235–248

  45. Wilkening M, Gupta U, Hsia S, Trippel C, Wu C J, Brooks D, Wei G Y. RecSSD: near data processing for solid state drive based recommendation inference. In: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2021, 717–729

Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2022YFB4501403), the National Natural Science Foundation of China (Grant Nos. 62322205, 62072195, 61825202, and 61832006), and the Zhejiang Lab (No. 2022PI0AC02).

Author information

Corresponding author

Correspondence to Long Zheng.

Ethics declarations

Competing interests The authors declare that they have no competing interests or financial conflicts to disclose.

Additional information

Xinyang Shen is currently a PhD student at the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), China. His research interests include ReRAM-based processing in memory, graph processing, and recommendation systems.

Xiaofei Liao received his PhD degree in computer science and engineering from Huazhong University of Science and Technology (HUST), China in 2005. He is currently a Professor in the School of Computer Science and Technology at HUST. He has served as a reviewer for many conferences and journals. He is a member of the IEEE. His research interests are in the areas of system software, P2P systems, cluster computing, graph processing, and streaming services.

Long Zheng is an Associate Professor in the School of Computer Science and Technology at Huazhong University of Science and Technology (HUST), China. He received his PhD degree from HUST in 2016. His current research interests include runtime systems, program analysis, and configurable computer architecture.

Yu Huang received a BS degree from the Huazhong University of Science and Technology (HUST), China in 2016. He is now working toward a PhD degree at the School of Computer Science and Technology, HUST, China. His research interests focus on distributed stream processing and graph processing.

Dan Chen received a BS degree from the North China Electric Power University, China in 2018. He is now working toward a PhD degree at the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), China. His research interests focus on processing-in-memory and graph neural networks.

Hai Jin is a Professor of computer science and engineering at Huazhong University of Science and Technology (HUST) in China. He received his PhD in computer engineering from HUST in 1994. He is the chief scientist of ChinaGrid, the largest grid computing project in China. He is an IEEE Fellow, a CCF Fellow, and a member of the ACM. His research interests include computer architecture, big data processing, data storage, and system security.

About this article

Cite this article

Shen, X., Liao, X., Zheng, L. et al. ARCHER: a ReRAM-based accelerator for compressed recommendation systems. Front. Comput. Sci. 18, 185607 (2024). https://doi.org/10.1007/s11704-023-3397-x
