Abstract
Recommendation systems are widely deployed in modern data centers. Their random and sparse embedding lookups are the main performance bottleneck on traditional platforms because they induce abundant data movement between computing units and memory. ReRAM-based processing-in-memory (PIM) can resolve this problem by processing embedding vectors where they are stored. However, the embedding table can easily exceed the capacity of a monolithic ReRAM-based PIM chip, inducing off-chip accesses that may offset the benefits of PIM. We therefore deploy a decomposed (compressed) embedding model entirely on-chip and leverage the high computing efficiency of ReRAM to compensate for the performance loss of decompression. In this paper, we propose ARCHER, a ReRAM-based PIM architecture that performs fully on-chip recommendation under resource constraints. First, we thoroughly analyze the computation and access patterns of the decomposed table. Based on the computation pattern, we unify the operations of each layer of the decomposed model as multiply-and-accumulate (MAC) operations. Based on the access pattern, we propose a hierarchical mapping scheme and a specialized hardware design that maximize resource utilization. Under the unified computation and mapping strategy, we coordinate the pipeline across processing elements. The evaluation shows that ARCHER outperforms the state-of-the-art GPU-based DLRM system, the state-of-the-art near-memory processing recommendation system RecNMP, and the ReRAM-based recommendation accelerator REREC by 15.79×, 2.21×, and 1.21× in performance and by 56.06×, 6.45×, and 1.71× in energy savings, respectively.
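As a concrete illustration of why a decomposed embedding lookup reduces to MAC operations (a minimal sketch, not ARCHER's implementation): in a tensor-train (TT) decomposed table, as in TT-Rec [25], reconstructing one embedding row is a chain of small matrix multiplications over the TT ranks, exactly the workload a ReRAM crossbar executes in place. All sizes, names, and the rank/index conventions below are illustrative assumptions:

```python
import numpy as np

# Hypothetical sizes: an embedding table of N = 8*8*8 = 512 rows and
# D = 2*2*2 = 8 dims, stored as three TT cores with ranks (1, 4, 4, 1).
ns, ds, ranks = [8, 8, 8], [2, 2, 2], [1, 4, 4, 1]
rng = np.random.default_rng(0)
cores = [rng.standard_normal((ranks[k], ns[k], ds[k], ranks[k + 1]))
         for k in range(3)]

def tt_embedding_lookup(row):
    """Reconstruct one embedding row from TT cores via chained MACs."""
    # Decompose the flat row index into per-core indices (mixed radix).
    idx = []
    for n in reversed(ns):
        idx.append(row % n)
        row //= n
    idx.reverse()
    # Each step below is a matrix multiplication over the TT ranks,
    # i.e., a batch of multiply-and-accumulate operations.
    out = np.ones((1, 1))                       # shape (1, r0), r0 = 1
    for k, i in enumerate(idx):
        slice_k = cores[k][:, i, :, :]          # (r_k, d_k, r_{k+1})
        r_k, d_k, r_next = slice_k.shape
        out = out @ slice_k.reshape(r_k, d_k * r_next)  # MAC step
        out = out.reshape(-1, r_next)           # (d_1*...*d_k, r_{k+1})
    return out.reshape(-1)                      # final (D,) vector

vec = tt_embedding_lookup(42)
print(vec.shape)   # (8,)
```

The three small cores here hold 8·2·4 + 4·8·2·4 + 4·8·2 = 384 parameters versus 512·8 = 4096 for the dense table, which is the capacity saving that makes a fully on-chip deployment feasible; the cost is the extra MAC chain per lookup that the abstract refers to as decompression.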
References
Ke L, Gupta U, Cho B Y, Brooks D, Chandra V, Diril U, Firoozshahian A, Hazelwood K, Jia B, Lee H H S, Li M, Maher B, Mudigere D, Naumov M, Schatz M, Smelyanskiy M, Wang X, Reagen B, Wu C J, Hempstead M, Zhang X. RecNMP: Accelerating personalized recommendation with near-memory processing. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 790–803
Naumov M, Mudigere D, Shi H J M, Huang J, Sundaraman N, Park J, Wang X, Gupta U, Wu C J, Azzolini A G, Dzhulgakov D, Mallevich A, Cherniavskii I, Lu Y, Krishnamoorthi R, Yu A, Kondratenko V, Pereira S, Chen X, Chen W, Rao V, Jia B, Xiong L, Smelyanskiy M. Deep learning recommendation model for personalization and recommendation systems. 2019, arXiv preprint arXiv: 1906.00091
Gupta U, Wu C J, Wang X, Naumov M, Reagen B, Brooks D, Cottel B, Hazelwood K, Hempstead M, Jia B, Lee H H S, Malevich A, Mudigere D, Smelyanskiy M, Xiong L, Zhang X. The architectural implications of Facebook’s DNN-based personalized recommendation. In: Proceedings of 2020 IEEE International Symposium on High Performance Computer Architecture. 2020, 488–501
Wu J, He X, Wang X, Wang Q, Chen W, Lian J, Xie X. Graph convolution machine for context-aware recommender system. Frontiers of Computer Science, 2022, 16(6): 166614
Guo H, Tang R, Ye Y, Li Z, He X, Dong Z. DeepFM: an end-to-end wide & deep learning framework for CTR prediction. 2018, arXiv preprint arXiv: 1804.04950
Zhou G, Mou N, Fan Y, Pi Q, Bian W, Zhou C, Zhu X, Gai K. Deep interest evolution network for click-through rate prediction. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence. 2019, 5941–5948
Hwang R, Kim T, Kwon Y, Rhu M. Centaur: a chiplet-based, hybrid sparse-dense accelerator for personalized recommendations. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 968–981
Kal H, Lee S, Ko G, Ro W W. SPACE: locality-aware processing in heterogeneous memory for personalized recommendations. In: Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture. 2021, 679–691
Shafiee A, Nag A, Muralimanohar N, Balasubramonian R, Strachan J P, Hu M, Williams R S, Srikumar V. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News, 2016, 44(3): 14–26
Chi P, Li S, Xu C, Zhang T, Zhao J, Liu Y, Wang Y, Xie Y. PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. ACM SIGARCH Computer Architecture News, 2016, 44(3): 27–39
Imani M, Gupta S, Kim Y, Rosing T. FloatPIM: in-memory acceleration of deep neural network training with high precision. In: Proceedings of the 46th ACM/IEEE Annual International Symposium on Computer Architecture. 2019, 802–815
Song L, Zhuo Y, Qian X, Li H, Chen Y. GraphR: accelerating graph processing using ReRAM. In: Proceedings of 2018 IEEE International Symposium on High Performance Computer Architecture. 2018, 531–543
Huang Y, Zheng L, Yao P, Zhao J, Liao X, Jin H, Xue J. A heterogeneous PIM hardware-software co-design for energy-efficient graph processing. In: Proceedings of 2020 IEEE International Parallel and Distributed Processing Symposium. 2020, 684–695
Zheng L, Zhao J, Huang Y, Wang Q, Zeng Z, Xue J, Liao X, Jin H. Spara: an energy-efficient ReRAM-based accelerator for sparse graph analytics applications. In: Proceedings of 2020 IEEE International Parallel and Distributed Processing Symposium. 2020, 696–707
Arka A I, Doppa J R, Pande P P, Joardar B K, Chakrabarty K. ReGraphX: NoC-enabled 3D heterogeneous ReRAM architecture for training graph neural networks. In: Proceedings of 2021 Design, Automation & Test in Europe Conference & Exhibition. 2021, 1667–1672
Zha Y, Li J. Hyper-AP: enhancing associative processing through a full-stack optimization. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 846–859
Imani M, Pampana S, Gupta S, Zhou M, Kim Y, Rosing T. DUAL: acceleration of clustering algorithms using digital-based processing in-memory. In: Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture. 2020, 356–371
Niu D, Xu C, Muralimanohar N, Jouppi N P, Xie Y. Design of cross-point metal-oxide ReRAM emphasizing reliability and cost. In: Proceedings of 2013 IEEE/ACM International Conference on Computer-Aided Design. 2013, 17–23
Wong H S P, Lee H Y, Yu S, Chen Y S, Wu Y, Chen P S, Lee B, Chen F T, Tsai M J. Metal–oxide RRAM. Proceedings of the IEEE, 2012, 100(6): 1951–1970
Li H, Jin H, Zheng L, Huang Y, Liao X. ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory. Frontiers of Computer Science, 2023, 17(2): 172103
Yin C, Acun B, Wu C J, Liu X. TT-Rec: Tensor train compression for deep learning recommendation models. 2021, arXiv preprint arXiv: 2101.11714
Hu M, Strachan J P, Li Z, Grafals E M, Davila N, Graves C, Lam S, Ge N, Yang J J, Williams R S. Dot-product engine for neuromorphic computing: programming 1T1M crossbar to accelerate matrix-vector multiplication. In: Proceedings of the 53rd ACM/EDAC/IEEE Design Automation Conference. 2016, 1–6
Xu C, Niu D, Muralimanohar N, Balasubramonian R, Zhang T, Yu S, Xie Y. Overcoming the challenges of crossbar resistive memory architectures. In: Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture. 2015, 476–488
Song L, Qian X, Li H, Chen Y. PipeLayer: a pipelined ReRAM-based accelerator for deep learning. In: Proceedings of 2017 IEEE International Symposium on High Performance Computer Architecture. 2017, 541–552
Cai H, Liu B, Chen J, Naviner L, Zhou Y, Wang Z, Yang J. A survey of in-spin transfer torque MRAM computing. Science China Information Sciences, 2021, 64(6): 160402
Luo Y, Wang P, Peng X, Sun X, Yu S. Benchmark of ferroelectric transistor-based hybrid precision synapse for neural network accelerator. IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, 2019, 5(2): 142–150
Xia F, Jiang D J, Xiong J, Sun N H. A survey of phase change memory systems. Journal of Computer Science and Technology, 2015, 30(1): 121–144
Gong N. Multi level cell (MLC) in 3D crosspoint phase change memory array. Science China Information Sciences, 2021, 64(6): 166401
Weinberger K, Dasgupta A, Langford J, Smola A, Attenberg J. Feature hashing for large scale multitask learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. 2009, 1113–1120
Guan H, Malevich A, Yang J, Park J, Yuen H. Post-training 4-bit quantization on embedding tables. 2019, arXiv preprint arXiv: 1911.02079
Oseledets I V. Tensor-train decomposition. SIAM Journal on Scientific Computing, 2011, 33(5): 2295–2317
Han T, Wang P, Niu S, Li C. Modality matches modality: pretraining modality-disentangled item representations for recommendation. In: Proceedings of the ACM Web Conference 2022. 2022, 2058–2066
Long Y, She X, Mukhopadhyay S. Design of reliable DNN accelerator with un-reliable ReRAM. In: Proceedings of 2019 Design, Automation & Test in Europe Conference & Exhibition. 2019, 1769–1774
Dong X, Xu C, Xie Y, Jouppi N P. NVSim: a circuit-level performance, energy, and area model for emerging nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2012, 31(7): 994–1007
Wang Y, Zhu Z, Chen F, Ma M, Dai G, Wang Y, Li H, Chen Y. REREC: in-ReRAM acceleration with access-aware mapping for personalized recommendation. In: Proceedings of 2021 IEEE/ACM International Conference on Computer Aided Design. 2021, 1–9
Muralimanohar N, Balasubramonian R, Jouppi N. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. 2007, 3–14
Jiang N, Becker D U, Michelogiannakis G, Balfour J, Towles B, Shaw D E, Kim J, Dally W J. A detailed and flexible cycle-accurate network-on-chip simulator. In: Proceedings of 2013 IEEE International Symposium on Performance Analysis of Systems and Software. 2013, 86–96
Huang Y, Zheng L, Yao P, Wang Q, Liao X, Jin H, Xue J. Accelerating graph convolutional networks using crossbar-based processing-in-memory architectures. In: Proceedings of 2022 IEEE International Symposium on High-Performance Computer Architecture. 2022, 1029–1042
Qu Y, Cai H, Ren K, Zhang W, Yu Y, Wen Y, Wang J. Product-based neural networks for user response prediction. In: Proceedings of the 16th IEEE International Conference on Data Mining. 2016, 1149–1154
Qu Y, Fang B, Zhang W, Tang R, Niu M, Guo H, Yu Y, He X. Product-based neural networks for user response prediction over multi-field categorical data. ACM Transactions on Information Systems, 2019, 37(1): 5
Ko H, Lee S, Park Y, Choi A. A survey of recommendation systems: recommendation models, techniques, and application fields. Electronics, 2022, 11(1): 141
Chen D, Jin H, Zheng L, Huang Y, Yao P, Gui C, Wang Q, Liu H, He H, Liao X, Zheng R. A general offloading approach for near-DRAM processing-in-memory architectures. In: Proceedings of 2022 IEEE International Parallel and Distributed Processing Symposium. 2022, 246–257
Chen D, He H, Jin H, Zheng L, Huang Y, Shen X, Liao X. MetaNMP: leveraging Cartesian-like product to accelerate HGNNs with near-memory processing. In: Proceedings of the 50th Annual International Symposium on Computer Architecture. 2023, 56
Kwon Y, Lee Y, Rhu M. Tensor casting: co-designing algorithm-architecture for personalized recommendation training. In: Proceedings of 2021 IEEE International Symposium on High-Performance Computer Architecture. 2021, 235–248
Wilkening M, Gupta U, Hsia S, Trippel C, Wu C J, Brooks D, Wei G Y. RecSSD: near data processing for solid state drive based recommendation inference. In: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2021, 717–729
Acknowledgements
This work was supported by the National Key R&D Program of China (No. 2022YFB4501403), the National Natural Science Foundation of China (Grant Nos. 62322205, 62072195, 61825202, and 61832006), and the Zhejiang Lab (No. 2022PI0AC02).
Ethics declarations
Competing interests The authors declare that they have no competing interests or financial conflicts to disclose.
Additional information
Xinyang Shen is currently a PhD student at the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), China. His research interests include ReRAM-based processing in memory, graph processing, and recommendation systems.
Xiaofei Liao received his PhD degree in computer science and engineering from Huazhong University of Science and Technology (HUST), China in 2005. He is currently a Professor in the School of Computer Science and Technology at HUST. He has served as a reviewer for many conferences and journal papers. He is a member of the IEEE. His research interests are in the areas of system software, P2P systems, cluster computing, graph processing, and streaming services.
Long Zheng is now an Associate Professor in the School of Computer Science and Technology at Huazhong University of Science and Technology (HUST), China. He received his PhD degree at HUST in 2016. His current research interests include runtime systems, program analysis, and configurable computer architecture.
Yu Huang received a BS degree from the Huazhong University of Science and Technology (HUST), China in 2016. He is now working toward a PhD degree at the School of Computer Science and Technology, HUST, China. His research interests focus on distributed stream processing and graph processing.
Dan Chen received a BS degree from the North China Electric Power University, China in 2018. He is now working toward a PhD degree at the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), China. His research interests focus on processing-in-memory and graph neural networks.
Hai Jin is a Professor of computer science and engineering at Huazhong University of Science and Technology (HUST) in China. He received his PhD in computer engineering from HUST in 1994. He is the chief scientist of ChinaGrid, the largest grid computing project in China. He is an IEEE Fellow, CCF Fellow, and a member of the ACM. His research interests include computer architecture, big data processing, data storage, and system security.
Cite this article
Shen, X., Liao, X., Zheng, L. et al. ARCHER: a ReRAM-based accelerator for compressed recommendation systems. Front. Comput. Sci. 18, 185607 (2024). https://doi.org/10.1007/s11704-023-3397-x