Abstract
Recommendation systems have been widely embedded into many Internet services. For example, Meta's deep learning recommendation model (DLRM) achieves high predictive accuracy for click-through rate by processing large-scale embedding tables. The SparseLengthSum (SLS) kernel dominates DLRM inference time due to intensive, irregular memory accesses to the embedding vectors. Some prior works directly adopt near-data processing (NDP) solutions to obtain higher memory bandwidth to accelerate SLS. However, their inferior memory hierarchies yield a low performance-to-cost ratio and fail to fully exploit the data locality. Although some software-managed cache policies have been proposed to improve the cache hit rate, the incurred cache miss penalty is unacceptable given the high overheads of executing the corresponding programs and of communication between the host and the accelerator. To address these issues, we propose EMS-i, an efficient memory system design that integrates the Solid State Drive (SSD) into the memory hierarchy using Compute Express Link (CXL) for recommendation system inference. We specialize the caching mechanism according to the characteristics of various DLRM workloads and propose a novel prefetching mechanism to further improve performance. In addition, we carefully design the inference kernel and develop a customized mapping scheme for the SLS operation, considering the multi-level parallelism in SLS and the data locality within a batch of queries. Compared to state-of-the-art NDP solutions, EMS-i achieves up to 10.9× speedup over RecSSD and performance comparable to RecNMP with 72% energy savings. EMS-i also reduces memory cost by up to 8.7× and 6.6× relative to RecSSD and RecNMP, respectively.
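To make the memory-access pattern concrete, the SLS operation described above can be sketched as a pooled embedding lookup: each query supplies a variable-length list of row indices, the corresponding embedding vectors are gathered, and the gathered rows are reduced with a sum. This is a minimal illustrative sketch (the table shape, index layout, and function name are assumptions for illustration, not the paper's implementation); the scattered row gathers are the irregular memory accesses that dominate inference time.

```python
import numpy as np

def sparse_lengths_sum(table, indices, lengths):
    """Illustrative sketch of an SLS-style pooled embedding lookup.

    table   : (num_rows, dim) embedding table
    indices : flat array of row indices for all lookups
    lengths : number of indices belonging to each lookup ("bag")
    Returns one pooled (summed) embedding vector per bag.
    """
    out = np.empty((len(lengths), table.shape[1]), dtype=table.dtype)
    offset = 0
    for i, n in enumerate(lengths):
        rows = indices[offset:offset + n]  # irregular gather across the table
        out[i] = table[rows].sum(axis=0)   # sum-pool the gathered vectors
        offset += n
    return out
```

Because the indices are data-dependent and sparse, consecutive lookups touch distant rows of a table that can be tens of gigabytes in production, which is why the paper targets the memory hierarchy rather than compute.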
Index Terms
- EMS-i: An Efficient Memory System Design with Specialized Caching Mechanism for Recommendation Inference