Abstract
Recommendation systems have been widely embedded into many Internet services. For example, Meta's deep learning recommendation model (DLRM) achieves high predictive accuracy for click-through rate by processing large-scale embedding tables. The SparseLengthSum (SLS) kernel dominates DLRM inference time due to intensive, irregular memory accesses to the embedding vectors. Some prior works directly adopt near-data processing (NDP) solutions to obtain higher memory bandwidth to accelerate SLS. However, their inferior memory hierarchies yield a low performance-to-cost ratio and fail to fully exploit the data locality. Although some software-managed cache policies have been proposed to improve the cache hit rate, the incurred cache miss penalty is unacceptable given the high overheads of executing the corresponding programs and of communication between the host and the accelerator. To address these issues, we propose EMS-i, an efficient memory system design that integrates the Solid State Drive (SSD) into the memory hierarchy using Compute Express Link (CXL) for recommendation system inference. We specialize the caching mechanism according to the characteristics of various DLRM workloads and propose a novel prefetching mechanism to further improve performance. In addition, we carefully design the inference kernel and develop a customized mapping scheme for the SLS operation, considering the multi-level parallelism in SLS and the data locality within a batch of queries. Compared to state-of-the-art NDP solutions, EMS-i achieves up to 10.9× speedup over RecSSD and performance comparable to RecNMP with 72% energy savings. EMS-i also reduces memory cost by up to 8.7× and 6.6× relative to RecSSD and RecNMP, respectively.
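To make the memory-access pattern concrete, the SLS operation described above can be sketched as a pooled embedding lookup: each query supplies a variable-length list of row indices, the corresponding embedding vectors are gathered, and the gathered rows are reduced with a sum. This is a minimal illustrative sketch (the table shape, index layout, and function name are assumptions for illustration, not the paper's implementation); the scattered row gathers are the irregular memory accesses that dominate inference time.

```python
import numpy as np

def sparse_lengths_sum(table, indices, lengths):
    """Illustrative sketch of an SLS-style pooled embedding lookup.

    table   : (num_rows, dim) embedding table
    indices : flat array of row indices for all lookups
    lengths : number of indices belonging to each lookup ("bag")
    Returns one pooled (summed) embedding vector per bag.
    """
    out = np.empty((len(lengths), table.shape[1]), dtype=table.dtype)
    offset = 0
    for i, n in enumerate(lengths):
        rows = indices[offset:offset + n]  # irregular gather across the table
        out[i] = table[rows].sum(axis=0)   # sum-pool the gathered vectors
        offset += n
    return out
```

Because the indices are data-dependent and sparse, consecutive lookups touch distant rows of a table that can be tens of gigabytes in production, which is why the paper targets the memory hierarchy rather than compute.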
Index Terms
- EMS-i: An Efficient Memory System Design with Specialized Caching Mechanism for Recommendation Inference