UGACHE: A Unified GPU Cache for Embedding-based Deep Learning

ABSTRACT
This paper presents UGache, a unified multi-GPU cache system for embedding-based deep learning (EmbDL). UGache is primarily motivated by the unique characteristics of EmbDL applications, namely read-only, batched, skewed, and predictable embedding accesses. UGache introduces a novel factored extraction mechanism that avoids bandwidth congestion to fully exploit high-speed cross-GPU interconnects (e.g., NVLink and NVSwitch). Based on a new hotness metric, UGache also provides a near-optimal cache policy that balances local and remote access to minimize the extraction time. We have implemented UGache and integrated it into two representative frameworks, TensorFlow and PyTorch. Evaluation using two typical types of EmbDL applications, namely graph neural network training and deep learning recommendation inference, shows that UGache outperforms state-of-the-art replication and partition designs by an average of 1.93× and 1.63× (up to 5.25× and 3.45×), respectively.
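The cache policy described above weighs local, remote-GPU, and host fetch costs against per-embedding access frequency ("hotness"). The toy sketch below illustrates that trade-off with a greedy heuristic: replicate the hottest entries on every GPU, then partition the remaining cache budget so peers can serve each other over NVLink. All names, cost constants, and the greedy scheme are illustrative assumptions for exposition only; UGache itself formulates a near-optimal policy rather than this heuristic.

```python
# Per-entry fetch cost (arbitrary units): local HBM < NVLink peer < PCIe host.
COST_LOCAL, COST_REMOTE, COST_HOST = 1.0, 3.0, 20.0

def plan_cache(hotness, num_gpus, capacity):
    """Greedy placement: replicate the hottest entries on every GPU,
    then partition the next-hottest entries round-robin across GPUs."""
    order = sorted(range(len(hotness)), key=lambda e: -hotness[e])
    caches = [set() for _ in range(num_gpus)]
    # Phase 1: replicate the very hottest entries (cheap local hits everywhere).
    replicated = order[:capacity // 2]
    for g in range(num_gpus):
        caches[g].update(replicated)
    # Phase 2: spend the remaining per-GPU budget on partitioned entries,
    # so a miss on one GPU can often be served by a peer instead of the host.
    rest = order[capacity // 2:]
    slots = capacity - capacity // 2
    for i, e in enumerate(rest[:slots * num_gpus]):
        caches[i % num_gpus].add(e)
    return caches

def extraction_cost(caches, hotness, gpu):
    """Expected per-batch extraction cost for one GPU under this placement."""
    anywhere = set().union(*caches)
    cost = 0.0
    for e, h in enumerate(hotness):
        if e in caches[gpu]:
            cost += h * COST_LOCAL      # local cache hit
        elif e in anywhere:
            cost += h * COST_REMOTE     # fetched from a peer GPU's cache
        else:
            cost += h * COST_HOST       # fetched from host memory
    return cost
```

Under a skewed (power-law-like) hotness distribution, this mixed replicate-then-partition placement yields a far lower expected extraction cost than serving every access from host memory, which is the intuition behind balancing local and remote access.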