ABSTRACT
Deep learning training on cloud platforms typically follows the convention of separating storage from compute: training executes on a compute cluster equipped with GPUs/TPUs while reading data from a separate cluster that hosts the storage service. To alleviate the potential I/O bottleneck, a training cluster usually leverages its local storage as a cache to reduce remote I/O to the storage cluster. However, existing deep learning schedulers do not manage storage resources and thus fail to consider the diverse caching effects across different training jobs, which can degrade scheduling quality significantly.
To address this issue, we present SiloD, a scheduling framework that co-designs the cluster scheduler and the cache subsystem for deep learning training. SiloD treats cache and remote I/O as first-class resources and can integrate different state-of-the-art deep learning scheduling policies in a unified scheduling framework. To achieve this, SiloD develops an enhanced job performance estimator that helps different schedulers jointly consider the impact of storage and compute resource allocation while preserving their respective scheduling objectives. The estimator leverages the unique data access pattern of deep learning training to derive a closed-form analytic model that captures the diverse cache and remote I/O requirements of different training jobs. Evaluations show that SiloD improves average job completion time, cluster utilization, and fairness by up to 7.4x, 2.57x, and 1.89x, respectively, compared to combinations of cache systems and cluster schedulers that operate independently.
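The abstract does not reproduce the closed-form model, but its core insight follows from the data access pattern it mentions: each training epoch reads every sample exactly once in a randomly shuffled order, so the cache hit ratio is largely insensitive to the eviction policy and is approximately the ratio of cache size to dataset size. The sketch below illustrates a throughput estimator built on that observation; it is a minimal illustration under these assumptions, and all names (estimate_throughput, compute_tput, remote_bw) are hypothetical rather than SiloD's actual API.

```python
def estimate_throughput(compute_tput, remote_bw, cache_size, dataset_size,
                        sample_size):
    """Estimate a job's end-to-end training throughput in samples/sec.

    Each epoch reads every sample once in random order, so a cache holding
    cache_size of dataset_size bytes serves roughly a cache_size/dataset_size
    fraction of reads; the rest must be fetched over remote I/O.
    """
    hit_ratio = min(cache_size / dataset_size, 1.0)
    if hit_ratio >= 1.0:
        io_tput = float("inf")  # fully cached: data loading never throttles
    else:
        # Remote bandwidth only carries cache misses, so the sustainable
        # data-loading rate is amplified by a factor of 1 / (1 - hit_ratio).
        io_tput = (remote_bw / sample_size) / (1.0 - hit_ratio)
    # The job runs at whichever of compute and data loading is slower.
    return min(compute_tput, io_tput)


# Example: 1,000 samples/s of compute, a 100 GB cache over a 150 GB dataset,
# 100 MB/s of remote bandwidth, 1 MB samples. Misses are 1/3 of reads, so the
# I/O-bound rate is (100 MB/s / 1 MB) / (1/3) = 300 samples/s.
print(estimate_throughput(1000, 100e6, 100e9, 150e9, 1e6))  # -> 300.0
```

A scheduler could evaluate such an estimator for every candidate allocation of compute, cache, and remote bandwidth, weighing, for instance, giving a job another GPU against giving it more cache.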