ABSTRACT
Deep learning training on cloud platforms typically follows the convention of separating storage from compute: training executes on a compute cluster equipped with GPUs/TPUs while reading data from a separate cluster that hosts the storage service. To alleviate the potential I/O bottleneck, a training cluster usually leverages its local storage as a cache to reduce remote I/O to the storage cluster. However, existing deep learning schedulers do not manage storage resources and thus fail to consider the diverse caching effects across different training jobs, which can degrade scheduling quality significantly.
To address this issue, we present SiloD, a scheduling framework that co-designs the cluster scheduler and the cache subsystem for deep learning training. SiloD treats cache and remote I/O as first-class resources and can integrate different state-of-the-art deep learning scheduling policies in a unified scheduling framework. To achieve this, SiloD develops an enhanced job performance estimator that helps different schedulers jointly consider the impact of storage and compute resource allocation while preserving their respective scheduling objectives. The estimator leverages the unique data access pattern of deep learning training to derive a closed-form analytic model that captures the diverse cache and remote I/O requirements of different training jobs. Evaluations show that SiloD improves average job completion time, cluster utilization, and fairness by up to 7.4x, 2.57x, and 1.89x, respectively, compared to combinations of cache systems and cluster schedulers that operate independently.
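The abstract does not reproduce the closed-form model, but its core insight follows from the data access pattern it mentions: each training epoch reads every sample exactly once in a randomly shuffled order, so the cache hit ratio is largely insensitive to the eviction policy and is approximately the ratio of cache size to dataset size. The sketch below illustrates a throughput estimator built on that observation; it is a minimal illustration under these assumptions, and all names (estimate_throughput, compute_tput, remote_bw) are hypothetical rather than SiloD's actual API.

```python
def estimate_throughput(compute_tput, remote_bw, cache_size, dataset_size,
                        sample_size):
    """Estimate a job's end-to-end training throughput in samples/sec.

    Each epoch reads every sample once in random order, so a cache holding
    cache_size of dataset_size bytes serves roughly a cache_size/dataset_size
    fraction of reads; the rest must be fetched over remote I/O.
    """
    hit_ratio = min(cache_size / dataset_size, 1.0)
    if hit_ratio >= 1.0:
        io_tput = float("inf")  # fully cached: data loading never throttles
    else:
        # Remote bandwidth only carries cache misses, so the sustainable
        # data-loading rate is amplified by a factor of 1 / (1 - hit_ratio).
        io_tput = (remote_bw / sample_size) / (1.0 - hit_ratio)
    # The job runs at whichever of compute and data loading is slower.
    return min(compute_tput, io_tput)


# Example: 1,000 samples/s of compute, a 100 GB cache over a 150 GB dataset,
# 100 MB/s of remote bandwidth, 1 MB samples. Misses are 1/3 of reads, so the
# I/O-bound rate is (100 MB/s / 1 MB) / (1/3) = 300 samples/s.
print(estimate_throughput(1000, 100e6, 100e9, 150e9, 1e6))  # -> 300.0
```

A scheduler could evaluate such an estimator for every candidate allocation of compute, cache, and remote bandwidth, weighing, for instance, giving a job another GPU against giving it more cache.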