EuroSys Conference Proceedings · Research article
DOI: 10.1145/3552326.3567499

SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters

Published: 08 May 2023

ABSTRACT

Deep learning training on cloud platforms typically follows the convention of separating storage and compute: training runs on a compute cluster equipped with GPUs/TPUs while reading data from a separate cluster that hosts the storage service. To alleviate the potential IO bottleneck, a training cluster usually leverages its local storage as a cache to reduce remote IO to the storage cluster. However, existing deep learning schedulers do not manage storage resources and therefore fail to account for the diverse caching effects across training jobs, which can significantly degrade scheduling quality.

To address this issue, we present SiloD, a scheduling framework that co-designs the cluster scheduler and the cache subsystem for deep learning training. SiloD treats cache capacity and remote IO as first-class resources and integrates different state-of-the-art deep learning scheduling policies in a unified scheduling framework. To achieve this, SiloD provides an enhanced job performance estimator that helps the different schedulers jointly consider the impact of storage and compute resource allocation while preserving their respective scheduling objectives. The estimator leverages the distinctive data access pattern of deep learning training to derive a closed-form analytical model that captures the diverse cache and remote-IO requirements of different training jobs. Evaluations show that SiloD improves average job completion time, cluster utilization, and fairness by up to 7.4x, 2.57x, and 1.89x, respectively, compared to different combinations of cache systems and cluster schedulers that operate independently.
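To make the estimator's role concrete, the sketch below illustrates the kind of closed-form throughput model the abstract alludes to, under the common assumption that each training epoch reads the entire dataset once in a random order, so the expected cache hit ratio is roughly the fraction of the dataset that fits in cache. The function name, parameters, and example numbers are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch (not the paper's exact formulation): a closed-form throughput
# estimator for one training job. Assumes per-epoch random access over the
# whole dataset, so the expected cache hit ratio is cache_gb / dataset_gb
# regardless of eviction policy. All names and values are illustrative.

def estimate_throughput(compute_tput, dataset_gb, cache_gb, remote_bw_gbps, sample_gb):
    """Estimate steady-state training throughput (samples/s).

    compute_tput   -- samples/s the allocated GPUs sustain if data never stalls
    dataset_gb     -- total dataset size (GB)
    cache_gb       -- local cache capacity allocated to this job (GB)
    remote_bw_gbps -- remote-storage bandwidth allocated to this job (GB/s)
    sample_gb      -- average size of one training sample (GB)
    """
    hit_ratio = min(cache_gb / dataset_gb, 1.0)   # per-epoch random access
    if hit_ratio >= 1.0:
        return compute_tput                       # fully cached: IO never stalls
    # Only cache misses go to remote storage, so the effective data-loading
    # rate is the allocated remote bandwidth amplified by the hit ratio.
    io_tput = (remote_bw_gbps / sample_gb) / (1.0 - hit_ratio)
    return min(compute_tput, io_tput)             # bound by the slower of compute / IO


if __name__ == "__main__":
    # Same job under two hypothetical cache allocations (0 GB vs. 600 GB).
    for cache in (0.0, 600.0):
        tput = estimate_throughput(compute_tput=2000.0, dataset_gb=1200.0,
                                   cache_gb=cache, remote_bw_gbps=1.0,
                                   sample_gb=0.0001)
        print(f"cache={cache} GB -> {tput:.0f} samples/s")
```

A scheduler built on such a model can compare candidate joint allocations of GPUs, cache, and remote IO for every queued job and pick the one that best serves its own objective (JCT, utilization, or fairness), which is the co-design the abstract describes.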


Published in: EuroSys '23: Proceedings of the Eighteenth European Conference on Computer Systems, May 2023, 910 pages. ISBN: 9781450394871. DOI: 10.1145/3552326.

Copyright © 2023 ACM.


Publisher: Association for Computing Machinery, New York, NY, United States.


Acceptance Rates

Overall acceptance rate: 241 of 1,308 submissions, 18%
