DOI: 10.1145/3588195.3595940

COLTI: Towards Concurrent and Co-located DNN Training and Inference

Published: 07 August 2023

ABSTRACT

Deep learning models are extensively used in a wide range of domains, e.g., scientific simulations, predictions, and modeling. However, training these dense networks is both compute- and memory-intensive and typically requires accelerators such as Graphics Processing Units (GPUs). While such DNN workloads consume a major proportion of the limited onboard high-bandwidth memory (HBM), they typically underutilize the GPU compute resources. In such scenarios, the idle compute resources on the GPU can be leveraged to run pending jobs that can either be (1) accommodated in the remaining HBM or (2) share memory resources with other concurrent workloads. However, state-of-the-art workload schedulers and DNN runtimes are not designed to leverage HBM co-location to improve resource utilization and throughput. In this work, we propose COLTI, which introduces a set of novel techniques to address these challenges by co-locating DNN training and inference on memory-constrained GPU devices. Our preliminary evaluation of three DNN models implemented in the PyTorch framework demonstrates up to 37% and 40% improvements in makespan and memory utilization, respectively.
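
To illustrate the kind of co-location the abstract describes, the sketch below runs a training loop and an inference workload concurrently on a single GPU using separate CUDA streams in PyTorch. This is a minimal sketch, not the COLTI system itself (whose scheduler and memory-sharing techniques are not detailed here); the models, batch sizes, and synthetic data are illustrative placeholders, and both jobs are assumed to fit together in HBM.

```python
import torch
import torch.nn as nn

assert torch.cuda.is_available(), "this sketch assumes a CUDA-capable GPU"
device = torch.device("cuda")

# A small "training" job: a dense network being optimized on synthetic data.
train_model = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU(), nn.Linear(2048, 10)).to(device)
optimizer = torch.optim.SGD(train_model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# A second "inference" job served concurrently on the same GPU.
infer_model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
infer_model.eval()

# Separate CUDA streams let the driver overlap the two workloads whenever the
# training kernels leave compute resources idle.
train_stream = torch.cuda.Stream()
infer_stream = torch.cuda.Stream()
torch.cuda.synchronize()  # ensure model initialization has finished before streaming work

for step in range(100):
    # Launch one training step on its own stream.
    with torch.cuda.stream(train_stream):
        x = torch.randn(64, 1024, device=device)
        y = torch.randint(0, 10, (64,), device=device)
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(train_model(x), y)
        loss.backward()
        optimizer.step()

    # Launch an inference batch on the second stream; because kernel launches are
    # asynchronous, the GPU can execute it alongside the training kernels.
    with torch.cuda.stream(infer_stream):
        with torch.no_grad():
            q = torch.randn(32, 1024, device=device)
            _ = infer_model(q)

torch.cuda.synchronize()
```

In practice a co-location runtime would also have to decide which pending jobs fit in the remaining HBM and how to arbitrate memory between them; the stream-based overlap shown here only addresses the compute-sharing side of the problem.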


Published in

HPDC '23: Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
August 2023, 350 pages
ISBN: 9798400701559
DOI: 10.1145/3588195
• General Chair: Ali R. Butt
• Program Chairs: Ningfang Mi, Kyle Chard

          Copyright © 2023 Owner/Author

          Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Qualifiers

          • poster

          Acceptance Rates

Overall Acceptance Rate: 166 of 966 submissions, 17%

