DOI: 10.1145/3631295.3631401

On Serving Image Classification Models

Published: 11 December 2023

Abstract

This paper aims to optimize model inference in interactive applications by reducing infrastructure costs. It seeks to improve resource utilization, lower costs, and enhance the scalability and responsiveness of model serving systems. The focus is on efficient inference for computer vision, though the approach has potential applications in other domains. The study involved experiments on a single GPU to analyze the impact of input image size and mini-batch size on request delivery time for image classification. Key findings include a model that estimates GPU warm-up time from four parameters, confirmation of a linear relationship between mini-batch size and inference time for a given model, and the need to consider input size when selecting the mini-batch size to avoid GPU crashes. Additionally, two mathematical models are proposed for further exploration with optimization algorithms. We also motivate the need for a more comprehensive mathematical model for soft and relaxed inference model serving.
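
To make the experimental setup concrete, the sketch below illustrates how such a single-GPU experiment might look: it times the forward pass of an image classification model across mini-batch sizes after an explicit warm-up phase, then fits the linear latency-versus-batch-size trend mentioned above. This is an illustrative sketch, not the paper's actual harness; the model choice (torchvision ResNet-50), input resolution, batch sizes, and iteration counts are assumptions.

import time
import numpy as np
import torch
import torchvision.models as models

device = torch.device("cuda")
# Assumed model; untrained weights are fine since only timing is measured.
model = models.resnet50(weights=None).eval().to(device)

def mean_latency(batch_size, image_size=224, warmup=5, iters=20):
    """Average forward-pass time (seconds) for one mini-batch of the given size."""
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):          # warm-up: first calls pay one-off CUDA/init costs
            model(x)
        torch.cuda.synchronize()         # ensure warm-up kernels have finished
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()         # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / iters

batch_sizes = [1, 2, 4, 8, 16, 32]
latencies = [mean_latency(b) for b in batch_sizes]

# Least-squares fit of the (approximately) linear latency-vs-mini-batch-size trend.
slope, intercept = np.polyfit(batch_sizes, latencies, 1)
print(f"latency ~= {slope:.5f} * batch_size + {intercept:.5f} s")

Separating warm-up iterations from timed iterations matters because the first GPU calls include one-off costs (context creation, memory allocation, kernel selection), which is also why warm-up time is worth modelling as a quantity of its own.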

Published In

WoSC '23: Proceedings of the 9th International Workshop on Serverless Computing
December 2023
68 pages
ISBN:9798400704550
DOI:10.1145/3631295
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

In-Cooperation

  • IFIP: International Federation for Information Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU utilization
  2. model inference
  3. resource allocation

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

Middleware '23
