DOI:10.1145/3576842.3582375

Dělen: Enabling Flexible and Adaptive Model-serving for Multi-tenant Edge AI

Published: 09 May 2023

Abstract

Model-serving systems expose machine learning (ML) models to applications programmatically via a high-level API. Cloud platforms use these systems to mask the complexities of optimally managing resources and servicing inference requests across multiple applications. Model serving at the edge is now also becoming increasingly important to support inference workloads with tight latency requirements. However, edge model serving differs substantially from cloud model serving in its latency, energy, and accuracy constraints: these systems must support multiple applications with widely different latency and accuracy requirements on embedded edge accelerators with limited computational and energy resources.
To address this problem, this paper presents Dělen, a flexible and adaptive model-serving system for multi-tenant edge AI. Dělen exposes a high-level API that enables individual edge applications to specify, at runtime, a bound on the latency, accuracy, or energy of their inference requests. We implement Dělen efficiently using conditional execution in multi-exit deep neural networks (DNNs), which enables granular control over inference requests, and evaluate it on a resource-constrained Jetson Nano edge accelerator. We evaluate Dělen’s flexibility by implementing state-of-the-art adaptation policies using its API, and evaluate its adaptability under different workload dynamics and goals when running single and multiple applications.
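As a rough illustration of how a per-request bound might map onto a multi-exit DNN, the sketch below picks an exit from a table of offline profiles. It is not Dělen's API: `ExitProfile`, `select_exit`, the parameter names, and every number in `PROFILES` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ExitProfile:
    """Profile of one exit of a multi-exit DNN (hypothetical values)."""
    index: int         # position of the exit; deeper exits have larger indices
    latency_ms: float  # assumed per-request latency when stopping at this exit
    energy_mj: float   # assumed per-request energy when stopping at this exit
    accuracy: float    # assumed validation accuracy at this exit


def select_exit(
    profiles: List[ExitProfile],
    max_latency_ms: Optional[float] = None,
    max_energy_mj: Optional[float] = None,
    min_accuracy: Optional[float] = None,
) -> Optional[ExitProfile]:
    """Return an exit that satisfies every given bound, or None if none does.

    With an accuracy bound, the cheapest (lowest-latency) feasible exit is
    returned; otherwise the most accurate exit within the budget is returned.
    """
    feasible = [
        p for p in profiles
        if (max_latency_ms is None or p.latency_ms <= max_latency_ms)
        and (max_energy_mj is None or p.energy_mj <= max_energy_mj)
        and (min_accuracy is None or p.accuracy >= min_accuracy)
    ]
    if not feasible:
        return None
    if min_accuracy is not None:
        return min(feasible, key=lambda p: p.latency_ms)
    return max(feasible, key=lambda p: p.accuracy)


# Made-up profiles for a three-exit network; not measurements from the paper.
PROFILES = [
    ExitProfile(index=0, latency_ms=5.0, energy_mj=20.0, accuracy=0.60),
    ExitProfile(index=1, latency_ms=12.0, energy_mj=45.0, accuracy=0.72),
    ExitProfile(index=2, latency_ms=25.0, energy_mj=90.0, accuracy=0.80),
]

if __name__ == "__main__":
    print(select_exit(PROFILES, max_latency_ms=15.0))
    print(select_exit(PROFILES, min_accuracy=0.70))
```

A static lookup like this ignores queueing and contention between tenants; an adaptive system would presumably update such estimates at runtime, which this sketch does not attempt.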

Published In

IoTDI '23: Proceedings of the 8th ACM/IEEE Conference on Internet of Things Design and Implementation
May 2023
514 pages
ISBN:9798400700378
DOI:10.1145/3576842
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IoTDI '23