ABSTRACT
Model-serving systems expose machine learning (ML) models to applications programmatically via a high-level API. Cloud platforms use these systems to mask the complexities of optimally managing resources and servicing inference requests across multiple applications. Model serving at the edge is now also becoming increasingly important to support inference workloads with tight latency requirements. However, edge model serving differs substantially from cloud model serving in its latency, energy, and accuracy constraints: these systems must support multiple applications with widely different latency and accuracy requirements on embedded edge accelerators with limited computational and energy resources.
To address this problem, this paper presents Dělen, a flexible and adaptive model-serving system for multi-tenant edge AI. Dělen exposes a high-level API that enables individual edge applications to specify a bound at runtime on the latency, accuracy, or energy of their inference requests. We implement Dělen efficiently using conditional execution in multi-exit deep neural networks (DNNs), which enables granular control over inference requests, and evaluate it on a resource-constrained Jetson Nano edge accelerator. We evaluate Dělen’s flexibility by implementing state-of-the-art adaptation policies using its API, and evaluate its adaptability under different workload dynamics and goals when running single and multiple applications.
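To make the idea of a per-request bound concrete, the following is a minimal sketch of how a serving layer might map a bound on latency, energy, or accuracy to an exit of a multi-exit DNN. It is not Dělen's actual API: the `ExitProfile` type, the `select_exit` function, and all profile numbers are hypothetical, and the sketch assumes each exit has been profiled offline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ExitProfile:
    """Hypothetical offline profile of one early exit in a multi-exit DNN."""
    index: int          # position of the exit, 0 = shallowest
    latency_ms: float   # time to run inference up to this exit
    energy_mj: float    # energy to run inference up to this exit
    accuracy: float     # expected accuracy when stopping at this exit


# Illustrative values only; these are NOT measurements from the paper.
PROFILES: List[ExitProfile] = [
    ExitProfile(0, 5.0, 20.0, 0.70),
    ExitProfile(1, 9.0, 35.0, 0.80),
    ExitProfile(2, 14.0, 55.0, 0.86),
    ExitProfile(3, 20.0, 80.0, 0.90),
]


def select_exit(profiles: List[ExitProfile], kind: str,
                bound: float) -> Optional[ExitProfile]:
    """Pick an exit that satisfies one bound (assumed policy, for illustration).

    - "latency" / "energy": the bound is an upper limit; choose the most
      accurate exit that stays within it.
    - "accuracy": the bound is a lower limit; choose the cheapest (lowest
      latency) exit that reaches it.
    Returns None when no exit satisfies the bound.
    """
    if kind == "latency":
        feasible = [p for p in profiles if p.latency_ms <= bound]
        return max(feasible, key=lambda p: p.accuracy, default=None)
    if kind == "energy":
        feasible = [p for p in profiles if p.energy_mj <= bound]
        return max(feasible, key=lambda p: p.accuracy, default=None)
    if kind == "accuracy":
        feasible = [p for p in profiles if p.accuracy >= bound]
        return min(feasible, key=lambda p: p.latency_ms, default=None)
    raise ValueError(f"unknown bound kind: {kind!r}")


if __name__ == "__main__":
    # An application asking for at most 10 ms per request.
    print(select_exit(PROFILES, "latency", 10.0))
    # An application asking for at least 85% accuracy.
    print(select_exit(PROFILES, "accuracy", 0.85))
```

In this sketch the bound is resolved once per request from static profiles. A real system would also have to account for contention between tenants and for input-dependent exit confidence, which the sketch deliberately ignores.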