Inferencing on Edge Devices: A Time- and Space-aware Co-scheduling Approach

Published: 19 March 2023

Abstract

Neural Network (NN)-based real-time inferencing tasks are often co-scheduled on GPGPU-style edge platforms. Existing works advocate using different NN parameters for the same detection task in different environments. However, realizing such approaches remains challenging, given accelerator devices' limited on-chip memory capacity. As a solution, we propose a multi-pass, time- and space-aware scheduling infrastructure for embedded platforms with GPU accelerators. The framework manages the residency of NN parameters in the limited on-chip memory while simultaneously dispatching the relevant compute operations. The mapping of memory operations and compute operations to the underlying resources of the platform is first determined offline by a constraint-solver-assisted scheduler that optimizes schedule makespan. This is followed by memory optimization passes that take the memory budget into account and adjust the start times of memory and compute operations accordingly. Our approach achieves 74%–90% savings in peak memory utilization with 0%–33% deadline misses, for schedules that suffer deadline-miss rates of 25%–100% when run using existing methods.
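As a concrete illustration of the offline constraint-solving step described above, the following is a minimal sketch, not the paper's formulation, of how an SMT-based scheduler might assign start times to weight-load and compute operations so that precedence constraints hold, operations that overlap in time fit within an assumed on-chip memory budget, and the schedule makespan is minimized. The task set, durations, footprints, and MEM_BUDGET_KB below are all hypothetical, and the z3-solver Python bindings are used only as one possible constraint solver.

from z3 import Optimize, Int, If, And, Sum, sat

# Hypothetical operation set: (duration, on-chip footprint in KB, predecessor indices).
# Even indices model weight-load (memory) operations, odd indices the compute
# operations that consume those weights. For simplicity, each operation's
# footprint is modeled as resident only for its own duration.
ops = [
    (4, 512, []),      # load parameters of layer 0
    (3, 256, [0]),     # run layer 0
    (4, 512, [0]),     # load parameters of layer 1
    (5, 256, [1, 2]),  # run layer 1
]
MEM_BUDGET_KB = 1024   # assumed on-chip memory budget

opt = Optimize()
start = [Int(f"s{i}") for i in range(len(ops))]
makespan = Int("makespan")

for i, (dur, _, preds) in enumerate(ops):
    opt.add(start[i] >= 0, start[i] + dur <= makespan)
    for p in preds:
        # An operation may begin only after all of its predecessors finish.
        opt.add(start[i] >= start[p] + ops[p][0])

# Space awareness: residency only changes at an operation's start or end,
# so checking the budget at every start instant bounds the peak.
for i in range(len(ops)):
    live = [If(And(start[j] <= start[i], start[i] < start[j] + d_j), m_j, 0)
            for j, (d_j, m_j, _) in enumerate(ops)]
    opt.add(Sum(live) <= MEM_BUDGET_KB)

opt.minimize(makespan)
if opt.check() == sat:
    model = opt.model()
    for i in range(len(ops)):
        print(f"op{i}: start = {model[start[i]]}")
    print("makespan =", model[makespan])

A real deployment would, as the abstract describes, follow such an offline solution with memory-optimization passes that further adjust the start times of memory and compute operations against the platform's actual memory budget.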


Cited By

(2024) DRL-based Multi-Stream Scheduling of Inference Pipelines on Edge Devices. In 2024 37th International Conference on VLSI Design and 2024 23rd International Conference on Embedded Systems (VLSID), 324–329. DOI: 10.1109/VLSID60093.2024.00060. Online publication date: 6 January 2024.

Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 28, Issue 3
May 2023, 456 pages
ISSN: 1084-4309
EISSN: 1557-7309
DOI: 10.1145/3587887

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 March 2023
Online AM: 15 December 2022
Accepted: 04 December 2022
Revised: 26 November 2022
Received: 10 July 2022
Published in TODAES Volume 28, Issue 3

Author Tags

  1. Convolutional neural network
  2. edge device
  3. GPU
  4. Satisfiability Modulo Theories

Qualifiers

  • Research-article
