Inferencing on Edge Devices: A Time- and Space-aware Co-scheduling Approach

Published: 19 March 2023

Abstract

Neural Network (NN)-based real-time inferencing tasks are often co-scheduled on GPGPU-style edge platforms. Existing works advocate using different NN parameters for the same detection task in different environments. However, realizing such approaches remains challenging, given accelerator devices' limited on-chip memory capacity. As a solution, we propose a multi-pass, time- and space-aware scheduling infrastructure for embedded platforms with GPU accelerators. The framework manages the residency of NN parameters in the limited on-chip memory while simultaneously dispatching the relevant compute operations. The mapping of memory operations and compute operations to the underlying resources of the platform is first determined offline by a constraint-solver-assisted scheduler that optimizes schedule makespan. This is followed by memory optimization passes that take the memory budget into account and adjust the start times of memory and compute operations accordingly. Our approach achieves 74%–90% savings in peak memory utilization with 0%–33% deadline misses, for schedules that suffer deadline-miss rates of 25%–100% when run using existing methods.
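As a concrete illustration of the offline constraint-solving step described above, the following is a minimal sketch, not the paper's formulation, of how an SMT-based scheduler might assign start times to weight-load and compute operations so that precedence constraints hold, operations that overlap in time fit within an assumed on-chip memory budget, and the schedule makespan is minimized. The task set, durations, footprints, and MEM_BUDGET_KB below are all hypothetical, and the z3-solver Python bindings are used only as one possible constraint solver.

from z3 import Optimize, Int, If, And, Sum, sat

# Hypothetical operation set: (duration, on-chip footprint in KB, predecessor indices).
# Even indices model weight-load (memory) operations, odd indices the compute
# operations that consume those weights. For simplicity, each operation's
# footprint is modeled as resident only for its own duration.
ops = [
    (4, 512, []),      # load parameters of layer 0
    (3, 256, [0]),     # run layer 0
    (4, 512, [0]),     # load parameters of layer 1
    (5, 256, [1, 2]),  # run layer 1
]
MEM_BUDGET_KB = 1024   # assumed on-chip memory budget

opt = Optimize()
start = [Int(f"s{i}") for i in range(len(ops))]
makespan = Int("makespan")

for i, (dur, _, preds) in enumerate(ops):
    opt.add(start[i] >= 0, start[i] + dur <= makespan)
    for p in preds:
        # An operation may begin only after all of its predecessors finish.
        opt.add(start[i] >= start[p] + ops[p][0])

# Space awareness: residency only changes at an operation's start or end,
# so checking the budget at every start instant bounds the peak.
for i in range(len(ops)):
    live = [If(And(start[j] <= start[i], start[i] < start[j] + d_j), m_j, 0)
            for j, (d_j, m_j, _) in enumerate(ops)]
    opt.add(Sum(live) <= MEM_BUDGET_KB)

opt.minimize(makespan)
if opt.check() == sat:
    model = opt.model()
    for i in range(len(ops)):
        print(f"op{i}: start = {model[start[i]]}")
    print("makespan =", model[makespan])

A real deployment would, as the abstract describes, follow such an offline solution with memory-optimization passes that further adjust the start times of memory and compute operations against the platform's actual memory budget.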


Cited By

(2024) DRL-based Multi-Stream Scheduling of Inference Pipelines on Edge Devices. In 2024 37th International Conference on VLSI Design and 2024 23rd International Conference on Embedded Systems (VLSID), 324–329. DOI: 10.1109/VLSID60093.2024.00060. Online publication date: 6 January 2024.

Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 28, Issue 3
May 2023, 456 pages
ISSN: 1084-4309
EISSN: 1557-7309
DOI: 10.1145/3587887

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 March 2023
Online AM: 15 December 2022
Accepted: 04 December 2022
Revised: 26 November 2022
Received: 10 July 2022
Published in TODAES Volume 28, Issue 3

Author Tags

  1. Convolutional neural network
  2. edge device
  3. GPU
  4. Satisfiability Modulo Theories

Qualifiers

  • Research-article
