MV-Net: Toward Real-Time Deep Learning on Mobile GPGPU Systems

Authors:

Xiaowei Li

ACM Journal on Emerging Technologies in Computing Systems (JETC), Volume 15, Issue 4

Article No.: 35, Pages 1 - 25

https://doi.org/10.1145/3358696

Published: 03 October 2019

Abstract

The development of deep learning has recently driven rapid growth in vision and speech applications on lightweight embedded and mobile systems. However, the limited computation resources and power delivery capability of embedded platforms are recognized as a significant bottleneck that prevents these systems from providing real-time deep learning, because inference with deep convolutional neural networks (CNNs) and recurrent neural networks (RNNs) involves large numbers of weights and operations. In particular, providing quality-of-service (QoS)-guaranteed neural network inference in the multitask execution environment of multicore SoCs is even more complicated because of resource contention. In this article, we present a novel deep neural network architecture, MV-Net, which provides performance elasticity and contention-aware self-scheduling for QoS enhancement in mobile computing systems. When the QoS constraints, the required output accuracy, or the resource contention status of the system change, MV-Net can dynamically reconfigure the corresponding neural network propagation paths and thus achieve an effective tradeoff between computational complexity and prediction accuracy via approximate computing. The experimental results show that (1) MV-Net significantly improves the performance flexibility of current CNN models and makes it possible to provide always-guaranteed QoS in a multitask environment, and (2) it satisfies the quality-of-results (QoR) requirement, significantly outperforms the baseline implementation, and improves system energy efficiency at the same time.
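
To make the tradeoff described in the abstract concrete, the following is a minimal sketch of choosing among several propagation paths under a latency deadline. It is an illustration only and not the mechanism used by MV-Net: the `Path` record, the `select_path` function, the example latency and accuracy figures, and the simple multiplicative contention model are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Path:
    """One hypothetical propagation path through a network."""
    name: str
    base_latency_ms: float  # assumed latency with no resource contention
    accuracy: float         # assumed expected accuracy in [0, 1]


def select_path(
    paths: Sequence[Path],
    deadline_ms: float,
    min_accuracy: float,
    contention_factor: float,
) -> Optional[Path]:
    """Pick the most accurate path that still meets the deadline.

    Assumption: contention slows every path by the same multiplicative
    factor (1.0 means no contention). Returns None when no path satisfies
    both the deadline (QoS) and the accuracy floor (QoR).
    """
    feasible = [
        p for p in paths
        if p.base_latency_ms * contention_factor <= deadline_ms
        and p.accuracy >= min_accuracy
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.accuracy)


# Hypothetical paths; the numbers are made up for illustration.
PATHS = [
    Path("full", base_latency_ms=40.0, accuracy=0.90),
    Path("medium", base_latency_ms=25.0, accuracy=0.86),
    Path("light", base_latency_ms=12.0, accuracy=0.80),
]

if __name__ == "__main__":
    for factor in (1.0, 1.5, 3.0):
        chosen = select_path(PATHS, deadline_ms=33.0, min_accuracy=0.75,
                             contention_factor=factor)
        print(factor, chosen.name if chosen else None)
```

In this sketch, a heavier contention factor pushes the choice toward cheaper, less accurate paths, and the function reports failure when even the cheapest path would miss the deadline. A real system would need measured latencies and a contention estimate from the platform.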

Cited By

Tang P, Tan Y, Xia J. (2023). Deep sequential collaborative cognition of vision and language based model for video description. Multimedia Tools and Applications 82:23 (36207-36230). Online publication date: 1-Sep-2023
https://dl.acm.org/doi/10.1007/s11042-023-14887-z
Hebaish M, Hussein A, El-Mougy A. (2022). Towards Safe and Efficient Modular Path Planning using Twin Delayed DDPG. 2022 IEEE 95th Vehicular Technology Conference: (VTC2022-Spring) (1-7). Online publication date: Jun-2022
https://doi.org/10.1109/VTC2022-Spring54318.2022.9860536
Hebaish M, Hussein A, El-Mougy A. (2022). Supervised-Reinforcement Learning (SRL) Approach for Efficient Modular Path Planning. 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC) (3537-3542). Online publication date: 8-Oct-2022
https://dl.acm.org/doi/10.1109/ITSC55140.2022.9922495

Index Terms

MV-Net: Toward Real-Time Deep Learning on Mobile GPGPU Systems
1. Computer systems organization
  1. Embedded and cyber-physical systems
    1. Embedded systems
  2. Real-time systems
2. Computing methodologies
  1. Machine learning

Published In

ACM Journal on Emerging Technologies in Computing Systems Volume 15, Issue 4

Special Issue on HALO for Energy-Constrained On-Chip Machine Learning, Part 2 and Regular Papers

October 2019

226 pages

ISSN:1550-4832

EISSN:1550-4840

DOI:10.1145/3365594

Editor:
Ramesh Karri
Polytechnic Institute of New York University, USA

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 03 October 2019

Accepted: 01 July 2019

Revised: 01 March 2019

Received: 01 July 2018

Published in JETC Volume 15, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Natural Science Foundation of China
YESS hip program

Bibliometrics

Total Citations: 6
Total Downloads: 274
Downloads (Last 12 months): 9
Downloads (Last 6 weeks): 1

Reflects downloads up to 15 Feb 2025

Citations

Cited By

Hung Y, Chen Y, Lo C, So A, Chang S. (2021). Dynamic Workload Allocation for Edge Computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 29:3 (519-529). Online publication date: Mar-2021
https://doi.org/10.1109/TVLSI.2021.3049520
Long Y, Chakraborty I, Srinivasan G, Roy K. (2021). Complexity-aware Adaptive Training and Inference for Edge-Cloud Distributed AI Systems. 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS) (573-583). Online publication date: Jul-2021
https://doi.org/10.1109/ICDCS51616.2021.00061
Tang P, Xia J, Tan Y, Tan B. (2020). Double-channel language feature mining based model for video description. Multimedia Tools and Applications 79:43-44 (33193-33213). Online publication date: 31-Aug-2020
https://dl.acm.org/doi/10.1007/s11042-020-09674-z
