MV-Net: Toward Real-Time Deep Learning on Mobile GPGPU Systems

Published: 03 October 2019

Abstract

Recently, the development of deep learning has propelled rapid growth in vision and speech applications on lightweight embedded and mobile systems. However, the limited computation resources and power delivery capability of embedded platforms are recognized as a significant bottleneck that prevents these systems from providing real-time deep learning, since inference with deep convolutional neural networks (CNNs) and recurrent neural networks (RNNs) involves large numbers of weights and operations. In particular, providing quality-of-service (QoS)-guaranteed neural network inference in the multitask execution environment of multicore SoCs is even more complicated because of resource contention. In this article, we present a novel deep neural network architecture, MV-Net, which provides performance elasticity and contention-aware self-scheduling for QoS enhancement in mobile computing systems. When the QoS constraints, the output accuracy requirements, or the resource contention status of the system change, MV-Net can dynamically reconfigure the corresponding neural network propagation paths and thus achieve an effective tradeoff between neural network computational complexity and prediction accuracy via approximate computing. The experimental results show that (1) MV-Net significantly improves the performance flexibility of current CNN models and makes it possible to provide always-guaranteed QoS in a multitask environment, and (2) it satisfies the quality-of-results (QoR) requirement, significantly outperforms the baseline implementation, and improves system energy efficiency at the same time.
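The tradeoff described in the abstract can be illustrated with a minimal sketch. It is not MV-Net's actual scheduler, which the abstract does not specify. It assumes each propagation path has been profiled offline for latency and accuracy, and that contention is summarized by a single multiplicative slowdown factor. The names `PathVariant`, `select_path`, and `contention_factor` are hypothetical, as are all numbers in the usage lines.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PathVariant:
    """One propagation path through the network (hypothetical offline profile)."""
    name: str
    base_latency_ms: float  # assumed latency with no resource contention
    accuracy: float         # assumed accuracy of this path, in [0, 1]


def select_path(
    variants: Sequence[PathVariant],
    deadline_ms: float,
    contention_factor: float,
    min_accuracy: float = 0.0,
) -> Optional[PathVariant]:
    """Pick the most accurate path that meets both the deadline and the accuracy floor.

    contention_factor >= 1.0 is an assumed slowdown caused by co-running tasks.
    Returns None when no path satisfies both constraints.
    """
    feasible = [
        v for v in variants
        if v.base_latency_ms * contention_factor <= deadline_ms
        and v.accuracy >= min_accuracy
    ]
    return max(feasible, key=lambda v: v.accuracy, default=None)


# Illustrative profiles only; these values are not taken from the article.
variants = [
    PathVariant("full", 40.0, 0.76),
    PathVariant("medium", 25.0, 0.72),
    PathVariant("light", 12.0, 0.65),
]
idle_choice = select_path(variants, deadline_ms=33.0, contention_factor=1.0)
busy_choice = select_path(variants, deadline_ms=33.0, contention_factor=1.5)
```

With these made-up profiles, the idle system selects the `medium` path and the contended system falls back to `light`, giving up accuracy to keep the deadline. Returning `None` makes the infeasible case explicit, so a caller can decide whether to relax the QoS or the QoR constraint.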

Published In

ACM Journal on Emerging Technologies in Computing Systems, Volume 15, Issue 4
Special Issue on HALO for Energy-Constrained On-Chip Machine Learning, Part 2 and Regular Papers
October 2019
226 pages
ISSN:1550-4832
EISSN:1550-4840
DOI:10.1145/3365594
Editor: Ramesh Karri
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 October 2019
Accepted: 01 July 2019
Revised: 01 March 2019
Received: 01 July 2018
Published in JETC Volume 15, Issue 4

Author Tags

  1. Edge computing
  2. approximate computing
  3. deep learning
  4. energy efficiency
  5. online scheduling

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • YESS hip program


Cited By

  • (2023) Deep sequential collaborative cognition of vision and language based model for video description. Multimedia Tools and Applications 82:23 (36207-36230). DOI: 10.1007/s11042-023-14887-z. Online publication date: 1-Sep-2023
  • (2022) Towards Safe and Efficient Modular Path Planning using Twin Delayed DDPG. 2022 IEEE 95th Vehicular Technology Conference (VTC2022-Spring) (1-7). DOI: 10.1109/VTC2022-Spring54318.2022.9860536. Online publication date: Jun-2022
  • (2022) Supervised-Reinforcement Learning (SRL) Approach for Efficient Modular Path Planning. 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC) (3537-3542). DOI: 10.1109/ITSC55140.2022.9922495. Online publication date: 8-Oct-2022
  • (2021) Dynamic Workload Allocation for Edge Computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 29:3 (519-529). DOI: 10.1109/TVLSI.2021.3049520. Online publication date: Mar-2021
  • (2021) Complexity-aware Adaptive Training and Inference for Edge-Cloud Distributed AI Systems. 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS) (573-583). DOI: 10.1109/ICDCS51616.2021.00061. Online publication date: Jul-2021
  • (2020) Double-channel language feature mining based model for video description. Multimedia Tools and Applications 79:43-44 (33193-33213). DOI: 10.1007/s11042-020-09674-z. Online publication date: 31-Aug-2020
