ABSTRACT
We present nnPerf, a real-time on-device profiler designed to collect and analyze DNN model run-time inference latency on mobile platforms. nnPerf demystifies the hidden layers and metrics used in DNN optimization and adaptation at the granularity of operators and kernels, ensuring that every facet contributing to a DNN model's run-time efficiency is easily accessible to mobile developers via well-defined APIs. With nnPerf, mobile developers can readily identify bottlenecks in model run-time efficiency and optimize the model architecture to meet service-level objectives (SLOs). We implement nnPerf on the TFLite framework and evaluate its end-to-end, operator-level, and kernel-level latency profiling accuracy across four mobile platforms. The results show that nnPerf achieves consistently high latency profiling accuracy on both CPUs (98.12%) and GPUs (99.87%). Our benchmark studies demonstrate that running nnPerf on mobile devices adds minimal overhead to model inference: only 0.231% extra inference latency and 0.605% extra power consumption. We further present a case study showing how we leverage nnPerf to migrate OFA, a state-of-the-art neural architecture search (NAS) system, to kernel-oriented model optimization on GPUs.
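nnPerf's own APIs are not reproduced in this abstract, but the core idea it describes — decomposing a model's end-to-end inference latency into per-operator latencies so the bottleneck operator can be identified — can be sketched generically. The following minimal Python sketch (the `profile_operators` harness and the toy three-stage "model" are illustrative assumptions, not nnPerf's actual interface) times each stage of a sequential pipeline with a wall-clock timer:

```python
import time

def profile_operators(ops, x):
    """Run a sequence of named 'operators' (callables) on input x,
    recording each operator's wall-clock latency in milliseconds.
    A generic stand-in for the operator-level timeline a profiler
    like nnPerf reconstructs during one inference pass."""
    records = []
    for name, op in ops:
        start = time.perf_counter()
        x = op(x)
        records.append((name, (time.perf_counter() - start) * 1e3))
    return x, records

# Toy "model": three stages standing in for conv / activation / pooling.
ops = [
    ("conv", lambda v: [w * 2 for w in v]),
    ("relu", lambda v: [max(w, 0) for w in v]),
    ("pool", lambda v: [max(v)]),
]
out, records = profile_operators(ops, [-1.0, 2.0, 3.0])

# End-to-end latency is the sum of per-operator latencies;
# the bottleneck is the operator with the largest share.
e2e_ms = sum(ms for _, ms in records)
bottleneck = max(records, key=lambda r: r[1])[0]
print(out)  # [6.0]
```

On a real mobile GPU backend, per-kernel timing would instead come from the driver's event timestamps (e.g., OpenCL profiling events) rather than host-side timers, since kernels execute asynchronously; the decomposition logic, however, is the same.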
Index Terms: nnPerf: Demystifying DNN Runtime Inference Latency on Mobile Platforms