ABSTRACT
We present nnPerf, a real-time on-device profiler designed to collect and analyze DNN model run-time inference latency on mobile platforms. nnPerf demystifies the hidden layers and metrics used in DNN optimization and adaptation at the granularity of operators and kernels, ensuring that every facet contributing to a DNN model's run-time efficiency is easily accessible to mobile developers via well-defined APIs. With nnPerf, mobile developers can readily identify bottlenecks in model run-time efficiency and optimize the model architecture to meet service-level objectives (SLOs). We implement nnPerf on the TFLite framework and evaluate its end-to-end, operator-level, and kernel-level latency profiling accuracy across four mobile platforms. The results show that nnPerf achieves consistently high latency profiling accuracy on both CPUs (98.12%) and GPUs (99.87%). Our benchmark studies demonstrate that running nnPerf on mobile devices adds minimal overhead to model inference: only 0.231% extra inference latency and 0.605% extra power consumption. We further present a case study showing how we leverage nnPerf to migrate OFA, a state-of-the-art neural architecture search (NAS) system, to kernel-oriented model optimization on GPUs.
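nnPerf's own APIs are not reproduced in this abstract, but the core idea it describes — decomposing a model's end-to-end inference latency into per-operator latencies so the bottleneck operator can be identified — can be sketched generically. The following minimal Python sketch (the `profile_operators` harness and the toy three-stage "model" are illustrative assumptions, not nnPerf's actual interface) times each stage of a sequential pipeline with a wall-clock timer:

```python
import time

def profile_operators(ops, x):
    """Run a sequence of named 'operators' (callables) on input x,
    recording each operator's wall-clock latency in milliseconds.
    A generic stand-in for the operator-level timeline a profiler
    like nnPerf reconstructs during one inference pass."""
    records = []
    for name, op in ops:
        start = time.perf_counter()
        x = op(x)
        records.append((name, (time.perf_counter() - start) * 1e3))
    return x, records

# Toy "model": three stages standing in for conv / activation / pooling.
ops = [
    ("conv", lambda v: [w * 2 for w in v]),
    ("relu", lambda v: [max(w, 0) for w in v]),
    ("pool", lambda v: [max(v)]),
]
out, records = profile_operators(ops, [-1.0, 2.0, 3.0])

# End-to-end latency is the sum of per-operator latencies;
# the bottleneck is the operator with the largest share.
e2e_ms = sum(ms for _, ms in records)
bottleneck = max(records, key=lambda r: r[1])[0]
print(out)  # [6.0]
```

On a real mobile GPU backend, per-kernel timing would instead come from the driver's event timestamps (e.g., OpenCL profiling events) rather than host-side timers, since kernels execute asynchronously; the decomposition logic, however, is the same.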
Index Terms: nnPerf: Demystifying DNN Runtime Inference Latency on Mobile Platforms