
nnPerf: Demystifying DNN Runtime Inference Latency on Mobile Platforms

Published: 26 April 2024

ABSTRACT

We present nnPerf, a real-time on-device profiler designed to collect and analyze DNN model run-time inference latency on mobile platforms. nnPerf demystifies the hidden layers and metrics used for pursuing DNN optimizations and adaptations at the granularity of operators and kernels, ensuring that every facet contributing to a DNN model's run-time efficiency is easily accessible to mobile developers via well-defined APIs. With nnPerf, mobile developers can easily identify the bottlenecks in model run-time efficiency and optimize the model architecture to meet system-level objectives (SLOs). We implement nnPerf on the TFLite framework and evaluate its end-to-end, operator-, and kernel-level latency profiling accuracy across four mobile platforms. The results show that nnPerf achieves consistently high latency profiling accuracy on both CPU (98.12%) and GPU (99.87%). Our benchmark studies demonstrate that running nnPerf on mobile devices introduces minimal overhead to model inference: only 0.231% extra inference latency and 0.605% extra power consumption. We further present a case study showing how we leverage nnPerf to migrate OFA, a state-of-the-art neural architecture search (NAS) system, to kernel-oriented model optimization on GPUs.
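The abstract describes profiling inference latency at operator granularity to surface run-time bottlenecks. As a loose illustration of that idea only (this is not nnPerf's actual API; the function and operator names below are hypothetical stand-ins), per-operator latency profiling of a sequential model can be sketched as:

```python
import time
from collections import defaultdict

def profile_operators(ops, inputs, runs=10):
    """Time each operator in a sequential model over several runs and
    return mean per-operator latency in milliseconds."""
    totals = defaultdict(float)
    for _ in range(runs):
        x = inputs
        for name, fn in ops:
            t0 = time.perf_counter()
            x = fn(x)                     # run the operator on the current activation
            totals[name] += (time.perf_counter() - t0) * 1e3
    return {name: totals[name] / runs for name, _ in ops}

# Stand-in "operators": pure-Python work loops of very different cost,
# mimicking a heavy conv, a cheap activation, and a mid-size FC layer.
ops = [
    ("conv1", lambda x: sum(i * i for i in range(20_000)) + x),
    ("relu",  lambda x: max(x, 0)),
    ("fc",    lambda x: sum(range(5_000)) + x),
]

latencies = profile_operators(ops, inputs=1.0)
bottleneck = max(latencies, key=latencies.get)
print(f"bottleneck: {bottleneck}")        # → bottleneck: conv1
```

A real profiler like nnPerf hooks into the framework's operator dispatch (and GPU kernel events) rather than wrapping Python callables, but the aggregation pattern, repeated runs averaged into a per-operator latency table from which the bottleneck is read off, is the same.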


Published in: SenSys '23: Proceedings of the 21st ACM Conference on Embedded Networked Sensor Systems, November 2023, 574 pages. ISBN: 9798400704147. DOI: 10.1145/3625687
Copyright © 2023. Copyright is held by the owner/author(s). Publication rights licensed to ACM.


Publisher: Association for Computing Machinery, New York, NY, United States


Qualifiers: research-article

Acceptance rate: 174 of 867 submissions, 20%