DOI: 10.1145/3446382.3448606

Minimizing GPU Kernel Launch Overhead in Deep Learning Inference on Mobile GPUs

Published: 24 February 2021

ABSTRACT

The need for on-device real-time deep learning inference is increasing as deep learning on edge devices such as smartphones and robots becomes popular. Although hardware acceleration on NPUs is attracting more attention, recent mobile GPUs are fast enough to offer the potential for real-time inference of many CNNs. In this paper, we first analyze the inference time of widely used CNNs on recent mobile GPUs and reveal that significant overhead exists for GPU kernel launches. We then identify the various factors that cause this kernel launch overhead, from which we formulate a performance model that predicts the kernel flush period leading to the minimal overhead. Our experimental results show speedups of up to 64% and 31% in the inference of various CNNs with TensorFlow Lite and the ARM Compute Library on the Adreno 650 and Mali G76 GPUs.
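The trade-off behind the flush period can be illustrated with a minimal sketch. The following is a hypothetical toy cost model, not the paper's actual performance model: it assumes each kernel enqueue costs `t_launch` on the CPU, each queue flush costs `t_flush`, each kernel runs for `t_exec` on the GPU, and that the GPU starts only after the first batch is flushed while later enqueues overlap with execution. All parameter names and numbers are illustrative assumptions.

```python
import math

def total_time(n_kernels, flush_period, t_launch, t_flush, t_exec):
    """Hypothetical cost model (not the paper's formulation).

    Kernels are enqueued on the CPU and the command queue is flushed
    every `flush_period` kernels. The GPU is idle until the first batch
    has been enqueued and flushed; afterwards, enqueueing overlaps with
    GPU execution, but every flush still pays `t_flush`.
    """
    n_batches = math.ceil(n_kernels / flush_period)
    # CPU-side delay before the GPU sees any work: first batch + one flush
    startup = min(flush_period, n_kernels) * t_launch + t_flush
    # GPU execution of all kernels, plus flush overhead per batch
    return startup + n_kernels * t_exec + n_batches * t_flush

def optimal_flush_period(n_kernels, t_launch, t_flush, t_exec):
    """Scan all candidate periods and return the one minimizing total time."""
    return min(range(1, n_kernels + 1),
               key=lambda k: total_time(n_kernels, k, t_launch, t_flush, t_exec))

if __name__ == "__main__":
    # Illustrative numbers (microseconds), not measured values.
    best = optimal_flush_period(n_kernels=120, t_launch=5.0, t_flush=50.0, t_exec=30.0)
    print(best)
```

Under this toy model, flushing after every kernel multiplies the per-flush cost, while flushing only once at the end serializes all enqueues before any GPU work; the minimum lies at an interior period, which is the qualitative behavior the paper's model exploits.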


Published in

HotMobile '21: Proceedings of the 22nd International Workshop on Mobile Computing Systems and Applications
February 2021, 192 pages
ISBN: 978-1-4503-8323-3
DOI: 10.1145/3446382

        Copyright © 2021 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Qualifiers

        • research-article
        • Research
        • Refereed limited

        Acceptance Rates

Overall acceptance rate: 96 of 345 submissions, 28%
