ABSTRACT
The need for on-device, real-time deep learning inference is growing as deep learning on edge devices such as smartphones and robots becomes popular. Although hardware acceleration on NPUs is attracting more attention, recent mobile GPUs are fast enough to achieve real-time inference for many CNNs. In this paper, we first analyze the inference time of widely used CNNs on recent mobile GPUs and reveal that significant overhead exists in GPU kernel launches. We then identify the various factors that cause this kernel launch overhead, from which we formulate a performance model that predicts the kernel flush period yielding minimal overhead. Our experimental results show speedups of up to 64% and 31% in the inference of various CNNs with TensorFlow Lite and the ARM Compute Library on the Adreno 650 GPU and the Mali G76 GPU.
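The flush-period trade-off described in the abstract can be illustrated with a toy cost model (a hypothetical sketch of our own, not the paper's actual formulation): each flush incurs a fixed driver cost that is amortized over the kernels batched between flushes, while larger batches stall the GPU as it waits for the CPU to fill them. All function names and the numeric constants below are illustrative assumptions.

```python
# Hypothetical cost model (NOT the paper's model): total launch overhead
# for flushing every `period` kernel enqueues, assuming a fixed cost per
# flush plus a stall term that grows linearly with the batch size.

def total_overhead(period, n_kernels, flush_cost_us, stall_per_kernel_us):
    """Estimated overhead in microseconds for one inference pass."""
    n_flushes = -(-n_kernels // period)  # ceil(n_kernels / period)
    # Amortized flush cost + GPU stall while the CPU fills one batch.
    return n_flushes * flush_cost_us + stall_per_kernel_us * period

def best_flush_period(n_kernels, flush_cost_us, stall_per_kernel_us):
    """Exhaustive search for the period minimizing the modeled overhead."""
    return min(
        range(1, n_kernels + 1),
        key=lambda t: total_overhead(t, n_kernels, flush_cost_us,
                                     stall_per_kernel_us),
    )

if __name__ == "__main__":
    # Illustrative numbers only: a CNN with 120 kernels, 50 us per flush,
    # and 2 us of extra stall per kernel batched before a flush.
    t = best_flush_period(120, 50.0, 2.0)
    print(t, total_overhead(t, 120, 50.0, 2.0))
```

Under this convex model, flushing after every kernel pays the fixed flush cost 120 times, while a single flush at the end maximizes the stall term; the minimum lies in between, which mirrors why the paper searches for an optimal flush period rather than using either extreme.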
Index Terms: Minimizing GPU Kernel Launch Overhead in Deep Learning Inference on Mobile GPUs