Research Article | Open Access
DOI: 10.1145/3665314.3670841

PiQi: Partially Quantized DNN Inference on HMPSoCs

Published: 09 September 2024

Abstract

Deep Neural Network (DNN) inference is now ubiquitous in embedded applications at the edge. State-of-the-art Heterogeneous Multi-Processor Systems-on-Chip (HMPSoCs) powering these applications come equipped with powerful Neural Processing Units (NPUs) that significantly outperform the other inference-capable HMPSoC components, namely the CPUs and GPUs, in both power consumption and performance. However, CPUs and GPUs can perform full-precision inference, whereas NPUs can often only perform quantized inference. Consequently, the low-latency, low-power inference an NPU offers comes at an accuracy loss due to quantization.
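The accuracy loss the abstract attributes to NPU execution stems from the rounding and clipping error of low-precision quantization. As a minimal, illustrative sketch (not taken from the paper; the weight values are made up), symmetric int8 post-training quantization of a weight tensor and its reconstruction error look like:

```python
# Illustrative only: symmetric int8 quantization of a small weight tensor,
# showing the bounded rounding error that accumulates into accuracy loss.

def quantize_int8(weights):
    """Map floats to int8 codes with a single symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [x * scale for x in q]

weights = [0.31, -1.27, 0.005, 0.9981]   # fabricated example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# For values inside the clipping range, max_err is bounded by scale / 2.
```

Each layer's weight distribution sets its own scale, which is one reason (per the abstract) that different layers suffer different accuracy losses when quantized.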
DNNs consist of several heterogeneous layers. Here, we introduce the PiQi framework, which allows DNN inference to switch layer-wise between the three inference-capable HMPSoC components (CPU, GPU, and NPU) mid-inference with minimal overhead. PiQi thereby realizes the novel idea of partially quantized DNN inference on HMPSoCs. However, different DNN layers experience different power-performance gains and incur different accuracy losses under quantization. Therefore, PiQi includes a multi-objective Genetic Algorithm (GA) that produces a power-performance Pareto front under an accuracy constraint through selective multi-layer quantization during inference. Additionally, PiQi uses a neural network that predicts accuracy when assigning DNN layers to cores, expediting the search.
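The search problem described above can be sketched in simplified form: each gene assigns one layer to a core, candidates violating the accuracy budget are discarded, and survivors are reduced to a latency/power Pareto front. This is a stand-in for PiQi's NSGA-II-style GA, not the paper's implementation, and every per-layer cost number below is fabricated for illustration.

```python
# Hypothetical sketch of layer-to-core assignment search with an
# accuracy constraint and a two-objective Pareto front. All costs fabricated.
import random

CORES = ("CPU", "GPU", "NPU")
N_LAYERS = 6
random.seed(0)
# Per (layer, core): (latency_ms, power_mw, accuracy_drop). NPU is the only
# quantized core here, so only it contributes an accuracy drop.
COST = {
    (l, c): (random.uniform(1, 10),
             random.uniform(50, 500),
             0.4 if c == "NPU" else 0.0)
    for l in range(N_LAYERS) for c in CORES
}

def evaluate(assign):
    """Sum per-layer latency, power, and accuracy drop for one assignment."""
    lat = sum(COST[(l, c)][0] for l, c in enumerate(assign))
    pwr = sum(COST[(l, c)][1] for l, c in enumerate(assign))
    drop = sum(COST[(l, c)][2] for l, c in enumerate(assign))
    return lat, pwr, drop

def pareto_front(points):
    """Keep (latency, power) pairs not dominated by any other point."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in points)]

MAX_DROP = 1.0  # accuracy-loss budget: at most two NPU layers in this toy
population = [tuple(random.choice(CORES) for _ in range(N_LAYERS))
              for _ in range(200)]
feasible = [evaluate(a)[:2] for a in population if evaluate(a)[2] <= MAX_DROP]
front = pareto_front(feasible)
```

A real GA would evolve the population with crossover and mutation across generations, and PiQi additionally replaces the expensive accuracy evaluation with a learned predictor; this sketch only shows the constrained Pareto-filtering step.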


Cited By

  • (2024) Education Abstract: Design Space Exploration for Deep Learning at the Edge. In 2024 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pp. 1-2. DOI: 10.1109/CASES60062.2024.00006. Online publication date: 29 September 2024.

Published In

ISLPED '24: Proceedings of the 29th ACM/IEEE International Symposium on Low Power Electronics and Design
August 2024, 384 pages
ISBN: 9798400706882
DOI: 10.1145/3665314
This work is licensed under a Creative Commons Attribution 4.0 International License.
Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. edge artificial intelligence (edge-AI)
  2. low-power design (LPD)
  3. partial quantization
  4. neural processing unit (NPU)

Conference

ISLPED '24

Acceptance Rates

Overall Acceptance Rate 398 of 1,159 submissions, 34%

Article Metrics

  • Downloads (last 12 months): 156
  • Downloads (last 6 weeks): 32
Reflects downloads up to 18 February 2025.
