Research Article | Open Access

Drift: Leveraging Distribution-based Dynamic Precision Quantization for Efficient Deep Neural Network Acceleration

Published: 07 November 2024

Abstract

Quantization is one of the most hardware-efficient ways to reduce the inference cost of deep neural network (DNN) models. Nevertheless, with the continuous growth of DNN model sizes (240× in two years) and the emergence of large language models, existing static quantization methods fail to sufficiently exploit the sparsity and redundancy of these models. Motivated by the pervasive dynamism in data tensors across DNN models, we propose a dynamic precision quantization algorithm that further reduces computational cost beyond statically quantized DNN models. Furthermore, we find that existing precision-flexible accelerators cannot support DNN models with dynamic precision. To this end, we design a novel accelerator, Drift, with online scheduling to efficiently support dynamic-precision execution. We conduct experiments on a variety of DNN models, including CNN-based and Transformer-based models. Evaluation results show that Drift achieves a 2.85× speedup and 3.12× energy saving over existing precision-flexible accelerators running statically quantized models.
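To make the idea concrete, below is a minimal, hypothetical NumPy sketch of what distribution-based dynamic precision quantization can look like: the bitwidth of each tile is chosen at run time from a simple statistic of the tile's value distribution, and symmetric linear quantization is then applied at that bitwidth. This illustrates the general concept only, not the algorithm or the Drift scheduling proposed in the paper; the candidate bitwidths, the peak-to-mean heuristic, and the helper names (choose_bitwidth, quantize_tile) are all assumptions.

    import numpy as np

    CANDIDATE_BITS = (4, 8)  # assumed candidate precisions (illustrative only)

    def choose_bitwidth(tile: np.ndarray) -> int:
        """Pick a higher precision only for tiles whose value distribution is
        heavy-tailed; well-behaved tiles fall back to the lowest precision."""
        peak = float(np.abs(tile).max())
        typical = float(np.abs(tile).mean()) + 1e-12
        # Heuristic: a large peak-to-mean ratio signals a wide dynamic range.
        return CANDIDATE_BITS[-1] if peak / typical > 8.0 else CANDIDATE_BITS[0]

    def quantize_tile(tile: np.ndarray):
        """Symmetric linear quantization at a tile-dependent bitwidth."""
        bits = choose_bitwidth(tile)
        qmax = 2 ** (bits - 1) - 1
        scale = float(np.abs(tile).max()) / qmax
        q = np.clip(np.round(tile / max(scale, 1e-12)), -qmax, qmax).astype(np.int8)
        return q, scale, bits

    # A tile with an outlier keeps 8 bits at run time; a well-behaved tile of
    # the same layer can drop to 4 bits instead of one static precision.
    rng = np.random.default_rng(0)
    outlier_tile = rng.standard_normal((64, 64)); outlier_tile[0, 0] = 20.0
    plain_tile = rng.standard_normal((64, 64))
    print(quantize_tile(outlier_tile)[2], quantize_tile(plain_tile)[2])  # 8 4

In hardware such as the precision-flexible accelerators discussed in the abstract, the chosen bitwidth would steer a precision-scalable datapath rather than a NumPy call; the sketch only shows that precision is selected from the observed distribution at run time instead of being fixed offline.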



Published In

DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference
June 2024
2159 pages
ISBN: 9798400706011
DOI: 10.1145/3649329


Publisher

Association for Computing Machinery

New York, NY, United States



Funding Sources

  • NSFC

Conference

DAC '24: 61st ACM/IEEE Design Automation Conference
June 23-27, 2024
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%
