Research Article | Open Access

Drift: Leveraging Distribution-based Dynamic Precision Quantization for Efficient Deep Neural Network Acceleration

Published: 07 November 2024

Abstract

Quantization is one of the most hardware-efficient ways to reduce the inference cost of deep neural network (DNN) models. Nevertheless, with the continuous growth of DNN model sizes (240× in two years) and the emergence of large language models, existing static quantization methods fail to sufficiently exploit the sparsity and redundancy of these models. Motivated by the pervasive dynamism in data tensors across DNN models, we propose a dynamic precision quantization algorithm that further reduces computational cost beyond statically quantized DNN models. Furthermore, we find that existing precision-flexible accelerators cannot support DNN models with dynamic precision. To this end, we design a novel accelerator, Drift, with online scheduling to efficiently support dynamic-precision execution. We conduct experiments on a variety of DNN models, including CNN-based and Transformer-based models. Evaluation results show that Drift achieves a 2.85× speedup and 3.12× energy saving over existing precision-flexible accelerators running statically quantized models.
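To make the idea concrete, below is a minimal, hypothetical NumPy sketch of what distribution-based dynamic precision quantization can look like: the bitwidth of each tile is chosen at run time from a simple statistic of the tile's value distribution, and symmetric linear quantization is then applied at that bitwidth. This illustrates the general concept only, not the algorithm or the Drift scheduling proposed in the paper; the candidate bitwidths, the peak-to-mean heuristic, and the helper names (choose_bitwidth, quantize_tile) are all assumptions.

    import numpy as np

    CANDIDATE_BITS = (4, 8)  # assumed candidate precisions (illustrative only)

    def choose_bitwidth(tile: np.ndarray) -> int:
        """Pick a higher precision only for tiles whose value distribution is
        heavy-tailed; well-behaved tiles fall back to the lowest precision."""
        peak = float(np.abs(tile).max())
        typical = float(np.abs(tile).mean()) + 1e-12
        # Heuristic: a large peak-to-mean ratio signals a wide dynamic range.
        return CANDIDATE_BITS[-1] if peak / typical > 8.0 else CANDIDATE_BITS[0]

    def quantize_tile(tile: np.ndarray):
        """Symmetric linear quantization at a tile-dependent bitwidth."""
        bits = choose_bitwidth(tile)
        qmax = 2 ** (bits - 1) - 1
        scale = float(np.abs(tile).max()) / qmax
        q = np.clip(np.round(tile / max(scale, 1e-12)), -qmax, qmax).astype(np.int8)
        return q, scale, bits

    # A tile with an outlier keeps 8 bits at run time; a well-behaved tile of
    # the same layer can drop to 4 bits instead of one static precision.
    rng = np.random.default_rng(0)
    outlier_tile = rng.standard_normal((64, 64)); outlier_tile[0, 0] = 20.0
    plain_tile = rng.standard_normal((64, 64))
    print(quantize_tile(outlier_tile)[2], quantize_tile(plain_tile)[2])  # 8 4

In hardware such as the precision-flexible accelerators discussed in the abstract, the chosen bitwidth would steer a precision-scalable datapath rather than a NumPy call; the sketch only shows that precision is selected from the observed distribution at run time instead of being fixed offline.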



Published In

DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference
June 2024
2159 pages
ISBN: 9798400706011
DOI: 10.1145/3649329


Publisher

Association for Computing Machinery

New York, NY, United States



Funding Sources

  • NSFC

Conference

DAC '24: 61st ACM/IEEE Design Automation Conference
June 23-27, 2024
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%
