Research Article | Open Access | DOI: 10.1145/3649329.3655907

MERSIT: A Hardware-Efficient 8-bit Data Format with Enhanced Post-Training Quantization DNN Accuracy

Published: 07 November 2024

Abstract

Post-training quantization (PTQ) models that use conventional 8-bit integer or floating-point formats still exhibit significant accuracy drops on modern deep neural networks (DNNs), rendering them unreliable. This paper presents MERSIT, a novel 8-bit PTQ data format designed for a wide range of DNNs. While leveraging the dynamic allocation of exponent and fraction bits derived from the Posit data format, MERSIT achieves better hardware efficiency through the proposed merged decoding scheme. Our evaluation indicates that MERSIT yields more reliable 8-bit PTQ models, with superior accuracy across various DNNs compared to conventional floating-point formats. Furthermore, the proposed processing unit saves 26.6% in area and 22.2% in power compared to a Posit-based unit, while maintaining efficiency comparable to a floating-point-based unit.
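Although the abstract does not describe the merged decoding scheme itself, the Posit-style field layout that MERSIT builds on can be illustrated with a short sketch. The Python snippet below decodes a generic n-bit posit word into sign, regime, exponent, and fraction fields, showing how the exponent and fraction widths shrink or grow with the regime length (the "dynamic allocation" the abstract refers to). The bit width, the es parameter, and the function name are illustrative assumptions; this is a generic posit decoder, not the MERSIT format.

```python
def decode_posit(bits: int, nbits: int = 8, es: int = 1) -> float:
    """Decode an nbits-wide posit word (sign | regime | exponent | fraction).

    Illustrative only: a generic posit decoder, not MERSIT's merged
    decoding scheme. 'es' (max exponent bits) is a free parameter here.
    """
    mask = (1 << nbits) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (nbits - 1):
        return float("nan")  # NaR ("Not a Real")

    sign = bits >> (nbits - 1)
    if sign:
        bits = (-bits) & mask  # negative posits: decode the 2's complement

    # Bits after the sign, MSB first.
    body = [(bits >> i) & 1 for i in range(nbits - 2, -1, -1)]

    # Regime: a run of identical bits, terminated by the opposite bit.
    first = body[0]
    run = 1
    while run < len(body) and body[run] == first:
        run += 1
    k = (run - 1) if first == 1 else -run

    # Whatever remains (possibly nothing) is split into exponent and fraction,
    # so a longer regime leaves fewer fraction bits.
    rest = body[run + 1:]            # skip the regime-terminating bit
    exp_bits = rest[:es]
    frac_bits = rest[es:]

    exponent = 0
    for b in exp_bits:
        exponent = (exponent << 1) | b
    exponent <<= es - len(exp_bits)  # truncated exponent bits count as zeros

    fraction = 1.0                   # hidden leading 1
    for i, b in enumerate(frac_bits):
        fraction += b * 2.0 ** -(i + 1)

    useed = 2 ** (2 ** es)
    value = (useed ** k) * (2 ** exponent) * fraction
    return -value if sign else value


# A longer regime trades fraction bits for dynamic range:
assert decode_posit(0b01000000) == 1.0
assert decode_posit(0b01100000) == 4.0
assert decode_posit(0b11000000) == -1.0
```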



Published In

DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference
June 2024
2159 pages
ISBN:9798400706011
DOI:10.1145/3649329
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States



Funding Sources

  • National Research Foundation of Korea (NRF)
  • Institute of Information and Communications Technology Planning and Evaluation (IITP)

Conference

DAC '24: 61st ACM/IEEE Design Automation Conference
June 23 - 27, 2024
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,654 of 5,209 submissions, 32%
