DOI: 10.1145/3649329.3656497
Research article

FNM-Trans: Efficient FPGA-based Transformer Architecture with Full N:M Sparsity

Published: 07 November 2024

Abstract

Transformer models have become popular in various AI applications due to their exceptional performance. However, this performance comes with significant computing and memory costs, hindering the efficient deployment of Transformer-based applications. Many solutions leverage sparsity in the weight matrices and in attention computation, but previous studies fail to exploit a unified sparse pattern to accelerate all three modules of the Transformer (QKV generation, attention computation, and FFN). In this paper, we propose FNM-Trans, an adaptable and efficient algorithm-hardware co-design that optimizes all three modules of the Transformer by fully harnessing N:M sparsity. At the algorithm level, we explore the interplay of dynamic pruning and static pruning under high N:M sparsity. At the hardware level, we develop a dedicated hardware architecture featuring a custom computing engine and a softmax module tailored to support varying levels of N:M sparsity. Experimental results show that our algorithm improves accuracy by 11.03% under 2:16 attention sparsity and 4:16 weight sparsity compared to other methods. Additionally, FNM-Trans achieves speedups of 27.13× and 21.24× over an Intel i9-9900X CPU and an NVIDIA RTX 2080 Ti GPU, respectively, and outperforms existing FPGA-based Transformer accelerators by 1.88× to 36.51×.
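
The co-design described above hinges on N:M structured sparsity: in every group of M consecutive elements, at most N are kept non-zero (e.g., 4:16 for weights, 2:16 for attention scores). The snippet below is a minimal sketch of that pattern only, assuming PyTorch and a simple magnitude-based selection rule; the helper name nm_prune is a placeholder, not the paper's pruning algorithm or hardware mapping.

import torch

def nm_prune(x: torch.Tensor, n: int, m: int) -> torch.Tensor:
    """Keep the n largest-magnitude entries in each group of m consecutive
    elements along the last dimension; zero out the rest."""
    assert x.shape[-1] % m == 0, "last dimension must be a multiple of m"
    groups = x.reshape(-1, m)                        # (num_groups, m)
    keep = groups.abs().topk(n, dim=-1).indices      # top-n indices per group
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return (groups * mask).reshape(x.shape)

# 2:16 keeps 2 of every 16 entries (~12.5% density), the attention
# sparsity level quoted in the abstract.
scores = torch.randn(8, 64)
print((nm_prune(scores, n=2, m=16) != 0).float().mean())

The group-wise regularity of the surviving non-zeros is what makes N:M sparsity hardware-friendly, and it is the kind of structure a dedicated computing engine such as the one in FNM-Trans can index cheaply.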

References

[1]
Yueyin Bai et al. 2023. LTrans-OPU: A Low-Latency FPGA-Based Overlay Processor for Transformer Networks. In FPL. 283--287.
[2]
Jialin Cao et al. 2023. PP-Transformer: Enable Efficient Deployment of Transformers Through Pattern Pruning. In ICCAD. 1--9.
[3]
Zhaodong Chen et al. 2023. Dynamic N: M fine-grained structured sparse attention mechanism. In PPoPP. 369--379.
[4]
Jacob Devlin et al. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:2020.1810.04805 (2018).
[5]
Alexey Dosovitskiy et al. 2020. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:11929 (2020).
[6]
Chao Fang et al. 2022. An Efficient Hardware Accelerator for Sparse Transformer Neural Networks. In ISCAS. IEEE, 2670--2674.
[7]
Chao Fang et al. 2022. An algorithm-hardware co-optimized framework for accelerating n: M sparse transformers. VLSI 30, 11 (2022), 1573--1586.
[8]
Tae Jun Ham et al. 2020. Aˆ 3: Accelerating attention mechanisms in neural networks with approximation. In HPCA. IEEE, 328--341.
[9]
Bingbing Li et al. 2020. Ftrans: energy-efficient acceleration of transformers using fpga. In ISLPED. 175--180.
[10]
Shiwei Liu et al. 2023. 16.2 A 28nm 53.8TOPS/W 8b Sparse Transformer Accelerator with In-Memory Butterfly Zero Skipper for Unstructured-Pruned NN and CIM-Based Local-Attention-Reusable Engine. In ISSCC. 250--252.
[11]
Liqiang Lu et al. 2021. Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture. In MICRO. 977--991.
[12]
Siyuan Lu et al. 2020. Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer. In SOCC. IEEE, 84--89.
[13]
Adam Paszke et al. 2019. Pytorch: An imperative style, high-performance deep learning library. NeurIPS 32 (2019).
[14]
Hongwu Peng et al. 2021. Accelerating transformer-based deep learning models on fpgas using column balanced block pruning. In ISQED. IEEE, 142--148.
[15]
Panjie Qi et al. 2021. Accommodating transformer onto fpga: Coupling the balanced model compression and fpga-implementation optimization. In GLSVLSI.
[16]
Yubin Qin et al. 2023. FACT: FFN-Attention Co-optimized Transformer Architecture with Eager Correlation Prediction. In ISCA. 1--14.
[17]
Guan Shen et al. 2022. SALO: an efficient spatial accelerator enabling hybrid sparse attention mechanisms for long sequences. In DAC. 571--576.
[18]
Yuhong Song et al. 2021. Dancing along battery: Enabling transformer with run-time reconfigurability on mobile devices. In DAC. IEEE, 1003--1008.
[19]
Ashish Vaswani et al. 2017. Attention is all you need. NeurIPS 30 (2017).
[20]
Alex Wang et al. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018).
[21]
Thomas Wolf et al. 2020. Transformers: State-of-the-art natural language processing. In EMNLP. 38--45.
[22]
Aojun Zhou et al. 2021. Learning n: m fine-grained structured sparse neural networks from scratch. arXiv preprint arXiv:2102.04010 (2021).

Published In

DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference
June 2024
2159 pages
ISBN:9798400706011
DOI:10.1145/3649329

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. algorithm-hardware codesign
  2. transformer
  3. FPGA

Conference

DAC '24: 61st ACM/IEEE Design Automation Conference
June 23-27, 2024
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate: 1,770 of 5,499 submissions (32%)
