DOI: 10.1145/3649329.3656497
Research article

FNM-Trans: Efficient FPGA-based Transformer Architecture with Full N:M Sparsity

Published: 07 November 2024

Abstract

Transformer models have become popular in various AI applications due to their exceptional performance. However, this performance comes with significant computing and memory costs, hindering the efficient deployment of Transformer-based applications. Many solutions leverage sparsity in the weight matrices and in attention computation, but previous studies fail to exploit a unified sparse pattern to accelerate all three modules of the Transformer (QKV generation, attention computation, and FFN). In this paper, we propose FNM-Trans, an adaptable and efficient algorithm-hardware co-design that optimizes all three modules of the Transformer by fully harnessing N:M sparsity. At the algorithm level, we explore the interplay of dynamic pruning and static pruning under high N:M sparsity. At the hardware level, we develop a dedicated hardware architecture featuring a custom computing engine and a softmax module tailored to support varying levels of N:M sparsity. Experimental results show that our algorithm improves accuracy by 11.03% under 2:16 attention sparsity and 4:16 weight sparsity compared to other methods. Additionally, FNM-Trans achieves speedups of 27.13× and 21.24× over an Intel i9-9900X CPU and an NVIDIA RTX 2080 Ti GPU, respectively, and outperforms existing FPGA-based Transformer accelerators by 1.88× to 36.51×.
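
The co-design described above hinges on N:M structured sparsity: in every group of M consecutive elements, at most N are kept non-zero (e.g., 4:16 for weights, 2:16 for attention scores). The snippet below is a minimal sketch of that pattern only, assuming PyTorch and a simple magnitude-based selection rule; the helper name nm_prune is a placeholder, not the paper's pruning algorithm or hardware mapping.

import torch

def nm_prune(x: torch.Tensor, n: int, m: int) -> torch.Tensor:
    """Keep the n largest-magnitude entries in each group of m consecutive
    elements along the last dimension; zero out the rest."""
    assert x.shape[-1] % m == 0, "last dimension must be a multiple of m"
    groups = x.reshape(-1, m)                        # (num_groups, m)
    keep = groups.abs().topk(n, dim=-1).indices      # top-n indices per group
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return (groups * mask).reshape(x.shape)

# 2:16 keeps 2 of every 16 entries (~12.5% density), the attention
# sparsity level quoted in the abstract.
scores = torch.randn(8, 64)
print((nm_prune(scores, n=2, m=16) != 0).float().mean())

The group-wise regularity of the surviving non-zeros is what makes N:M sparsity hardware-friendly, and it is the kind of structure a dedicated computing engine such as the one in FNM-Trans can index cheaply.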

References

[1]
Yueyin Bai et al. 2023. LTrans-OPU: A Low-Latency FPGA-Based Overlay Processor for Transformer Networks. In FPL. 283--287.
[2]
Jialin Cao et al. 2023. PP-Transformer: Enable Efficient Deployment of Transformers Through Pattern Pruning. In ICCAD. 1--9.
[3]
Zhaodong Chen et al. 2023. Dynamic N: M fine-grained structured sparse attention mechanism. In PPoPP. 369--379.
[4]
Jacob Devlin et al. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:2020.1810.04805 (2018).
[5]
Alexey Dosovitskiy et al. 2020. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:11929 (2020).
[6]
Chao Fang et al. 2022. An Efficient Hardware Accelerator for Sparse Transformer Neural Networks. In ISCAS. IEEE, 2670--2674.
[7]
Chao Fang et al. 2022. An algorithm-hardware co-optimized framework for accelerating n: M sparse transformers. VLSI 30, 11 (2022), 1573--1586.
[8]
Tae Jun Ham et al. 2020. Aˆ 3: Accelerating attention mechanisms in neural networks with approximation. In HPCA. IEEE, 328--341.
[9]
Bingbing Li et al. 2020. Ftrans: energy-efficient acceleration of transformers using fpga. In ISLPED. 175--180.
[10]
Shiwei Liu et al. 2023. 16.2 A 28nm 53.8TOPS/W 8b Sparse Transformer Accelerator with In-Memory Butterfly Zero Skipper for Unstructured-Pruned NN and CIM-Based Local-Attention-Reusable Engine. In ISSCC. 250--252.
[11]
Liqiang Lu et al. 2021. Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture. In MICRO. 977--991.
[12]
Siyuan Lu et al. 2020. Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer. In SOCC. IEEE, 84--89.
[13]
Adam Paszke et al. 2019. Pytorch: An imperative style, high-performance deep learning library. NeurIPS 32 (2019).
[14]
Hongwu Peng et al. 2021. Accelerating transformer-based deep learning models on fpgas using column balanced block pruning. In ISQED. IEEE, 142--148.
[15]
Panjie Qi et al. 2021. Accommodating transformer onto fpga: Coupling the balanced model compression and fpga-implementation optimization. In GLSVLSI.
[16]
Yubin Qin et al. 2023. FACT: FFN-Attention Co-optimized Transformer Architecture with Eager Correlation Prediction. In ISCA. 1--14.
[17]
Guan Shen et al. 2022. SALO: an efficient spatial accelerator enabling hybrid sparse attention mechanisms for long sequences. In DAC. 571--576.
[18]
Yuhong Song et al. 2021. Dancing along battery: Enabling transformer with run-time reconfigurability on mobile devices. In DAC. IEEE, 1003--1008.
[19]
Ashish Vaswani et al. 2017. Attention is all you need. NeurIPS 30 (2017).
[20]
Alex Wang et al. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018).
[21]
Thomas Wolf et al. 2020. Transformers: State-of-the-art natural language processing. In EMNLP. 38--45.
[22]
Aojun Zhou et al. 2021. Learning n: m fine-grained structured sparse neural networks from scratch. arXiv preprint arXiv:2102.04010 (2021).

Published In

DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference
June 2024
2159 pages
ISBN:9798400706011
DOI:10.1145/3649329

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. algorithm-hardware codesign
  2. transformer
  3. FPGA

Conference

DAC '24: 61st ACM/IEEE Design Automation Conference
June 23-27, 2024
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate: 1,770 of 5,499 submissions (32%)
