research-article
DOI: 10.1145/3649329.3655982

TSAcc: An Efficient Tempo-Spatial Similarity Aware Accelerator for Attention Acceleration

Published: 07 November 2024

Abstract

Attention-based models deliver significant accuracy improvements in Natural Language Processing (NLP) and computer vision (CV) at the cost of heavy computational and memory demands. Previous works seek to alleviate this performance bottleneck by removing useless relations for each position. However, these attempts focus only on intra-sentence optimization and overlook opportunities in the temporal domain. In this paper, we accelerate attention by leveraging the tempo-spatial similarity across successive sentences, based on the observation that successive sentences tend to bear high similarity. This is reasonable because attention-based models process many semantically similar words (i.e., tokens). We first propose an online-offline prediction algorithm to identify similar tokens/heads. We then design a recovery algorithm so that the computation on similar tokens/heads in succeeding sentences can be skipped and their results recovered by copying the features of the corresponding tokens/heads from preceding sentences, preserving accuracy. On the hardware side, we propose a specialized architecture, TSAcc, which includes a prediction engine and a recovery engine to translate the algorithmic computation savings into real speedup. Experiments show that TSAcc achieves 8.5X, 2.7X, 14.1X, and 64.9X speedup over SpAtten, Sanger, a 1080Ti GPU, and a Xeon CPU, respectively, with negligible accuracy loss.
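To make the idea in the abstract concrete, the following is a minimal, illustrative sketch (not the authors' TSAcc implementation) of similarity-aware reuse across two successive sentences: aligned tokens whose embeddings barely change between the preceding and succeeding sentence skip the fresh attention computation and instead copy the cached result from the preceding sentence. The cosine-similarity criterion, the threshold value, the token alignment, and all function and variable names (e.g. `similarity_aware_attention`, `similarity_threshold`) are assumptions made for this example.

```python
# Illustrative sketch only: tempo-spatial reuse across two successive sentences.
# Assumes both sentences are padded/aligned to the same length; names below
# (similarity_threshold, cached_out, ...) are chosen for this example.
import torch
import torch.nn.functional as F


def attention(q, k, v):
    """Plain scaled dot-product attention for one sentence of shape (seq_len, dim)."""
    scale = q.shape[-1] ** -0.5
    weights = torch.softmax(q @ k.transpose(-1, -2) * scale, dim=-1)
    return weights @ v


def similarity_aware_attention(x_curr, x_prev, cached_out, wq, wk, wv,
                               similarity_threshold=0.95):
    """Return attention output for x_curr, reusing cached rows for tokens that
    changed little relative to the preceding sentence x_prev."""
    # "Prediction": per-token cosine similarity between the two sentences.
    sim = F.cosine_similarity(x_curr, x_prev, dim=-1)       # (seq_len,)
    reuse = sim >= similarity_threshold                      # tokens whose work can be skipped

    # Fresh attention for the current sentence (a real accelerator would gate
    # off the rows marked `reuse`; they are computed here only for brevity).
    q, k, v = x_curr @ wq, x_curr @ wk, x_curr @ wv
    fresh = attention(q, k, v)                               # (seq_len, dim)

    # "Recovery": copy the preceding sentence's features for reusable tokens.
    out = torch.where(reuse.unsqueeze(-1), cached_out, fresh)
    return out, reuse


if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, dim = 8, 16
    wq, wk, wv = (torch.randn(dim, dim) for _ in range(3))
    prev = torch.randn(seq_len, dim)
    # Succeeding sentence: most tokens identical to the preceding one, two perturbed.
    curr = prev.clone()
    curr[3] += torch.randn(dim)
    curr[6] += torch.randn(dim)

    cached = attention(prev @ wq, prev @ wk, prev @ wv)
    out, reuse = similarity_aware_attention(curr, prev, cached, wq, wk, wv)
    print("tokens reused:", int(reuse.sum()), "of", seq_len)
```

In the paper's design, the prediction engine would make the reuse decision and the recovery engine would perform the copy in hardware; in this software sketch both steps are emulated, and the "skipped" rows are still computed only to keep the example short.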

References

[1] Jochen Alber et al. 2004. Polynomial-time data reduction for dominating set. Journal of the ACM (JACM) 51, 3 (2004), 363--384.
[2] Rajeev Balasubramonian et al. 2017. CACTI 7: New tools for interconnect exploration in innovative off-chip memories. TACO 14 (2017), 1--25.
[3] Jacob Devlin et al. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[4] Tae Jun Ham et al. 2020. A³: Accelerating attention mechanisms in neural networks with approximation. In HPCA. IEEE, 328--341.
[5] Norman P. Jouppi et al. 2017. In-datacenter performance analysis of a tensor processing unit. In ISCA. 1--12.
[6] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451 (2020).
[7] Liqiang Lu et al. 2021. Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture. In MICRO.
[8] Adam Paszke et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems (2019).
[9] Zheng Qu et al. 2022. DOTA: Detect and omit weak attentions for scalable transformer acceleration. In ASPLOS. 14--26.
[10] Alec Radford et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[11] Aurko Roy et al. 2021. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics 9 (2021), 53--68.
[12] Thierry Tambe et al. 2021. EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference. In MICRO-54. 830--844.
[13] Yi Tay et al. 2020. Sparse Sinkhorn attention. In ICML.
[14] Ashish Vaswani et al. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[15] Elena Voita et al. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv (2019).
[16] Alex Wang et al. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv (2018).
[17] Hanrui Wang et al. 2021. SpAtten: Efficient sparse attention architecture with cascade token and head pruning. In HPCA. IEEE.
[18] Ali Hadi Zadeh et al. 2020. GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference. In MICRO. IEEE.

            Published In

            DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference
June 2024, 2159 pages
ISBN: 9798400706011
DOI: 10.1145/3649329

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            Published: 07 November 2024

            Qualifiers

            • Research-article

            Funding Sources

            • National Natural Science Foundation of China

            Conference

DAC '24: 61st ACM/IEEE Design Automation Conference
June 23--27, 2024
San Francisco, CA, USA

            Acceptance Rates

            Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

            Upcoming Conference

            DAC '25
            62nd ACM/IEEE Design Automation Conference
            June 22 - 26, 2025
            San Francisco , CA , USA

            Contributors

            Other Metrics

            Bibliometrics & Citations

            Bibliometrics

            Article Metrics

            • 0
              Total Citations
            • 128
              Total Downloads
            • Downloads (Last 12 months)128
            • Downloads (Last 6 weeks)36
            Reflects downloads up to 17 Feb 2025

            Other Metrics

            Citations

            View Options

            Login options

            View options

            PDF

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader

            Figures

            Tables

            Media

            Share

            Share

            Share this Publication link

            Share on social media