
Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow


Abstract:

The transformer model has demonstrated outstanding performance in the field of artificial intelligence. However, its remarkable performance comes at the cost of substantial computational complexity, which limits deploying transformers from the cloud to the edge due to power and throughput constraints. There are two main challenges in designing a transformer accelerator for practical tasks. First, a transformer has inconsistent bottlenecks as the input length changes: for short inputs, such as using the vision transformer (ViT) for ImageNet or bidirectional encoder representations from transformers (BERT) for the general language understanding evaluation (GLUE) benchmark, the linear layers of the model become the computational bottleneck. In contrast, for long inputs, such as high-resolution images or long-text tasks, attention computation becomes the bottleneck. Second, even for a given input length, different layers in the model exhibit varying computational characteristics and workloads, such as matrix sizes and data reuse strategies. This article introduces Ayaka, a versatile transformer accelerator designed to address these issues. Ayaka uses a cross-layer sparse prediction approach based on random projection (RP), enabling simultaneous sparsification of the attention computation and the linear layers and thereby improving throughput for the different bottlenecks that arise at different input lengths. Furthermore, Ayaka optimizes the sparse attention computation by leveraging the input translation invariance of softmax. In addition, Ayaka features a heterogeneous dataflow processing element (HDPE) design that dynamically adjusts the stationary matrix operand based on the current computation to maximize on-chip data reuse and reduce memory footprint. With these features, Ayaka is, to date, the first accelerator to accelerate the whole attention layer. Evaluation on 12 typical models and tasks shows that it achieves a peak energy efficiency of 49.7 TOPS/W, which is 1.20–258.9× higher than the state-of-the-art…
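To make the two core ideas of the abstract concrete, the following is a minimal NumPy sketch (not the paper's actual hardware algorithm or dataflow): a random projection gives a cheap low-rank estimate of the attention scores, a per-query top-k of those estimates selects which scores to compute exactly, and the softmax is shifted by its maximum, which is safe because softmax is invariant to adding a constant to its input. The function name rp_sparse_attention, the keep_ratio parameter, and the per-row top-k selection policy are illustrative assumptions, not details taken from the paper.

import numpy as np

def rp_sparse_attention(Q, K, V, r=16, keep_ratio=0.25, rng=None):
    """Illustrative sketch: RP-based score estimation + sparse softmax attention.

    Q, K, V: (n, d) float matrices for one attention head.
    A random projection R (d x r, with r << d) yields low-rank estimates of
    Q K^T; only the columns with the largest estimated scores are computed
    exactly for each query row.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = Q.shape
    R = rng.standard_normal((d, r)) / np.sqrt(r)   # random projection (assumed JL-style scaling)

    # Cheap estimate of Q K^T in the r-dimensional sketch space: O(n^2 r) vs. O(n^2 d).
    est = (Q @ R) @ (K @ R).T                      # (n, n) estimated scores

    # Keep only the top-k estimated scores per query row (sparsity prediction).
    k = max(1, int(keep_ratio * n))
    keep = np.argsort(-est, axis=1)[:, :k]

    out = np.zeros_like(Q)
    for i in range(n):
        idx = keep[i]
        s = Q[i] @ K[idx].T / np.sqrt(d)           # exact scores, selected columns only
        # softmax(s + c) = softmax(s) for any constant c, so subtracting the
        # row maximum changes nothing numerically except improving stability.
        s = s - s.max()
        p = np.exp(s)
        p /= p.sum()
        out[i] = p @ V[idx]                        # sparse weighted sum of values
    return out

The same shift invariance is what lets an accelerator offset softmax inputs (for example, by predicted or running maxima) without changing the attention output; the sketch above only demonstrates the mathematical property, not Ayaka's hardware implementation of it.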
Published in: IEEE Journal of Solid-State Circuits (Volume: 59, Issue: 10, October 2024)
Page(s): 3342 - 3356
Date of Publication: 14 May 2024