Abstract:
Transformer-based language models have recently gained popularity in numerous natural language processing (NLP) applications owing to their superior performance over traditional algorithms. These models involve two execution stages: summarization and generation. The generation stage accounts for a significant portion of the total execution time due to its auto-regressive property, which necessitates considerable and repetitive off-chip accesses. Consequently, our objective is to minimize off-chip accesses during the generation stage to expedite transformer execution. To achieve this goal, we propose a token-adaptive early exit (ToEx) scheme that generates output tokens using fewer decoders, thereby reducing the off-chip accesses required for loading weight parameters. Although our approach has the potential to minimize data communication, it introduces two challenges: 1) inaccurate self-attention computation, and 2) significant overhead in the exit decision. To overcome these challenges, we introduce a methodology that enables accurate self-attention by lazily performing computations for previously exited tokens. Moreover, we mitigate the overhead of the exit decision by incorporating a lightweight output embedding layer. We also present a hardware design to efficiently support the proposed work. Evaluation results demonstrate that our work reduces the number of executed decoders by 2.6× on average. Accordingly, it achieves a 3.2× average speedup compared to transformer execution without our work.
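To make the early-exit mechanism concrete, below is a minimal PyTorch sketch of a token-adaptive early-exit decoder stack with a lightweight output-embedding exit head. All names (EarlyExitDecoder, exit_head), the confidence-threshold exit criterion, and the shapes are illustrative assumptions rather than the paper's exact design, and the sketch omits the lazy self-attention handling for previously exited tokens described in the abstract.

```python
import torch
import torch.nn as nn

class EarlyExitDecoder(nn.Module):
    """Hypothetical sketch: each decoder layer may terminate generation early
    for the current token, skipping the remaining layers (and their weight
    loads from off-chip memory)."""

    def __init__(self, num_layers, d_model, vocab_size, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        # Lightweight output embedding shared across layers: projecting the
        # hidden state to vocabulary logits lets every layer cheaply test
        # whether the prediction is already confident enough to exit.
        self.exit_head = nn.Linear(d_model, vocab_size)
        self.threshold = threshold  # assumed confidence-based exit criterion

    def forward(self, x, memory):
        for depth, layer in enumerate(self.layers):
            x = layer(x, memory)
            # Exit decision on the newest token (assumes batch size 1).
            probs = torch.softmax(self.exit_head(x[:, -1]), dim=-1)
            if probs.max() >= self.threshold:
                return self.exit_head(x[:, -1]), depth + 1
        return self.exit_head(x[:, -1]), len(self.layers)

# Example usage with made-up shapes: one generation step.
model = EarlyExitDecoder(num_layers=24, d_model=512, vocab_size=32000)
tgt = torch.randn(1, 10, 512)     # embeddings of tokens generated so far
memory = torch.randn(1, 16, 512)  # output of the summarization stage
logits, decoders_used = model(tgt, memory)
```

In this sketch, a token that exits at a shallow layer never produces keys and values for the deeper layers; the paper's lazy-computation methodology addresses exactly this gap so that later tokens can still attend to exited ones accurately.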
Published in: IEEE Transactions on Computers (Volume 73, Issue 9, September 2024)