Abstract:
Transformer models show state-of-the-art results in natural language processing and computer vision, leveraging a multi-headed self-attention mechanism. In each head, the operation is defined as $\text{Attn}=\text{Softmax}(Q \cdot K^{\top}) \cdot V$, where $Q = X \cdot W_Q$, $K = X \cdot W_K$ and $V = X \cdot W_V$ are linear transformations for Query (Q), Key (K) and Value (V) with weights $W_Q$, $W_K$ and $W_V$, respectively. $Q \cdot K^{\top}$ is responsible for learning relevance scores between tokens (X). Previous CIM chips face new challenges within and across attention heads (Fig. 1): (1) Computing-in-memory (CIM) shows great advantages only when pre-trained weights stay fixed. However, since Q and K are both generated at runtime, loading K into CIM macros consumes more energy in the computation of $Q \cdot K^{\top}$. (2) A CIM macro performs multiply-accumulate (MAC) operations with bit-serial inputs, so the latency is determined by input precision. The attention scores are normalized by Softmax into probabilities (P), and most elements of P are close to or exactly zero. For the $P \cdot V$ of BERT-T, only 10% of the elements have a large effective bit-width (EBW), yet they exacerbate the computing latency of the remaining 90% of inputs. (3) On-chip SRAM-CIM cannot hold the weights of all heads. After completing the computation on the stored weights, all macros need to reload new weights, causing a significant performance loss. This work designs a CIM processor for Transformers, called CIMFormer, with three features to solve the above challenges: (1) A token-slimmed $Q \cdot K^{\top}$ reformulation is proposed to reduce the loading of intermediate data and redundant computations in CIM macros. Besides, a column-partitioned $X|W$-CIM with flexible set-aggre...
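For reference, the following is a minimal sketch (not from the paper) of the single-head attention defined above, $\text{Attn}=\text{Softmax}(Q \cdot K^{\top}) \cdot V$, plus a hypothetical effective-bit-width helper that illustrates why a few large probability values dominate bit-serial CIM latency. All shapes, names, and the EBW definition here are illustrative assumptions, not the chip's actual dataflow.

    # Sketch of one attention head as defined in the abstract; assumes NumPy.
    import numpy as np

    def softmax(s, axis=-1):
        e = np.exp(s - s.max(axis=axis, keepdims=True))  # numerically stable softmax
        return e / e.sum(axis=axis, keepdims=True)

    def attention_head(X, W_Q, W_K, W_V):
        Q = X @ W_Q           # queries, generated at runtime from tokens X
        K = X @ W_K           # keys, also runtime data (not fixed pre-trained weights)
        V = X @ W_V           # values
        P = softmax(Q @ K.T)  # attention probabilities; most entries are near zero
        return P @ V

    def effective_bit_width(p, frac_bits=8):
        # Hypothetical EBW measure: bits needed for each quantized probability,
        # showing that a small fraction of large entries sets the bit-serial
        # input latency of a CIM macro.
        q = np.round(p * (2 ** frac_bits)).astype(int)
        return np.where(q == 0, 0, np.floor(np.log2(np.maximum(q, 1))).astype(int) + 1)

    # Example usage: 16 tokens, model width 64 (arbitrary sizes).
    rng = np.random.default_rng(0)
    X = rng.standard_normal((16, 64))
    W_Q, W_K, W_V = (rng.standard_normal((64, 64)) * 0.05 for _ in range(3))
    out = attention_head(X, W_Q, W_K, W_V)
    print(out.shape)  # (16, 64)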
Published in: 2023 IEEE Asian Solid-State Circuits Conference (A-SSCC)
Date of Conference: 05-08 November 2023
Date Added to IEEE Xplore: 18 December 2023