Abstract:
Vision Transformer (ViT) has recently demonstrated impressive nonlinear modeling capabilities and achieved state-of-the-art performance in various industrial applications, such as object recognition, anomaly detection, and robot control. However, its practical deployment can be hindered by high storage requirements and computational intensity. To alleviate these challenges, we propose a binary transformer called BinaryFormer, which quantizes the learned weights of the ViT module from 32-b precision to 1 b. Furthermore, we propose a hierarchical-adaptive architecture that replaces expensive matrix operations with more affordable addition and bit operations by switching between two attention modes. As a result, BinaryFormer effectively compresses the model size and reduces the computation cost of ViT. Experimental results on the ImageNet-1K benchmark dataset show that BinaryFormer reduces the size of a typical ViT model by an average of 27.7× and converts over 99% of multiplication operations into bit operations while maintaining reasonable accuracy.
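The abstract describes quantizing 32-b weights to 1 b and replacing multiplications with additions and bit operations. Below is a minimal sketch of what such weight binarization can look like, assuming a common XNOR-Net-style sign-plus-scale scheme; the helper names `binarize_weights` and `binary_linear` are hypothetical and the paper's exact method is not specified in the abstract.

```python
import numpy as np

def binarize_weights(W):
    """Binarize a float32 weight matrix to {-1, +1} with a per-row scaling
    factor (assumption: XNOR-Net-style recipe, not necessarily the paper's)."""
    # alpha: mean absolute value per output row, restores dynamic range
    alpha = np.mean(np.abs(W), axis=1, keepdims=True)
    # B: 1-bit weights stored as signs
    B = np.sign(W)
    B[B == 0] = 1  # map exact zeros to +1 so every weight fits in 1 bit
    return B, alpha

def binary_linear(x, B, alpha):
    """Approximate x @ W.T using sign weights and a scaling factor.
    With binary B, the matrix product reduces to additions/subtractions
    (or XNOR-popcount on packed bits in an optimized kernel)."""
    return (x @ B.T) * alpha.T

# Usage: compare the binary approximation against the full-precision product.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)   # full-precision weights
x = rng.standard_normal((2, 8)).astype(np.float32)   # input activations
B, alpha = binarize_weights(W)
print(np.abs(x @ W.T - binary_linear(x, B, alpha)).mean())  # approximation error
```

Storing only the signs plus one scale per row is what yields the large compression factor reported in the abstract, since each 32-b weight collapses to a single bit.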
Published in: IEEE Transactions on Industrial Informatics (Volume: 20, Issue: 8, August 2024)