ABSTRACT
Non-linear operations such as GELU, layer normalization, and Softmax are essential yet costly building blocks of Transformer models. Several prior works have simplified these operations with look-up tables or integer computations, but such approximations suffer from inferior accuracy or considerable hardware cost and long latency. This paper proposes an accurate and hardware-friendly approximation framework for efficient Transformer inference. Our framework employs a simple neural network as a universal approximator, with its structure equivalently transformed into a look-up table (LUT). The proposed framework, called neural network generated LUT (NN-LUT), can accurately replace all the non-linear operations in popular BERT models with significant reductions in area, power consumption, and latency.
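The core observation behind the network-to-LUT transformation is that a one-hidden-layer ReLU network is exactly a piecewise-linear function, so it can be rewritten as a table of breakpoints with per-segment slopes and intercepts. The sketch below illustrates this idea for GELU; it is not the authors' implementation. The knot positions, the segment count, and the least-squares fit of the output layer (in place of the paper's training procedure) are all illustrative assumptions.

```python
# Minimal sketch of the NN-LUT idea (not the paper's code): fit a tiny
# one-hidden-layer ReLU network to GELU, then rewrite it exactly as a
# piecewise-linear look-up table.
import numpy as np

def gelu(x):
    # tanh approximation of GELU, used here as the reference function
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# --- 1. Fit f(x) = c + sum_i v_i * relu(x - t_i) by least squares. ---
# Fixing the hidden weights to 1 and the biases to -t_i makes the model
# linear in (v, c), so a single lstsq call fits the output layer.
knots = np.linspace(-4.0, 4.0, 16)                 # hidden-unit breakpoints (assumed)
xs = np.linspace(-6.0, 6.0, 2001)
A = np.maximum(xs[:, None] - knots[None, :], 0.0)  # ReLU features, shape (2001, 16)
A = np.hstack([A, np.ones((xs.size, 1))])          # bias column for c
coef, *_ = np.linalg.lstsq(A, gelu(xs), rcond=None)
v, c = coef[:-1], coef[-1]

# --- 2. Equivalent LUT: per-segment slope and intercept. ---
# Left of the first knot every unit is inactive; each crossed knot t_i
# adds v_i to the slope and subtracts v_i * t_i from the intercept.
slopes = np.concatenate([[0.0], np.cumsum(v)])
intercepts = np.concatenate([[c], c - np.cumsum(v * knots)])

def lut_gelu(x):
    seg = np.searchsorted(knots, x)                # segment index for each input
    return slopes[seg] * x + intercepts[seg]

# The LUT reproduces the fitted network exactly; the approximation error
# versus true GELU shrinks as segments are added.
err = np.max(np.abs(lut_gelu(xs) - gelu(xs)))
print(f"max |LUT - GELU| on [-6, 6]: {err:.4e}")
```

At inference time, only step 2's tables are needed: one comparison tree (or search) to find the segment, one multiply, and one add per evaluation, which is what makes the LUT form hardware-friendly.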