ABSTRACT
Quantifying the uncertainty of neural networks (NNs) is required in many safety-critical applications such as autonomous driving and medical diagnosis. Recently, Bayesian transformers have demonstrated high-quality uncertainty estimates paired with excellent accuracy. However, their real-time deployment is limited by the compute-intensive attention mechanism at the core of the transformer architecture, and by the repeated Monte Carlo sampling required to quantify predictive uncertainty. To address these limitations, this paper accelerates Bayesian transformers via both algorithmic and hardware optimizations. On the algorithmic level, an evolutionary algorithm (EA)-based framework is proposed to exploit the sparsity in Bayesian transformers and ease their computational workload. On the hardware level, we demonstrate that this sparsity brings performance improvements on our optimized CPU and GPU implementations. An adaptable hardware architecture is also proposed to accelerate Bayesian transformers on an FPGA. Extensive experiments demonstrate that the EA-based framework, together with the hardware optimizations, reduces the latency of Bayesian transformers by up to 13, 12 and 20 times on CPU, GPU and FPGA platforms respectively, while achieving higher algorithmic performance.
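The repeated Monte Carlo sampling mentioned above is the main cost that separates Bayesian from deterministic inference: each prediction requires several stochastic forward passes whose outputs are aggregated, so latency scales linearly with the sample count. The sketch below is a minimal, hedged illustration of MC-dropout-style sampling in PyTorch (in the spirit of Gal and Ghahramani's dropout-as-Bayesian-approximation), not the paper's actual implementation; the toy model, the `mc_dropout_predict` helper, and the sample count are illustrative assumptions.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, num_samples: int = 10):
    """Run `num_samples` stochastic forward passes with dropout kept active,
    returning the mean class probabilities and a per-class variance as a
    simple predictive-uncertainty estimate."""
    model.train()  # keep dropout layers stochastic at inference time
    with torch.no_grad():
        # Each forward pass samples a different dropout mask; this repeated
        # computation is exactly the latency bottleneck the paper targets.
        samples = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(num_samples)]
        )  # shape: (num_samples, batch, num_classes)
    return samples.mean(dim=0), samples.var(dim=0)

# Toy classifier standing in for a Bayesian transformer head (hypothetical).
toy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(0.1), nn.Linear(32, 4))
mean_pred, uncertainty = mc_dropout_predict(toy, torch.randn(8, 16), num_samples=10)
```

With 10 samples, inference costs roughly 10 deterministic forward passes, which is why reducing per-pass work (e.g., via the sparsity exploited by the EA-based framework) translates directly into end-to-end latency reduction.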