ABSTRACT
We propose a novel method for approximate hardware implementation of univariate math functions that uses significantly fewer hardware resources than previous approaches. Examples of such functions include exp(x) and the activation function GELU(x), both used in transformer networks; gamma(x), which is used in image processing; and other functions such as tanh(x), cosh(x), sq(x), and sqrt(x). The method builds on previous work on hybrid binary-unary computing. The novelty of our approach is that we break a function into a number of sub-functions such that implementing each sub-function becomes cheap and converting the output of the sub-functions to binary becomes almost trivial. Our method also exploits self-similarity in functions to further reduce the cost. We compare our method to conventional binary, previous stochastic computing, and hybrid binary-unary methods on several functions at 8-, 12-, and 16-bit resolutions. While preserving high accuracy, our method outperforms previous work in terms of hardware cost: when a mean absolute error below 0.01 is tolerated, it reduces the (area x latency) cost on average by 5, 7, and 2 orders of magnitude compared to the conventional binary, stochastic computing, and hybrid binary-unary methods, respectively. Finally, we demonstrate the potential benefits of our method for natural language processing and image processing applications. We deploy our method to implement major blocks in an encoding layer of the BERT language model, as well as the Roberts Cross edge detection algorithm. Both include non-linear functions.
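The decomposition idea described above can be illustrated with a small software model. The following is a minimal Python sketch, not the paper's circuit or algorithm: it assumes the input range is split into equal sub-ranges selected by the top input bits, and that each sub-function is stored as a binary offset plus a small residual. The helper names (`quantize`, `split_into_subfunctions`, `evaluate`) and the choice of tanh(x) at 8 bits with 8 sub-functions are hypothetical, chosen only for illustration.

```python
import math

def quantize(f, bits):
    """Tabulate f on [0, 1) at 2**bits points; outputs are bits-wide integers."""
    n = 1 << bits
    return [min(n - 1, max(0, round(f(i / n) * (n - 1)))) for i in range(n)]

def split_into_subfunctions(table, k):
    """Split the table into 2**k equal sub-ranges (assumed scheme).

    Each sub-function is kept as (offset, residuals): the offset is the
    minimum value in the sub-range, the residuals are the small remainders.
    """
    size = len(table) >> k
    subs = []
    for start in range(0, len(table), size):
        chunk = table[start:start + size]
        offset = min(chunk)
        subs.append((offset, [v - offset for v in chunk]))
    return subs

def evaluate(subs, x, bits, k):
    """Pick a sub-function with the top k bits, index it with the low bits."""
    low_bits = bits - k
    offset, residuals = subs[x >> low_bits]
    return offset + residuals[x & ((1 << low_bits) - 1)]

BITS, K = 8, 3                      # illustrative resolution and split
table = quantize(math.tanh, BITS)
subs = split_into_subfunctions(table, K)

# Largest residual any sub-function has to represent.
max_residual = max(max(res) for _, res in subs)

# Mean absolute error of the quantized model against the real function.
n = 1 << BITS
mae = sum(abs(evaluate(subs, x, BITS, K) / (n - 1) - math.tanh(x / n))
          for x in range(n)) / n
```

In this model each sub-function only has to cover a narrow output range (`max_residual` is far below the full 8-bit range), which is the property that would make a per-sub-function unary encoding short; the final value is recovered by adding the binary offset back.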
Index Terms
- Approximate Hybrid Binary-Unary Computing with Applications in BERT Language Model and Image Processing