Abstract
This paper proposes a new hardware approach for accelerating matrix-vector multiplication (MVM) that employs a systolic array architecture and parallel data-processing units, which is particularly useful in multiplication-intensive applications such as neural networks. The hardware complexity of the parallel computations is reduced by a technique called the split-matrix approach, in which larger matrices are split into smaller ones. The proposed architecture uses an 8-bit fixed-point representation and assumes the matrices to be circulant. The resulting MVM architecture benefits from reduced implementation complexity in terms of cell area, delay, and power consumption. It achieves a 13.9% reduction in logic cell area and a 38.15% reduction in total power consumption compared with the latest baseline design, and attains a considerably improved minimum permissible clock period of 0.410 ns. The development of a long short-term memory (LSTM) architecture using the proposed design further demonstrates the effectiveness of the proposed MVM architecture: the LSTM built on the proposed MVM provides a 37.57% reduction in cell area and a 22.86% reduction in total power compared with the latest baseline design, while achieving a minimum clock period of 0.42 ns.
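As a rough illustration of the split-matrix idea summarized above, the sketch below models an MVM whose weight matrix is partitioned into small circulant blocks, each stored only by its first row, and accumulates the per-block products independently. This is a minimal NumPy model under assumed block sizes and random test data, not the paper's 8-bit fixed-point systolic datapath; the function names and the first-row block representation are illustrative assumptions.

```python
# Sketch of split-matrix MVM with circulant blocks (floating point, NumPy).
# Assumptions: block size b, first-row storage per block, random test data.
import numpy as np

def circulant(first_row):
    """Build a b x b circulant matrix from its first row."""
    b = len(first_row)
    return np.array([[first_row[(j - i) % b] for j in range(b)]
                     for i in range(b)])

def split_matrix_mvm(block_rows, x, b):
    """y = W x, where W is stored only as the first rows of its b x b
    circulant blocks (block_rows[i][j] is the first row of block (i, j))."""
    n_blocks = len(block_rows)
    y = np.zeros(n_blocks * b)
    for i in range(n_blocks):
        for j in range(n_blocks):
            # Each small circulant product is independent; a hardware
            # realization could map these onto parallel processing units.
            y[i*b:(i+1)*b] += circulant(block_rows[i][j]) @ x[j*b:(j+1)*b]
    return y

# Usage example: an 8 x 8 weight matrix split into 2 x 2 circulant blocks.
b, n_blocks = 4, 2
rng = np.random.default_rng(0)
block_rows = [[rng.integers(-8, 8, size=b) for _ in range(n_blocks)]
              for _ in range(n_blocks)]
x = rng.integers(-8, 8, size=n_blocks * b)

# Rebuild the full matrix only to check that the split computation matches.
W = np.block([[circulant(r) for r in row] for row in block_rows])
assert np.allclose(split_matrix_mvm(block_rows, x, b), W @ x)
```

Because each b x b circulant block is defined by b values rather than b^2, storage and multiplier requirements drop by roughly a factor of b, which is the complexity reduction the split-matrix/circulant formulation exploits.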
Data Availability
This article has no associated data.
Acknowledgements
The authors would like to thank the Department of Science & Technology, Government of India, for supporting this work under the FIST scheme No. SR/FST/ET-I/2017/68.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Joseph, T., Bindiya, T.S. Performance-Driven LSTM Accelerator Hardware Using Split-Matrix-Based MVM. Circuits Syst Signal Process 42, 6660–6683 (2023). https://doi.org/10.1007/s00034-023-02412-4