
Performance-Driven LSTM Accelerator Hardware Using Split-Matrix-Based MVM

Published in Circuits, Systems, and Signal Processing

Abstract

This paper proposes a new hardware approach for accelerating matrix-vector multiplication (MVM) that employs a systolic array architecture and parallel data processing units, which is particularly useful in multiplication-intensive applications such as neural networks. The hardware complexity of the parallel computations is reduced by a technique called the split-matrix approach, in which large matrices are split into smaller sub-matrices. The proposed architecture uses an 8-bit fixed-point representation and assumes the matrices to be circulant. The resulting MVM architecture benefits from reduced implementation complexity in terms of cell area, delay, and power consumption: it achieves a 13.9% reduction in logic cell area and a 38.15% reduction in total power consumption compared with the latest baseline design, while attaining a considerably improved minimum permissible clock period of 0.410 ns. The development of a long short-term memory (LSTM) architecture using the proposed design further demonstrates the effectiveness of the proposed MVM architecture. The LSTM built on the proposed MVM provides a 37.57% reduction in cell area and a 22.86% reduction in total power compared with the latest baseline design, while achieving a minimum clock period of 0.42 ns.
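To illustrate the split-matrix idea described in the abstract, the following Python sketch performs a block-circulant MVM in 8-bit fixed point: the weight matrix is split into small circulant tiles, each stored by its first row only, and the per-tile products are accumulated in a wider register. The tile size, fixed-point format, and function names are illustrative assumptions; this is a numerical sketch, not the authors' systolic-array data path.

```python
# Minimal sketch of split-matrix (block-circulant) MVM in 8-bit fixed point.
# BLOCK size and Q2.6 format are assumptions for illustration only.
import numpy as np

BLOCK = 4          # size of each circulant sub-matrix (assumed)
FRAC_BITS = 6      # fractional bits of the assumed Q2.6 fixed-point format

def quantize(x):
    """Round to 8-bit signed fixed point with FRAC_BITS fractional bits."""
    return np.clip(np.round(x * (1 << FRAC_BITS)), -128, 127).astype(np.int32)

def circulant(first_row):
    """Expand a BLOCK x BLOCK circulant tile from its defining first row."""
    return np.stack([np.roll(first_row, i) for i in range(BLOCK)])

def block_circulant_mvm(defining_rows, x_q):
    """Compute y = W @ x with W split into BLOCK x BLOCK circulant tiles.

    defining_rows[i][j] stores only the first row of tile (i, j), so the
    weight storage per tile drops from BLOCK*BLOCK words to BLOCK words.
    """
    n_tile_rows = len(defining_rows)
    y = np.zeros(n_tile_rows * BLOCK, dtype=np.int64)    # wide accumulator
    for i, row_of_tiles in enumerate(defining_rows):
        for j, first_row in enumerate(row_of_tiles):
            tile = circulant(first_row)                   # expanded on the fly
            y[i*BLOCK:(i+1)*BLOCK] += tile @ x_q[j*BLOCK:(j+1)*BLOCK]
    # each product carries 2*FRAC_BITS fractional bits; rescale to FRAC_BITS
    return y >> FRAC_BITS

# Usage example: an 8x8 weight matrix stored as a 2x2 grid of circulant tiles.
rng = np.random.default_rng(0)
defining_rows = [[quantize(rng.uniform(-1, 1, BLOCK)) for _ in range(2)]
                 for _ in range(2)]
x_q = quantize(rng.uniform(-1, 1, 8))
print(block_circulant_mvm(defining_rows, x_q))
```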



Data Availability

This article has no associated data.


Acknowledgements

The authors would like to thank the Department of Science & Technology, Government of India, for supporting this work under the FIST scheme No. SR/FST/ET-I/2017/68.

Author information

Corresponding author

Correspondence to Tresa Joseph.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Joseph, T., Bindiya, T.S. Performance-Driven LSTM Accelerator Hardware Using Split-Matrix-Based MVM. Circuits Syst Signal Process 42, 6660–6683 (2023). https://doi.org/10.1007/s00034-023-02412-4



  • DOI: https://doi.org/10.1007/s00034-023-02412-4
