
Performance-Driven LSTM Accelerator Hardware Using Split-Matrix-Based MVM

Published in Circuits, Systems, and Signal Processing

Abstract

This paper proposes a new hardware approach for accelerating matrix-vector multiplication (MVM) that employs a systolic array architecture and parallel data processing units, which is particularly useful in multiplication-intensive applications such as neural networks. The hardware complexity of the parallel computations is reduced by a technique called the split-matrix approach, in which large matrices are split into smaller sub-matrices. The proposed architecture uses an 8-bit fixed-point representation and assumes the matrices to be circulant. The resulting MVM architecture benefits from reduced implementation complexity in terms of cell area, delay, and power consumption: it achieves a 13.9% reduction in logic cell area and a 38.15% reduction in total power consumption compared with the latest baseline design, while attaining a considerably improved minimum permissible clock period of 0.410 ns. The development of a long short-term memory (LSTM) architecture using the proposed design further demonstrates the effectiveness of the proposed MVM architecture. The LSTM built on the proposed MVM provides a 37.57% reduction in cell area and a 22.86% reduction in total power compared with the latest baseline design, while achieving a minimum clock period of 0.42 ns.
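To illustrate the split-matrix idea described in the abstract, the following Python sketch performs a block-circulant MVM in 8-bit fixed point: the weight matrix is split into small circulant tiles, each stored by its first row only, and the per-tile products are accumulated in a wider register. The tile size, fixed-point format, and function names are illustrative assumptions; this is a numerical sketch, not the authors' systolic-array data path.

```python
# Minimal sketch of split-matrix (block-circulant) MVM in 8-bit fixed point.
# BLOCK size and Q2.6 format are assumptions for illustration only.
import numpy as np

BLOCK = 4          # size of each circulant sub-matrix (assumed)
FRAC_BITS = 6      # fractional bits of the assumed Q2.6 fixed-point format

def quantize(x):
    """Round to 8-bit signed fixed point with FRAC_BITS fractional bits."""
    return np.clip(np.round(x * (1 << FRAC_BITS)), -128, 127).astype(np.int32)

def circulant(first_row):
    """Expand a BLOCK x BLOCK circulant tile from its defining first row."""
    return np.stack([np.roll(first_row, i) for i in range(BLOCK)])

def block_circulant_mvm(defining_rows, x_q):
    """Compute y = W @ x with W split into BLOCK x BLOCK circulant tiles.

    defining_rows[i][j] stores only the first row of tile (i, j), so the
    weight storage per tile drops from BLOCK*BLOCK words to BLOCK words.
    """
    n_tile_rows = len(defining_rows)
    y = np.zeros(n_tile_rows * BLOCK, dtype=np.int64)    # wide accumulator
    for i, row_of_tiles in enumerate(defining_rows):
        for j, first_row in enumerate(row_of_tiles):
            tile = circulant(first_row)                   # expanded on the fly
            y[i*BLOCK:(i+1)*BLOCK] += tile @ x_q[j*BLOCK:(j+1)*BLOCK]
    # each product carries 2*FRAC_BITS fractional bits; rescale to FRAC_BITS
    return y >> FRAC_BITS

# Usage example: an 8x8 weight matrix stored as a 2x2 grid of circulant tiles.
rng = np.random.default_rng(0)
defining_rows = [[quantize(rng.uniform(-1, 1, BLOCK)) for _ in range(2)]
                 for _ in range(2)]
x_q = quantize(rng.uniform(-1, 1, 8))
print(block_circulant_mvm(defining_rows, x_q))
```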



Data Availability

This article has no associated data.


Acknowledgements

The authors would like to thank the Department of Science & Technology, Government of India, for supporting this work under the FIST scheme No. SR/FST/ET-I/2017/68.

Author information

Corresponding author

Correspondence to Tresa Joseph.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Joseph, T., Bindiya, T.S. Performance-Driven LSTM Accelerator Hardware Using Split-Matrix-Based MVM. Circuits Syst Signal Process 42, 6660–6683 (2023). https://doi.org/10.1007/s00034-023-02412-4



  • DOI: https://doi.org/10.1007/s00034-023-02412-4
