
Remarn: A Reconfigurable Multi-threaded Multi-core Accelerator for Recurrent Neural Networks

Published: 22 December 2022

Abstract

This work introduces Remarn, a reconfigurable multi-threaded multi-core accelerator supporting both spatial and temporal co-execution of Recurrent Neural Network (RNN) inferences. It increases the processing capability and quality of service of cloud-based neural processing units (NPUs) by improving their hardware utilization and reducing design latency, based on two innovations. First, a custom coarse-grained multi-threaded RNN/Long Short-Term Memory (LSTM) hardware architecture, which switches tasks among threads when the RNN computation engines encounter data hazards. Second, the partitioning of this hardware architecture into multiple full-fledged sub-accelerator cores, enabling the spatial co-execution of multiple RNN/LSTM inferences. Together, these innovations exploit the available parallelism to increase runtime hardware utilization and boost design throughput. Evaluation results show that a dual-threaded quad-core Remarn NPU achieves 2.91 times higher performance while occupying only 5.0% more area than a single-threaded one on a Stratix 10 FPGA. Compared with a Tesla V100 GPU implementation, our design achieves 6.5 times better performance and 15.6 times higher power efficiency, showing that our approach contributes to high-performance and energy-efficient FPGA-based multi-RNN inference designs for datacenters.


Cited By

  • (2024) Energy-Efficient Computing Acceleration of Unmanned Aerial Vehicles Based on a CPU/FPGA/NPU Heterogeneous System. IEEE Internet of Things Journal 11, 16 (27126–27138). https://doi.org/10.1109/JIOT.2024.3397649. Online publication date: 15-Aug-2024.
  • (2024) Pomelo: Alternative Mechanism of Threads Communication for Accelerating Convolution on SIMT Based Processor. In 2024 9th International Conference on Intelligent Computing and Signal Processing (ICSP) (1357–1360). https://doi.org/10.1109/ICSP62122.2024.10743993. Online publication date: 19-Apr-2024.
  • (2024) Auto Batching Scheme for Optimizing LSTM Inference on FPGA Platforms. IEEE Access 12 (159380–159394). https://doi.org/10.1109/ACCESS.2024.3488033. Online publication date: 2024.


Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 1
March 2023
403 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/35733111
Editor: Deming Chen

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 22 December 2022
Online AM: 17 May 2022
Accepted: 01 May 2022
Revised: 18 February 2022
Received: 02 September 2021
Published in TRETS Volume 16, Issue 1


Author Tags

  1. Accelerator architecture
  2. recurrent neural networks
  3. multi-tenant execution

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • United Kingdom EPSRC
