
Remarn: A Reconfigurable Multi-threaded Multi-core Accelerator for Recurrent Neural Networks

Published: 22 December 2022

Abstract

This work introduces Remarn, a reconfigurable multi-threaded multi-core accelerator supporting both spatial and temporal co-execution of Recurrent Neural Network (RNN) inferences. It increases the processing capability and quality of service of cloud-based neural processing units (NPUs) by improving their hardware utilization and reducing design latency, based on two innovations. First, a custom coarse-grained multi-threaded RNN/Long Short-Term Memory (LSTM) hardware architecture, which switches tasks among threads when the RNN computation engines encounter data hazards. Second, the partitioning of this hardware architecture into multiple full-fledged sub-accelerator cores, enabling the spatial co-execution of multiple RNN/LSTM inferences. Together, these innovations exploit the available parallelism to increase runtime hardware utilization and boost design throughput. Evaluation results show that a dual-threaded quad-core Remarn NPU achieves 2.91 times higher performance while occupying only 5.0% more area than a single-threaded one on a Stratix 10 FPGA. Compared with a Tesla V100 GPU implementation, our design achieves 6.5 times better performance and 15.6 times higher power efficiency, showing that our approach contributes to high-performance and energy-efficient FPGA-based multi-RNN inference designs for datacenters.


Cited By

  • (2024) Energy-Efficient Computing Acceleration of Unmanned Aerial Vehicles Based on a CPU/FPGA/NPU Heterogeneous System. IEEE Internet of Things Journal 11, 16 (27126–27138). https://doi.org/10.1109/JIOT.2024.3397649. Online publication date: 15-Aug-2024.
  • (2024) Pomelo: Alternative Mechanism of Threads Communication for Accelerating Convolution on SIMT Based Processor. In 2024 9th International Conference on Intelligent Computing and Signal Processing (ICSP) (1357–1360). https://doi.org/10.1109/ICSP62122.2024.10743993. Online publication date: 19-Apr-2024.
  • (2024) Auto Batching Scheme for Optimizing LSTM Inference on FPGA Platforms. IEEE Access 12 (159380–159394). https://doi.org/10.1109/ACCESS.2024.3488033. Online publication date: 2024.


Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 1
March 2023
403 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/35733111
Editor: Deming Chen

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 22 December 2022
Online AM: 17 May 2022
Accepted: 01 May 2022
Revised: 18 February 2022
Received: 02 September 2021
Published in TRETS Volume 16, Issue 1


Author Tags

  1. Accelerator architecture
  2. recurrent neural networks
  3. multi-tenant execution

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • United Kingdom EPSRC
