When Massive GPU Parallelism Ain't Enough: A Novel Hardware Architecture of 2D-LSTM Neural Network

ABSTRACT
The Multidimensional Long Short-Term Memory (MD-LSTM) neural network is an extension of the one-dimensional LSTM to data with more than one dimension, which has allowed MD-LSTM to achieve state-of-the-art results in various applications, including handwritten text recognition and medical imaging. However, efficient implementation is hampered by the highly sequential execution of MD-LSTM, which tremendously slows down both training and inference compared to other neural networks. This is the primary reason that has prevented intensive research involving MD-LSTM in recent years, despite large progress in microelectronics and architectures. The main goal of this work is to accelerate MD-LSTM inference and thereby open the door to efficient training that can broaden the application of MD-LSTM. With this research we advocate that the FPGA is an alternative platform for deep learning that can offer a solution when the massive parallelism of GPUs does not deliver the performance required by the application. In this paper, we present the first hardware architecture for MD-LSTM. We conduct a systematic exploration of the precision vs. accuracy trade-off using a challenging dataset for historical document image binarization from the DIBCO 2017 contest, and the well-known MNIST dataset for handwritten digit recognition. Based on our new architecture, we implement an FPGA-based accelerator that outperforms an NVIDIA K80 GPU implementation by up to 50x in runtime and up to 746x in energy efficiency. At the same time, our accelerator demonstrates higher accuracy and comparable throughput compared with state-of-the-art FPGA-based implementations of multilayer perceptrons on the MNIST dataset.
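The sequential-execution bottleneck the abstract refers to comes from the 2D-LSTM recurrence itself: each cell consumes the hidden and cell states of its top and left neighbors, so only cells on the same anti-diagonal can ever be computed in parallel. The following is a minimal NumPy sketch of a 2D-LSTM cell in the style of Graves et al. (2007), not the paper's hardware architecture; all names, shapes, and the single fused weight matrix are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def md_lstm_2d(x, Wx, Wh1, Wh2, b, n_hidden):
    """Illustrative 2D-LSTM forward pass over an (H, W, n_in) input.

    States are stored in (H+1, W+1, n_hidden) arrays so that row/column 0
    hold the zero border states; cell (i, j) lives at index [i+1, j+1].
    """
    Hd, Wd, _ = x.shape
    h = np.zeros((Hd + 1, Wd + 1, n_hidden))
    c = np.zeros((Hd + 1, Wd + 1, n_hidden))
    for i in range(Hd):           # raster scan: cell (i, j) must wait for its
        for j in range(Wd):       # top (i-1, j) and left (i, j-1) neighbors
            z = x[i, j] @ Wx + h[i, j + 1] @ Wh1 + h[i + 1, j] @ Wh2 + b
            ig, f1, f2, og, g = np.split(z, 5)    # input, 2 forget, output, candidate
            ig, f1, f2, og = sigmoid(ig), sigmoid(f1), sigmoid(f2), sigmoid(og)
            # one forget gate per incoming dimension (top and left)
            c[i + 1, j + 1] = f1 * c[i, j + 1] + f2 * c[i + 1, j] + ig * np.tanh(g)
            h[i + 1, j + 1] = og * np.tanh(c[i + 1, j + 1])
    return h[1:, 1:]

# Toy usage with random weights (sizes are arbitrary)
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 8
x = rng.standard_normal((5, 4, n_in))
Wx = 0.1 * rng.standard_normal((n_in, 5 * n_hidden))
Wh1 = 0.1 * rng.standard_normal((n_hidden, 5 * n_hidden))
Wh2 = 0.1 * rng.standard_normal((n_hidden, 5 * n_hidden))
b = np.zeros(5 * n_hidden)
out = md_lstm_2d(x, Wx, Wh1, Wh2, b, n_hidden)
```

Because of the two-neighbor dependency, an H×W image admits at most min(H, W) concurrent cell evaluations (one wavefront per anti-diagonal), which is why GPU-style massive parallelism is poorly matched to this workload.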