AB-LSTM: Attention-based Bidirectional LSTM Model for Scene Text Detection

Published: 16 December 2019

Abstract

Detecting scene text of arbitrary shape is a challenging task in computer vision. Most existing scene text detection methods use rectangular or quadrangular bounding boxes to denote the detected text, which fail to accurately fit text of arbitrary shape, such as curved text. In addition, recent progress in scene text detection has benefited from Fully Convolutional Networks (FCNs). The text cues contained in multi-level convolutional features are complementary for detecting scene text objects, but how best to exploit these multi-level features remains an open problem. To tackle these issues, we propose an Attention-based Bidirectional Long Short-Term Memory (AB-LSTM) model for scene text detection. First, word stroke regions (WSRs) and text center blocks (TCBs) are extracted by two AB-LSTM models, respectively. Then, the union of the WSRs and TCBs is used to represent text objects. To verify the effectiveness of the proposed method, we perform experiments on four public benchmarks (CTW1500, Total-Text, ICDAR2013, and MSRA-TD500) and compare it with existing state-of-the-art methods. Experimental results demonstrate that the proposed method achieves competitive results and handles scene text objects of arbitrary shape (i.e., curved, oriented, and horizontal forms) well.
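
The two ideas named in the abstract, attention over multi-level features and the union of WSRs and TCBs, can be illustrated with a minimal sketch. The abstract does not give the model's equations, so this is not the authors' implementation: the softmax-weighted fusion is a generic attention scheme assumed here for illustration, the bidirectional LSTM is omitted entirely, and the names `attention_fuse` and `union_masks` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(level_features, level_scores):
    """Fuse per-level feature vectors with softmax attention weights.

    Assumption: generic attention-weighted sum, not the paper's formula.
    level_features: one feature vector (list of floats) per level.
    level_scores: one unnormalized relevance score per level.
    """
    if len(level_features) != len(level_scores):
        raise ValueError("need exactly one score per feature level")
    weights = softmax(level_scores)
    fused = [0.0] * len(level_features[0])
    for w, feat in zip(weights, level_features):
        for i, v in enumerate(feat):
            fused[i] += w * v
    return fused, weights


def union_masks(wsr_mask, tcb_mask):
    """Pixel-wise union (logical OR) of two binary masks of equal shape."""
    if len(wsr_mask) != len(tcb_mask) or any(
        len(a) != len(b) for a, b in zip(wsr_mask, tcb_mask)
    ):
        raise ValueError("masks must have the same shape")
    return [
        [int(bool(a) or bool(b)) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(wsr_mask, tcb_mask)
    ]


# Toy masks (made-up data): 1 marks a pixel predicted as text.
wsr = [[1, 1, 0, 0], [0, 0, 0, 0]]
tcb = [[0, 1, 1, 0], [0, 0, 0, 1]]
print(union_masks(wsr, tcb))  # [[1, 1, 1, 0], [0, 0, 0, 1]]
```

In this sketch a pixel belongs to a text object if either mask marks it, which is one straightforward reading of "the union of the WSRs and TCBs"; any grouping of pixels into separate text instances would be a further step that the abstract does not describe.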


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 4
November 2019
322 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3376119
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 December 2019
Accepted: 01 August 2019
Revised: 01 August 2019
Received: 01 December 2018
Published in TOMM Volume 15, Issue 4

Author Tags

  1. Scene text detection
  2. attention
  3. bidirectional LSTM
  4. feature fusion
  5. semantic segmentation

Qualifiers

  • Research-article
  • Research
  • Refereed
