ABSTRACT
Recent scene text spotters that integrate a text detection module and a text recognition module have made significant progress. However, existing methods face two problems: (1) the data imbalance between the text detection module and the text recognition module limits the performance of text spotters, and (2) the default left-to-right reading direction leads to errors when spotting unconventionally arranged text. In this paper, we propose a novel scene text spotter, TDI, to solve these problems. First, to address the data imbalance, a sample generation algorithm is proposed that uses character features and character labels to generate abundant samples online for training the text recognition module. Second, a weakly supervised character generation algorithm is designed to derive character-level labels from word-level labels, supplying both the sample generation algorithm and the training of the text detection module. Finally, to spot arbitrarily arranged text correctly, a direction perception module is proposed to perceive the reading direction of each text instance. Experiments on several benchmarks show that these designs significantly improve the performance of the text spotter. Specifically, our method outperforms state-of-the-art methods on three public datasets in both text detection and end-to-end text recognition, which fully demonstrates its effectiveness and robustness.
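The online sample generation idea can be illustrated with a minimal sketch. All names and data here are hypothetical (the paper's actual algorithm operates on character features extracted by the network, with character labels produced by the weakly supervised generation step); the sketch only shows how extra word-level recognition samples could be composed by concatenating pooled per-character features:

```python
import random

def generate_samples(char_pool, num_samples, min_len=2, max_len=8, rng=None):
    """Synthesize recognition training samples from a character-level pool.

    char_pool: dict mapping a character label to a list of feature vectors.
    Returns a list of (feature_sequence, label_string) pairs, which a
    recognition module could consume as extra training data independent of
    how many words the detection module happens to produce per image.
    """
    rng = rng or random.Random(0)
    labels = sorted(char_pool)
    samples = []
    for _ in range(num_samples):
        length = rng.randint(min_len, max_len)
        chars = [rng.choice(labels) for _ in range(length)]
        # Pick one stored feature per sampled character and concatenate
        # them into a pseudo-word feature sequence.
        feats = [rng.choice(char_pool[c]) for c in chars]
        samples.append((feats, "".join(chars)))
    return samples

# Toy pool: each "feature" is a short vector standing in for a CNN crop.
pool = {"a": [[0.1, 0.2]], "b": [[0.3, 0.4]], "c": [[0.5, 0.6]]}
batch = generate_samples(pool, num_samples=4)
```

Because the pseudo-words are assembled on the fly, the recognizer sees as many samples per iteration as needed, decoupling its training from the number of real word instances the detector supplies.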
TDI TextSpotter: Taking Data Imbalance into Account in Scene Text Spotting