ABSTRACT
Recent scene text spotters that integrate a text detection module and a text recognition module have made significant progress. However, existing methods face two problems: (1) the data imbalance between the text detection module and the text recognition module limits the performance of text spotters, and (2) the default left-to-right reading direction leads to errors when spotting unconventionally arranged text. In this paper, we propose a novel scene text spotter, TDI, to solve these problems. First, to address the data imbalance, a sample generation algorithm is proposed that uses character features and character labels to generate abundant samples online for training the text recognition module. Second, a weakly supervised character generation algorithm is designed to derive character-level labels from word-level labels, supplying both the sample generation algorithm and the training of the text detection module. Finally, to spot arbitrarily arranged text correctly, a direction perception module is proposed to perceive the reading direction of each text instance. Experiments on several benchmarks show that these designs significantly improve the performance of the text spotter. Specifically, our method outperforms state-of-the-art methods on three public datasets in both text detection and end-to-end text recognition, which fully demonstrates its effectiveness and robustness.
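The online sample generation idea can be illustrated with a minimal sketch. All names and data here are hypothetical (the paper's actual algorithm operates on character features extracted by the network, with character labels produced by the weakly supervised generation step); the sketch only shows how extra word-level recognition samples could be composed by concatenating pooled per-character features:

```python
import random

def generate_samples(char_pool, num_samples, min_len=2, max_len=8, rng=None):
    """Synthesize recognition training samples from a character-level pool.

    char_pool: dict mapping a character label to a list of feature vectors.
    Returns a list of (feature_sequence, label_string) pairs, which a
    recognition module could consume as extra training data independent of
    how many words the detection module happens to produce per image.
    """
    rng = rng or random.Random(0)
    labels = sorted(char_pool)
    samples = []
    for _ in range(num_samples):
        length = rng.randint(min_len, max_len)
        chars = [rng.choice(labels) for _ in range(length)]
        # Pick one stored feature per sampled character and concatenate
        # them into a pseudo-word feature sequence.
        feats = [rng.choice(char_pool[c]) for c in chars]
        samples.append((feats, "".join(chars)))
    return samples

# Toy pool: each "feature" is a short vector standing in for a CNN crop.
pool = {"a": [[0.1, 0.2]], "b": [[0.3, 0.4]], "c": [[0.5, 0.6]]}
batch = generate_samples(pool, num_samples=4)
```

Because the pseudo-words are assembled on the fly, the recognizer sees as many samples per iteration as needed, decoupling its training from the number of real word instances the detector supplies.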
TDI TextSpotter: Taking Data Imbalance into Account in Scene Text Spotting