DOI: 10.1145/3474085.3475423

Research Article

TDI TextSpotter: Taking Data Imbalance into Account in Scene Text Spotting

Published: 17 October 2021

ABSTRACT

Recent scene text spotters that integrate text detection and recognition modules have made significant progress. However, existing methods face two problems: (1) the data imbalance between the text detection module and the text recognition module limits the performance of text spotters, and (2) the default left-to-right reading direction leads to errors when spotting unconventionally arranged text. In this paper, we propose TDI, a novel scene text spotter that addresses these problems. First, to solve the data imbalance problem, a sample generation algorithm is proposed that uses character features and character labels to generate plentiful samples online for training the text recognition module. Second, a weakly supervised character generation algorithm is designed to derive character-level labels from word-level labels, both for the sample generation algorithm and for training the text detection module. Finally, to spot arbitrarily arranged text correctly, a direction perception module is proposed to perceive the reading direction of each text instance. Experiments on several benchmarks show that these designs significantly improve the performance of the text spotter. Specifically, our method outperforms state-of-the-art methods on three public datasets in both text detection and end-to-end text recognition, demonstrating its effectiveness and robustness.
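As a toy illustration of the online sample-generation idea summarized above (a minimal sketch with hypothetical names and data, not the authors' implementation), synthetic word samples for a recognizer can be assembled by concatenating character features and their labels harvested during training:

```python
import random

def generate_samples(char_bank, num_samples, min_len=2, max_len=8, seed=0):
    """Assemble synthetic word samples from a bank of character entries.

    char_bank: list of (feature, label) pairs collected online from the
    detection branch (hypothetical representation).
    Returns a list of (feature_sequence, word) training pairs.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        length = rng.randint(min_len, max_len)
        # Draw characters independently and concatenate their features
        # and labels into one synthetic word sample.
        chars = [rng.choice(char_bank) for _ in range(length)]
        feats = [feat for feat, _ in chars]
        word = "".join(label for _, label in chars)
        samples.append((feats, word))
    return samples

# Hypothetical character bank: one scalar "feature" per character.
bank = [([0.1], "a"), ([0.2], "b"), ([0.3], "c")]
pairs = generate_samples(bank, num_samples=4)
```

The point of such a scheme is that the recognizer sees many more word-level samples than the detector yields per image, which is one plausible way to counter the detection/recognition data imbalance the paper targets.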


Supplemental Material

MM21-fp1473.mp4 (mp4, 15.8 MB)


Published in

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085

Copyright © 2021 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%
