Abstract
Existing methods for arbitrary-shaped text spotting can be divided into two categories: bottom-up methods detect and recognize local areas of text and then group them into text lines or words; top-down methods detect text regions of interest and then apply polygon fitting and text recognition to the detected regions. In this paper, we analyze the advantages and disadvantages of these two paradigms, and propose a novel text spotter that fuses bottom-up and top-down processing. To detect text of arbitrary shapes, we employ a bottom-up detector to describe text with a series of rotated squares, and design a top-down detector to represent the region of interest with a minimum enclosing rotated rectangle. The text boundary is then determined by fusing the outputs of the two detectors. To connect arbitrary-shaped text detection and recognition, we propose a differentiable operator named RoISlide, which extracts features for arbitrary text regions from whole-image feature maps. Based on the features extracted by RoISlide, a CNN- and CTC-based text recognizer is introduced to make the framework free from character-level annotations. To improve robustness to scale variation, we further propose a residual dual scale spotting mechanism, where two spotters work on different feature levels and the high-level spotter is based on residuals of the low-level spotter. Our method achieves state-of-the-art performance on four English datasets and one Chinese dataset, covering both arbitrary-shaped and oriented texts. We also provide extensive ablation experiments analyzing how the key components affect performance.
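The CTC-based recognizer mentioned in the abstract maps per-frame label predictions to a transcription without character-level alignment. As a minimal illustrative sketch (the function name, blank index, and toy alphabet are assumptions for illustration, not details from the paper), greedy CTC decoding collapses consecutive repeated labels and then removes blanks:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse repeated labels, then drop blanks (standard CTC best-path decoding)."""
    decoded = []
    prev = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame and is not blank.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Toy alphabet: 0 = blank, 1 = 'a', 2 = 'b'.
# Blanks separate genuine repetitions from frame-level duplicates.
frames = [1, 1, 0, 1, 2, 2, 0, 0, 2]
print(ctc_greedy_decode(frames))  # [1, 1, 2, 2]  -> "aabb"
```

In a full system these per-frame labels would come from the argmax of the recognizer's output distribution over the features extracted by RoISlide; training would instead use the CTC loss over all alignments.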
Acknowledgements
This work was supported by the Major Project for New Generation AI (Grant No. 2018AAA0100400), the National Natural Science Foundation of China (Grant Nos. 61733007, 61721004), the Key Research Program of Frontier Sciences of CAS under Grant ZDBS-LY-7004, and the Youth Innovation Promotion Association of CAS under Grant 2019141.
Communicated by Jiaya Jia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Feng, W., Yin, F., Zhang, XY. et al. Residual Dual Scale Scene Text Spotting by Fusing Bottom-Up and Top-Down Processing. Int J Comput Vis 129, 619–637 (2021). https://doi.org/10.1007/s11263-020-01388-x