Abstract
Existing methods for arbitrary-shaped text spotting can be divided into two categories: bottom-up methods detect and recognize local areas of text and then group them into text lines or words; top-down methods detect text regions of interest and then apply polygon fitting and text recognition to the detected regions. In this paper, we analyze the advantages and disadvantages of these two paradigms, and propose a novel text spotter that fuses bottom-up and top-down processing. To detect text of arbitrary shapes, we employ a bottom-up detector to describe text with a series of rotated squares, and design a top-down detector to represent the region of interest with a minimum enclosing rotated rectangle. The text boundary is then determined by fusing the outputs of the two detectors. To connect arbitrary-shaped text detection and recognition, we propose a differentiable operator named RoISlide, which extracts features for arbitrary text regions from whole-image feature maps. Based on the features extracted by RoISlide, a CNN- and CTC-based text recognizer is introduced to make the framework free from character-level annotations. To improve robustness to scale variation, we further propose a residual dual scale spotting mechanism, where two spotters work on different feature levels and the high-level spotter is based on residuals of the low-level spotter. Our method achieves state-of-the-art performance on four English datasets and one Chinese dataset, covering both arbitrary-shaped and oriented texts. We also provide extensive ablation experiments analyzing how the key components affect performance.
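The CTC-based recognizer mentioned in the abstract maps per-frame label predictions to a transcription without character-level alignment. As a minimal illustrative sketch (the function name, blank index, and toy alphabet are assumptions for illustration, not details from the paper), greedy CTC decoding collapses consecutive repeated labels and then removes blanks:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse repeated labels, then drop blanks (standard CTC best-path decoding)."""
    decoded = []
    prev = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame and is not blank.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Toy alphabet: 0 = blank, 1 = 'a', 2 = 'b'.
# Blanks separate genuine repetitions from frame-level duplicates.
frames = [1, 1, 0, 1, 2, 2, 0, 0, 2]
print(ctc_greedy_decode(frames))  # [1, 1, 2, 2]  -> "aabb"
```

In a full system these per-frame labels would come from the argmax of the recognizer's output distribution over the features extracted by RoISlide; training would instead use the CTC loss over all alignments.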
Acknowledgements
This work was supported by the Major Project for New Generation AI (Grant No. 2018AAA0100400), the National Natural Science Foundation of China (Grant Nos. 61733007, 61721004), the Key Research Program of Frontier Sciences of CAS under Grant ZDBS-LY-7004, and the Youth Innovation Promotion Association of CAS under Grant 2019141.
Communicated by Jiaya Jia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Feng, W., Yin, F., Zhang, XY. et al. Residual Dual Scale Scene Text Spotting by Fusing Bottom-Up and Top-Down Processing. Int J Comput Vis 129, 619–637 (2021). https://doi.org/10.1007/s11263-020-01388-x