Abstract
In recent years, Mask R-CNN-based methods have achieved promising performance on scene text detection. This paper proposes to incorporate a self-attention mechanism and multi-task learning into a Mask R-CNN-based scene text detection framework. For the backbone, the self-attention-based Swin Transformer is adopted to replace the original ResNet, and a composite network scheme is further utilized to combine two Swin Transformer networks into a single backbone. For the detection heads, a multi-task learning method with a cascade refinement structure is proposed, jointly performing text/non-text classification, bounding-box regression, mask prediction, and text-line recognition. Experiments on the ICDAR MLT 2017 and 2019 datasets show that the proposed method achieves improved performance.
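The two architectural ideas in the abstract, a CBNet-style composite backbone and a multi-task detection head, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: `TinyStage` is a hypothetical stand-in for one Swin Transformer stage, and the composite connection (adding the assisting backbone's stage outputs to the lead backbone's features) follows the general CBNet scheme rather than the paper's exact wiring.

```python
import torch
import torch.nn as nn

class TinyStage(nn.Module):
    """Hypothetical stand-in for one Swin Transformer stage (downsamples by 2)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))

class CompositeBackbone(nn.Module):
    """CBNet-style composite of two backbones: each assisting-stage output
    is projected and added to the lead backbone's feature at the same level."""
    def __init__(self, channels=(3, 32, 64, 128)):
        super().__init__()
        pairs = list(zip(channels[:-1], channels[1:]))
        self.assist = nn.ModuleList(TinyStage(i, o) for i, o in pairs)
        self.lead = nn.ModuleList(TinyStage(i, o) for i, o in pairs)
        self.proj = nn.ModuleList(nn.Conv2d(o, o, 1) for _, o in pairs)

    def forward(self, x):
        a, l = x, x
        feats = []
        for assist, lead, proj in zip(self.assist, self.lead, self.proj):
            a = assist(a)
            l = lead(l) + proj(a)  # composite connection between backbones
            feats.append(l)
        return feats  # multi-scale features for the detection heads

class MultiTaskHead(nn.Module):
    """One refinement stage: text/non-text score, box deltas, mask logits.
    (Text-line recognition is omitted here for brevity.)"""
    def __init__(self, c):
        super().__init__()
        self.cls = nn.Linear(c, 2)      # text / non-text classification
        self.box = nn.Linear(c, 4)      # bounding-box regression
        self.mask = nn.Conv2d(c, 1, 1)  # per-pixel mask logits

    def forward(self, feat):            # feat: (N, C, H, W) RoI features
        pooled = feat.mean(dim=(2, 3))
        return self.cls(pooled), self.box(pooled), self.mask(feat)
```

In a cascade refinement structure, several such heads would run in sequence, each stage refining the boxes produced by the previous one before re-pooling features, as in Hybrid Task Cascade.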
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Ding, N., Peng, L., Liu, C., Zhang, Y., Zhang, R., Li, J. (2023). Incorporating Self-attention Mechanism and Multi-task Learning into Scene Text Detection. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13804. Springer, Cham. https://doi.org/10.1007/978-3-031-25069-9_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25068-2
Online ISBN: 978-3-031-25069-9
eBook Packages: Computer Science (R0)