Incorporating Self-attention Mechanism and Multi-task Learning into Scene Text Detection

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 Workshops (ECCV 2022)

Abstract

In recent years, Mask R-CNN based methods have achieved promising performance on scene text detection tasks. This paper proposes to incorporate a self-attention mechanism and multi-task learning into Mask R-CNN based scene text detection frameworks. For the backbone, the self-attention-based Swin Transformer is adopted to replace the original ResNet backbone, and a composite backbone scheme is further utilized to combine two Swin Transformer networks into a single backbone. For the detection heads, a multi-task learning method with a cascade refinement structure for text/non-text classification, bounding box regression, mask prediction, and text line recognition is proposed. Experiments on the ICDAR MLT 2017 and 2019 datasets show that the proposed method achieves improved performance.
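To make the described architecture more concrete, below is a minimal, self-contained PyTorch sketch of the two ideas in the abstract: a composite backbone that fuses the feature pyramids of two (here heavily simplified, Swin-like) networks, and a cascade of multi-task heads covering text/non-text classification, bounding box regression, mask prediction, and text line recognition. All module names, dimensions, the fusion rule, and the number of cascade stages are illustrative assumptions; this is not the authors' implementation, which builds on the Mask R-CNN / MMDetection framework.

```python
# Hedged sketch of a composite backbone plus cascade multi-task heads.
# Everything here (module names, channel sizes, fusion by addition) is an
# assumption made for illustration only.
import torch
import torch.nn as nn


class TinySwinLikeBackbone(nn.Module):
    """Stand-in for a Swin Transformer feature pyramid (hypothetical)."""

    def __init__(self, in_ch=3, dims=(64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for d in dims:
            self.stages.append(
                nn.Sequential(nn.Conv2d(ch, d, 3, stride=2, padding=1), nn.GELU())
            )
            ch = d

    def forward(self, x, assist_feats=None):
        feats = []
        for i, stage in enumerate(self.stages):
            x = stage(x)
            # Composite-backbone style fusion (CBNet-like, simplified): add the
            # same-level feature map produced by the assisting backbone.
            if assist_feats is not None:
                x = x + assist_feats[i]
            feats.append(x)
        return feats


class CascadeMultiTaskHeads(nn.Module):
    """Cascade of classification/box stages plus mask and recognition heads.

    The number of stages and the recognition vocabulary size are assumptions.
    """

    def __init__(self, feat_dim=256, num_stages=2, vocab_size=100):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(7)
        self.cls_heads = nn.ModuleList(
            [nn.Linear(feat_dim * 7 * 7, 2) for _ in range(num_stages)]
        )
        self.box_heads = nn.ModuleList(
            [nn.Linear(feat_dim * 7 * 7, 4) for _ in range(num_stages)]
        )
        self.mask_head = nn.Conv2d(feat_dim, 1, 1)                  # per-pixel text mask logits
        self.recog_head = nn.Linear(feat_dim * 7 * 7, vocab_size)   # text-line logits

    def forward(self, roi_feat):
        flat = torch.flatten(self.pool(roi_feat), 1)
        cls_logits, box_deltas = None, None
        # A real cascade re-pools features from the refined boxes at every
        # stage; this loop only applies the stage heads in sequence.
        for cls_h, box_h in zip(self.cls_heads, self.box_heads):
            cls_logits = cls_h(flat)    # text / non-text classification
            box_deltas = box_h(flat)    # refined bounding box regression
        mask_logits = self.mask_head(roi_feat)
        recog_logits = self.recog_head(flat)
        return cls_logits, box_deltas, mask_logits, recog_logits


if __name__ == "__main__":
    img = torch.randn(1, 3, 256, 256)
    assisting, lead = TinySwinLikeBackbone(), TinySwinLikeBackbone()
    fused = lead(img, assist_feats=assisting(img))       # composite backbone output
    heads = CascadeMultiTaskHeads(feat_dim=fused[-1].shape[1])
    outputs = heads(fused[-1])                           # treat the last map as one RoI feature
    print([o.shape for o in outputs])
```

In this sketch the assisting backbone's features are injected into the lead backbone level by level, loosely mirroring the composite backbone idea, and all four task heads share the same RoI features, loosely mirroring the multi-task cascade.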



Author information

Corresponding author

Correspondence to Liangrui Peng.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Ding, N., Peng, L., Liu, C., Zhang, Y., Zhang, R., Li, J. (2023). Incorporating Self-attention Mechanism and Multi-task Learning into Scene Text Detection. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13804. Springer, Cham. https://doi.org/10.1007/978-3-031-25069-9_21

  • DOI: https://doi.org/10.1007/978-3-031-25069-9_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25068-2

  • Online ISBN: 978-3-031-25069-9

  • eBook Packages: Computer Science, Computer Science (R0)
