Incorporating Self-attention Mechanism and Multi-task Learning into Scene Text Detection

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 Workshops (ECCV 2022)

Abstract

In recent years, Mask R-CNN based methods have achieved promising performance on scene text detection tasks. This paper proposes to incorporate a self-attention mechanism and multi-task learning into Mask R-CNN based scene text detection frameworks. For the backbone, the self-attention-based Swin Transformer is adopted to replace the original ResNet backbone, and a composite backbone scheme is further utilized to combine two Swin Transformer networks into a single backbone. For the detection heads, a multi-task learning method with a cascade refinement structure for text/non-text classification, bounding box regression, mask prediction, and text line recognition is proposed. Experiments on the ICDAR MLT 2017 and 2019 datasets show that the proposed method achieves improved performance.
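To make the described architecture more concrete, below is a minimal, self-contained PyTorch sketch of the two ideas in the abstract: a composite backbone that fuses the feature pyramids of two (here heavily simplified, Swin-like) networks, and a cascade of multi-task heads covering text/non-text classification, bounding box regression, mask prediction, and text line recognition. All module names, dimensions, the fusion rule, and the number of cascade stages are illustrative assumptions; this is not the authors' implementation, which builds on the Mask R-CNN / MMDetection framework.

```python
# Hedged sketch of a composite backbone plus cascade multi-task heads.
# Everything here (module names, channel sizes, fusion by addition) is an
# assumption made for illustration only.
import torch
import torch.nn as nn


class TinySwinLikeBackbone(nn.Module):
    """Stand-in for a Swin Transformer feature pyramid (hypothetical)."""

    def __init__(self, in_ch=3, dims=(64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for d in dims:
            self.stages.append(
                nn.Sequential(nn.Conv2d(ch, d, 3, stride=2, padding=1), nn.GELU())
            )
            ch = d

    def forward(self, x, assist_feats=None):
        feats = []
        for i, stage in enumerate(self.stages):
            x = stage(x)
            # Composite-backbone style fusion (CBNet-like, simplified): add the
            # same-level feature map produced by the assisting backbone.
            if assist_feats is not None:
                x = x + assist_feats[i]
            feats.append(x)
        return feats


class CascadeMultiTaskHeads(nn.Module):
    """Cascade of classification/box stages plus mask and recognition heads.

    The number of stages and the recognition vocabulary size are assumptions.
    """

    def __init__(self, feat_dim=256, num_stages=2, vocab_size=100):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(7)
        self.cls_heads = nn.ModuleList(
            [nn.Linear(feat_dim * 7 * 7, 2) for _ in range(num_stages)]
        )
        self.box_heads = nn.ModuleList(
            [nn.Linear(feat_dim * 7 * 7, 4) for _ in range(num_stages)]
        )
        self.mask_head = nn.Conv2d(feat_dim, 1, 1)                  # per-pixel text mask logits
        self.recog_head = nn.Linear(feat_dim * 7 * 7, vocab_size)   # text-line logits

    def forward(self, roi_feat):
        flat = torch.flatten(self.pool(roi_feat), 1)
        cls_logits, box_deltas = None, None
        # A real cascade re-pools features from the refined boxes at every
        # stage; this loop only applies the stage heads in sequence.
        for cls_h, box_h in zip(self.cls_heads, self.box_heads):
            cls_logits = cls_h(flat)    # text / non-text classification
            box_deltas = box_h(flat)    # refined bounding box regression
        mask_logits = self.mask_head(roi_feat)
        recog_logits = self.recog_head(flat)
        return cls_logits, box_deltas, mask_logits, recog_logits


if __name__ == "__main__":
    img = torch.randn(1, 3, 256, 256)
    assisting, lead = TinySwinLikeBackbone(), TinySwinLikeBackbone()
    fused = lead(img, assist_feats=assisting(img))       # composite backbone output
    heads = CascadeMultiTaskHeads(feat_dim=fused[-1].shape[1])
    outputs = heads(fused[-1])                           # treat the last map as one RoI feature
    print([o.shape for o in outputs])
```

In this sketch the assisting backbone's features are injected into the lead backbone level by level, loosely mirroring the composite backbone idea, and all four task heads share the same RoI features, loosely mirroring the multi-task cascade.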



Author information

Corresponding author

Correspondence to Liangrui Peng.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Ding, N., Peng, L., Liu, C., Zhang, Y., Zhang, R., Li, J. (2023). Incorporating Self-attention Mechanism and Multi-task Learning into Scene Text Detection. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13804. Springer, Cham. https://doi.org/10.1007/978-3-031-25069-9_21

  • DOI: https://doi.org/10.1007/978-3-031-25069-9_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25068-2

  • Online ISBN: 978-3-031-25069-9

  • eBook Packages: Computer Science, Computer Science (R0)
