
Leveraging Text Localization for Scene Text Removal via Text-Aware Masked Image Modeling

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

The scene text removal (STR) task suffers from insufficient training data due to expensive pixel-level labeling. In this paper, we address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding boxes). Unlike previous pretraining methods, which use indirect auxiliary tasks only to enhance the implicit feature extraction ability, TMIM enables the STR task to be trained directly in a weakly supervised manner, exploring STR knowledge explicitly and efficiently. In TMIM, a Background Modeling stream is first built to learn background generation rules by recovering masked non-text regions; meanwhile, it provides pseudo STR labels on the masked text regions. Second, a Text Erasing stream is proposed to learn from these pseudo labels and equip the model with end-to-end STR ability. Benefiting from the two collaborative streams, our STR model achieves impressive performance using only public text detection datasets, which greatly alleviates the limitation of high-cost STR labels. Experiments demonstrate that our method outperforms other pretraining methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will be available at https://github.com/wzx99/TMIM.
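The core weak-supervision idea in the abstract, generating a pseudo STR label for the text region from backgrounds reconstructed elsewhere, hinges on sampling masked patches that avoid the detected text box. The mask construction can be sketched in NumPy; all names here (`make_masks`, `bg_mask`, the 8-pixel patch size) are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def make_masks(text_box, shape, rng, patch=8):
    """Build the two masks the abstract's streams would train on.

    - text_mask: the detection bounding box (x0, y0, x1, y1); the background
      predicted there serves as a pseudo STR label for the Text Erasing stream.
    - bg_mask: a random patch over a NON-text region; the Background Modeling
      stream learns background generation by reconstructing the pixels under it.
    """
    h, w = shape
    x0, y0, x1, y1 = text_box
    text_mask = np.zeros((h, w), dtype=bool)
    text_mask[y0:y1, x0:x1] = True

    # Rejection-sample a square patch until it avoids the text region,
    # so the Background Modeling stream only ever reconstructs background.
    for _ in range(100):
        py = int(rng.integers(0, h - patch + 1))
        px = int(rng.integers(0, w - patch + 1))
        bg_mask = np.zeros((h, w), dtype=bool)
        bg_mask[py:py + patch, px:px + patch] = True
        if not (bg_mask & text_mask).any():
            return bg_mask, text_mask
    raise RuntimeError("could not place a non-text patch")
```

A training step would then mask the image with `bg_mask` for the reconstruction loss and apply the erasing model under `text_mask`, supervised by the first stream's prediction there.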



Acknowledgements

This work is supported by the National Key Research and Development Program of China (2022YFB3104700) and the National Natural Science Foundation of China (62121002, U23B2028, 62102384). We acknowledge the support of the GPU cluster built by the MCC Lab of the Information Science and Technology Institution, USTC. We also thank the USTC supercomputing center for providing computational resources for this project.

Author information


Corresponding author

Correspondence to Hongtao Xie.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 5553 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, Z., Xie, H., Wang, Y., Qu, Y., Guo, F., Liu, P. (2025). Leveraging Text Localization for Scene Text Removal via Text-Aware Masked Image Modeling. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15124. Springer, Cham. https://doi.org/10.1007/978-3-031-72848-8_21


  • DOI: https://doi.org/10.1007/978-3-031-72848-8_21


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72847-1

  • Online ISBN: 978-3-031-72848-8

  • eBook Packages: Computer Science (R0)
