
Leveraging Text Localization for Scene Text Removal via Text-Aware Masked Image Modeling

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

The scene text removal (STR) task suffers from insufficient training data due to expensive pixel-level labeling. In this paper, we address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding boxes). Unlike previous pretraining methods, which use indirect auxiliary tasks only to enhance the implicit feature extraction ability, TMIM enables the STR task to be trained directly in a weakly supervised manner, exploring STR knowledge explicitly and efficiently. In TMIM, a Background Modeling stream is first built to learn background generation rules by recovering masked non-text regions; meanwhile, it provides pseudo STR labels on the masked text regions. Second, a Text Erasing stream is proposed to learn from these pseudo labels and equip the model with end-to-end STR ability. Benefiting from the two collaborative streams, our STR model achieves impressive performance using only public text detection datasets, which greatly alleviates the limitation of high-cost STR labels. Experiments demonstrate that our method outperforms other pretraining methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will be available at https://github.com/wzx99/TMIM.
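The core weak-supervision idea in the abstract, generating a pseudo STR label for the text region from backgrounds reconstructed elsewhere, hinges on sampling masked patches that avoid the detected text box. The mask construction can be sketched in NumPy; all names here (`make_masks`, `bg_mask`, the 8-pixel patch size) are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def make_masks(text_box, shape, rng, patch=8):
    """Build the two masks the abstract's streams would train on.

    - text_mask: the detection bounding box (x0, y0, x1, y1); the background
      predicted there serves as a pseudo STR label for the Text Erasing stream.
    - bg_mask: a random patch over a NON-text region; the Background Modeling
      stream learns background generation by reconstructing the pixels under it.
    """
    h, w = shape
    x0, y0, x1, y1 = text_box
    text_mask = np.zeros((h, w), dtype=bool)
    text_mask[y0:y1, x0:x1] = True

    # Rejection-sample a square patch until it avoids the text region,
    # so the Background Modeling stream only ever reconstructs background.
    for _ in range(100):
        py = int(rng.integers(0, h - patch + 1))
        px = int(rng.integers(0, w - patch + 1))
        bg_mask = np.zeros((h, w), dtype=bool)
        bg_mask[py:py + patch, px:px + patch] = True
        if not (bg_mask & text_mask).any():
            return bg_mask, text_mask
    raise RuntimeError("could not place a non-text patch")
```

A training step would then mask the image with `bg_mask` for the reconstruction loss and apply the erasing model under `text_mask`, supervised by the first stream's prediction there.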



Acknowledgements

This work is supported by the National Key Research and Development Program of China (2022YFB3104700) and the National Natural Science Foundation of China (62121002, U23B2028, 62102384). We acknowledge the support of the GPU cluster built by the MCC Lab of the Information Science and Technology Institution, USTC. We also thank the USTC supercomputing center for providing computational resources for this project.

Author information


Corresponding author

Correspondence to Hongtao Xie.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 5553 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, Z., Xie, H., Wang, Y., Qu, Y., Guo, F., Liu, P. (2025). Leveraging Text Localization for Scene Text Removal via Text-Aware Masked Image Modeling. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15124. Springer, Cham. https://doi.org/10.1007/978-3-031-72848-8_21


  • DOI: https://doi.org/10.1007/978-3-031-72848-8_21


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72847-1

  • Online ISBN: 978-3-031-72848-8

  • eBook Packages: Computer Science (R0)
