EAFormer: Scene Text Segmentation with Edge-Aware Transformers

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Scene text segmentation aims to crop text from scene images and is commonly used to help generative models edit or remove text. Existing text segmentation methods tend to involve various text-related supervision signals for better performance. However, most of them ignore the importance of text edges, which are significant for downstream applications. In this paper, we propose Edge-Aware Transformers, termed EAFormer, to segment text more accurately, especially at text edges. Specifically, we first design a text edge extractor to detect edges and filter out the edges of non-text areas. Then, we propose an edge-guided encoder to make the model focus more on text edges. Finally, an MLP-based decoder is employed to predict text masks. We have conducted extensive experiments on commonly used benchmarks to verify the effectiveness of EAFormer. The experimental results demonstrate that the proposed method performs better than previous methods, especially on the segmentation of text edges. Considering that the annotations of several benchmarks (e.g., COCO_TS and MLT_S) are not accurate enough to fairly evaluate our method, we have relabeled these datasets. Through experiments, we observe that our method achieves a larger performance improvement when more accurate annotations are used for training. The code and datasets are available at https://hyangyu.github.io/EAFormer/.
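
As a rough illustration of the first step described above (detect edges, then filter out the edges of non-text areas), the following minimal sketch may help. It is not the authors' implementation: the gradient-magnitude detector is a simple stand-in for whichever edge detector the paper uses, the text-region mask is assumed to be given, and the names `detect_edges`, `filter_text_edges`, and `toy_example` are hypothetical.

```python
import numpy as np


def detect_edges(gray: np.ndarray, threshold: float = 0.25) -> np.ndarray:
    """Return a boolean edge map from normalized gradient magnitude.

    Stand-in detector (assumption): central-difference gradients,
    thresholded relative to the strongest response in the image.
    """
    gy, gx = np.gradient(gray.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    peak = magnitude.max()
    if peak == 0:
        # Flat image: no edges anywhere.
        return np.zeros(gray.shape, dtype=bool)
    return magnitude / peak >= threshold


def filter_text_edges(edges: np.ndarray, text_region: np.ndarray) -> np.ndarray:
    """Keep only the edge pixels that fall inside the (assumed) text region."""
    return np.logical_and(edges, text_region)


def toy_example():
    """Two bright squares; only the left one is marked as a text area."""
    image = np.zeros((8, 16))
    image[2:6, 2:6] = 1.0      # pretend this square is text
    image[2:6, 10:14] = 1.0    # pretend this square is background clutter
    text_region = np.zeros((8, 16), dtype=bool)
    text_region[:, :8] = True  # hypothetical text-area mask
    edges = detect_edges(image)
    text_edges = filter_text_edges(edges, text_region)
    return edges, text_edges


if __name__ == "__main__":
    edges, text_edges = toy_example()
    print("edge pixels:", int(edges.sum()))
    print("text edge pixels:", int(text_edges.sum()))
```

In this toy case the edges of the right-hand square are discarded because they lie outside the mask, which is the kind of filtering the abstract describes; how the text areas are actually obtained, and how the filtered edges guide the encoder, is left to the paper.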

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 62176060), STCSM project (No. 22511105000), Shanghai Municipal Science and Technology Major Project (No. 2021SHZDZX0103), and the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning.

Author information

Corresponding author

Correspondence to Bin Li.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 21721 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yu, H., Fu, T., Li, B., Xue, X. (2025). EAFormer: Scene Text Segmentation with Edge-Aware Transformers. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15083. Springer, Cham. https://doi.org/10.1007/978-3-031-72698-9_24

  • DOI: https://doi.org/10.1007/978-3-031-72698-9_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72697-2

  • Online ISBN: 978-3-031-72698-9

  • eBook Packages: Computer Science (R0)
