Abstract
Scene text segmentation aims to extract text from scene images at the pixel level, and is commonly used to help generative models edit or remove text. Existing text segmentation methods tend to introduce various text-related supervision signals to improve performance. However, most of them overlook text edges, which matter a great deal for downstream applications. In this paper, we propose Edge-Aware Transformers, termed EAFormer, to segment text more accurately, especially at text edges. Specifically, we first design a text edge extractor that detects edges and filters out those belonging to non-text areas. Then, we propose an edge-guided encoder that makes the model focus more on text edges. Finally, an MLP-based decoder is employed to predict text masks. We have conducted extensive experiments on commonly used benchmarks to verify the effectiveness of EAFormer. The experimental results demonstrate that the proposed method performs better than previous methods, especially on the segmentation of text edges. Because the annotations of several benchmarks (e.g., COCO_TS and MLT_S) are not accurate enough to evaluate our method fairly, we have relabeled these datasets. Through experiments, we observe that our method achieves a larger performance improvement when more accurate annotations are used for training. The code and datasets are available at https://hyangyu.github.io/EAFormer/.
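The first step described above (detect edges, then discard those outside text areas) can be illustrated with a minimal sketch. This is not the EAFormer implementation: the gradient-magnitude detector below is a simple stand-in for a real edge detector, and the function names, the threshold, and the assumption that a binary text-region mask is already available are all hypothetical choices made for illustration.

```python
import numpy as np


def gradient_edges(gray, threshold=0.2):
    """Binary edge map from normalized gradient magnitude.

    Stand-in for a proper edge detector; `threshold` is an
    illustrative value, not one taken from the paper.
    """
    gray = gray.astype(np.float64)
    gy, gx = np.gradient(gray)   # derivatives along rows, then columns
    mag = np.hypot(gx, gy)
    peak = mag.max()
    if peak == 0:                # flat image: no edges at all
        return np.zeros(gray.shape, dtype=bool)
    return (mag / peak) > threshold


def filter_text_edges(edges, text_region):
    """Keep only edge pixels that fall inside the text-region mask.

    `text_region` is assumed to come from some text detector.
    """
    return edges & text_region.astype(bool)


# Toy image: one "text" block and one non-text block.
img = np.zeros((8, 8))
img[1:4, 1:4] = 1.0              # pretend this is a text stroke
img[5:7, 5:7] = 1.0              # pretend this is background clutter

# Hypothetical text-region mask covering only the first block.
text_region = np.zeros((8, 8), dtype=bool)
text_region[0:5, 0:5] = True

edges = gradient_edges(img)
text_edges = filter_text_edges(edges, text_region)
```

In this toy case `edges` fires around both blocks, while `text_edges` retains only the responses around the block covered by the mask. An edge map of this kind is what an edge-guided encoder would consume as additional guidance.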
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (No. 62176060), STCSM project (No. 22511105000), Shanghai Municipal Science and Technology Major Project (No. 2021SHZDZX0103), and the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yu, H., Fu, T., Li, B., Xue, X. (2025). EAFormer: Scene Text Segmentation with Edge-Aware Transformers. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15083. Springer, Cham. https://doi.org/10.1007/978-3-031-72698-9_24
DOI: https://doi.org/10.1007/978-3-031-72698-9_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72697-2
Online ISBN: 978-3-031-72698-9
eBook Packages: Computer Science (R0)