research-article

Scene Text Segmentation with Text-Focused Transformers

Authors:

Xiangyang Xue

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 2898 - 2907

https://doi.org/10.1145/3581783.3611755

Published: 27 October 2023

Abstract

Text segmentation is a crucial aspect of various text-related tasks, including text erasing, text editing, and font style transfer. In recent years, multiple text segmentation datasets have been proposed, such as TextSeg, which focuses on Latin text segmentation, and BTS, which focuses on bilingual text segmentation. However, existing methods either disregard the text location annotations or directly use pre-trained text detectors, and therefore cannot fully utilize the text location annotations in these datasets. To explicitly incorporate text location information to guide text segmentation, we propose an end-to-end text-focused segmentation framework in which text detection and segmentation are jointly optimized. In the proposed framework, we first extract multi-level global visual features through residual convolution blocks and then predict the mask of text areas using a text detection head. Subsequently, we develop a text-focused module that compels the model to pay more attention to text areas. Specifically, we introduce two types of attention masks to extract the corresponding features: text-aware features and instance-aware features. Finally, we employ hierarchical Transformer encoders to fuse multi-level features and predict the text mask with a text segmentation head. To evaluate the effectiveness of our method, we conduct experiments on six text segmentation benchmarks. The experimental results demonstrate that the proposed method outperforms the previous state-of-the-art (SOTA) methods by a clear margin in most cases. The code and supplementary materials are available at https://github.com/FudanVI/FudanOCR/tree/main/text-focused-Transformers.
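The abstract does not specify how the two attention masks are constructed or applied, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a flattened token sequence with a per-token instance id (0 for background, k > 0 for text instance k), builds a "text-aware" mask that lets every token attend to all text tokens and an "instance-aware" mask that restricts attention to tokens of the same text instance, and applies each mask in a single-head self-attention without learned projections. The function names, the instance-id input, and the choice to always let a token attend to itself are all assumptions made for illustration.

```python
import numpy as np


def build_attention_masks(instance_ids):
    """Return (text_aware, instance_aware) boolean masks of shape (N, N).

    instance_ids: (N,) ints, 0 for background, k > 0 for text instance k.
    Entry [i, j] is True when query token i may attend to key token j.
    """
    ids = np.asarray(instance_ids)
    n = ids.size
    is_text = ids > 0
    # Assumption: a token may always attend to itself, so no row is empty.
    eye = np.eye(n, dtype=bool)
    # Text-aware: any query may attend to any key inside a text area.
    text_aware = np.broadcast_to(is_text[None, :], (n, n)) | eye
    # Instance-aware: a query attends only to keys of its own text instance.
    same_instance = ids[:, None] == ids[None, :]
    instance_aware = (same_instance & is_text[None, :]) | eye
    return text_aware, instance_aware


def masked_attention(features, allow):
    """Single-head self-attention (Q = K = V = features) with a boolean mask."""
    x = np.asarray(features, dtype=np.float64)
    logits = x @ x.T / np.sqrt(x.shape[1])
    # Disallowed pairs get -inf so they receive zero weight after softmax.
    logits = np.where(allow, logits, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x, weights


if __name__ == "__main__":
    ids = [0, 1, 1, 2]  # one background token and two text instances
    feats = np.random.default_rng(0).normal(size=(4, 8))
    text_aware, instance_aware = build_attention_masks(ids)
    text_feats, _ = masked_attention(feats, text_aware)
    inst_feats, _ = masked_attention(feats, instance_aware)
    print(text_feats.shape, inst_feats.shape)
```

In a jointly trained framework such as the one described above, the instance ids would presumably be derived from the detection head's predicted text mask rather than supplied by hand; the sketch leaves that step out.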



Index Terms

Scene Text Segmentation with Text-Focused Transformers
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Image segmentation


Published In


MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN:9798400701085

DOI:10.1145/3581783

General Chairs: Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE), Tao Mei (HiDream.ai, China), Rita Cucchiara (University of Modena and Reggio Emilia, Italy)

Program Chairs: Marco Bertini (University of Florence, Italy), Diana Patricia Tobon Vallejo (Universidad de Medellin, Colombia), Pradeep K. Atrey (University at Albany, State University of New York, USA), M. Shamim Hossain (King Saud University, KSA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
