
Scene Text Segmentation with Text-Focused Transformers

Published: 27 October 2023

Abstract

Text segmentation is a crucial component of various text-related tasks, including text erasing, text editing, and font style transfer. In recent years, multiple text segmentation datasets have been proposed, such as TextSeg, which focuses on Latin text segmentation, and BTS, which focuses on bilingual text segmentation. However, existing methods either disregard the text location annotations or directly use pre-trained text detectors, so they cannot fully utilize the text location annotations in the datasets. To explicitly incorporate text location information to guide text segmentation, we propose an end-to-end text-focused segmentation framework in which text detection and segmentation are jointly optimized. In the proposed framework, we first extract multi-level global visual features through residual convolution blocks and then predict the mask of text areas using a text detection head. Subsequently, we develop a text-focused module that compels the model to pay more attention to text areas. Specifically, we introduce two types of attention masks to extract the corresponding text-aware and instance-aware features. Finally, we employ hierarchical Transformer encoders to fuse multi-level features and predict the text mask with a text segmentation head. To evaluate the effectiveness of our method, we conduct experiments on six text segmentation benchmarks. The experimental results demonstrate that the proposed method outperforms the previous state-of-the-art (SOTA) methods by a clear margin in most cases. The code and supplementary materials are available at https://github.com/FudanVI/FudanOCR/tree/main/text-focused-Transformers.
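
The abstract does not spell out how the attention masks are applied, so the following is only a minimal sketch of one common way a predicted text mask can steer attention: scaled dot-product attention in which keys outside the mask are excluded before the softmax. The function name `masked_attention`, the flat list-of-vectors layout, and the hard 0/1 mask are assumptions made for illustration and are not taken from the paper or its repository.

```python
import math


def masked_attention(queries, keys, values, key_mask):
    """Scaled dot-product attention restricted to keys where key_mask is 1.

    queries, keys, values: lists of float vectors (keys and values align).
    key_mask: list of 0/1 flags, one per key (1 = predicted text position).
    This is an illustrative stand-in, not the paper's text-focused module.
    """
    if not any(key_mask):
        raise ValueError("key_mask must keep at least one position")
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Masked-out keys get -inf so the softmax assigns them zero weight.
        scores = [
            scale * sum(qi * ki for qi, ki in zip(q, k)) if keep else float("-inf")
            for k, keep in zip(keys, key_mask)
        ]
        top = max(scores)  # finite, because at least one key is kept
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Hypothetical toy input: three positions, only the first flagged as text.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_mask = [1, 0, 0]
attended = masked_attention(features, features, features, text_mask)
```

With only one position kept, every query attends solely to that position, so each output row equals the first feature vector. A soft mask (probabilities instead of 0/1 flags) would need a different weighting scheme than the hard exclusion used here.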


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. end-to-end framework
  2. scene text segmentation
  3. text-focused transformer

Qualifiers

  • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

