DOI:10.1145/3581783.3612488
research-article

STIRER: A Unified Model for Low-Resolution Scene Text Image Recovery and Recognition

Published: 27 October 2023

Abstract

Although scene text recognition (STR) on high-resolution (HR) images has achieved significant success in recent years, recognizing text in low-resolution (LR) images remains challenging. This has motivated work on scene text image super-resolution (STISR), which generates super-resolution (SR) images from LR inputs so that STR can then be performed on the generated SR images to boost recognition performance. However, existing methods have two major drawbacks: 1) STISR models may generate imperfect SR images, which mislead the subsequent recognition; 2) because STISR models are optimized for high recognition accuracy, the fidelity of the SR images may be degraded. Consequently, neither the recognition performance of STR nor the fidelity of STISR is satisfactory. In this paper, we propose a novel model called STIRER (Scene Text Image REcovery and Recognition) that simultaneously recovers and recognizes LR scene text images under a unified framework. Concretely, STIRER consists of a feature encoder that extracts pixel features and two dedicated decoders that, based on the encoded features and the raw LR images, generate SR images and recognize text, respectively. We propose a progressive scene text swin transformer architecture as the encoder to enrich the pixel feature representations for better recovery and recognition. Extensive experiments on two LR datasets show that our model outperforms existing methods in recognition performance, super-resolution fidelity, and computational cost. The STIRER code is available at https://github.com/zhaominyiz/STIRER.
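
The data flow described in the abstract (one shared encoder whose pixel features, together with the raw LR image, feed both a recovery decoder and a recognition decoder) can be illustrated with a minimal sketch. This is not the authors' implementation: every function below (`encode`, `decode_sr`, `decode_text`, `unified_forward`) is a hypothetical toy stand-in, the "encoder" is a simple 3-tap mean rather than a progressive scene text swin transformer, and the two-character charset is an assumption made purely so the example runs. The actual model is in the repository linked above.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image: rows of pixel intensities in [0, 1]


def encode(lr: Image) -> Image:
    """Toy stand-in for the feature encoder: one feature per pixel
    (3-tap horizontal mean with edge clamping)."""
    feats = []
    for row in lr:
        w = len(row)
        feats.append([
            (row[max(x - 1, 0)] + row[x] + row[min(x + 1, w - 1)]) / 3.0
            for x in range(w)
        ])
    return feats


def decode_sr(feats: Image, lr: Image, scale: int = 2) -> Image:
    """Toy recovery decoder: blend the features with the raw LR pixels,
    then upsample by nearest-neighbour repetition."""
    blended = [[0.5 * f + 0.5 * p for f, p in zip(frow, prow)]
               for frow, prow in zip(feats, lr)]
    sr = []
    for row in blended:
        wide = [v for v in row for _ in range(scale)]
        sr.extend(list(wide) for _ in range(scale))
    return sr


def decode_text(feats: Image, lr: Image, charset: str = "ab") -> str:
    """Toy recognition decoder: emit one character per image column by
    thresholding the mean of the blended features and raw LR pixels."""
    h = len(lr)
    chars = []
    for x in range(len(lr[0])):
        score = sum(0.5 * feats[y][x] + 0.5 * lr[y][x] for y in range(h)) / h
        chars.append(charset[1] if score > 0.5 else charset[0])
    return "".join(chars)


def unified_forward(lr: Image) -> Tuple[Image, str]:
    """Run the encoder once, then both decoders on (features, raw LR image)."""
    feats = encode(lr)
    return decode_sr(feats, lr), decode_text(feats, lr)


if __name__ == "__main__":
    sr_image, text = unified_forward([[0.0, 1.0], [0.0, 1.0]])
    print(len(sr_image), len(sr_image[0]), text)
```

The point of the sketch is structural: the encoder runs once, and neither decoder consumes the other's output, so an imperfect SR image cannot mislead recognition the way it can in a pipeline that recognizes text from the generated SR image.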

Supplemental Material

MP4 File
Presentation video for STIRER: A Unified Model for Low-Resolution Scene Text Image Recovery and Recognition



    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. scene text image super-resolution
    2. scene text recognition
    3. transformer

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China (NSFC)

    Conference

    MM '23
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Cited By

    • (2025) HiREN: Towards higher supervision quality for better scene text image super-resolution. Neurocomputing (129309). DOI:10.1016/j.neucom.2024.129309. Online publication date: Jan-2025
    • (2025) QT-TextSR. Neurocomputing 620:C. DOI:10.1016/j.neucom.2024.129241. Online publication date: 1-Mar-2025
    • (2024) Scene Text Image Super-Resolution Guided by Frequency Domain Enhancement and Feature Refinement. 2024 IEEE First International Conference on Data Intelligence and Innovative Application (DIIA) (1-7). DOI:10.1109/DIIA62678.2024.10871446. Online publication date: 23-Nov-2024
    • (2024) Network architecture for single image super-resolution: A comprehensive review and comparison. IET Image Processing 18:9 (2215-2243). DOI:10.1049/ipr2.13100. Online publication date: 20-May-2024
    • (2024) Batch-transformer for scene text image super-resolution. The Visual Computer 40:10 (7399-7409). DOI:10.1007/s00371-024-03598-7. Online publication date: 29-Aug-2024
