Semi- and Self-supervised Learning for Scene Text Recognition with Fewer Labels

Sun, Cheng; Cheng, Juntao; Du, Cheng

doi:10.1007/978-3-031-18913-5_23

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13536))

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

Most existing scene text recognition methods are trained on synthetic datasets and then evaluated on real-world datasets. Real datasets are rarely used to train scene text recognition models because their labels are more difficult and costly to obtain than those of synthetic datasets. With the development of self-supervised learning, many recent methods apply Siamese neural networks and contrastive learning to unlabeled data for pretraining and then use the trained encoder for downstream tasks. However, a single self-supervised model may not suit every downstream task. We therefore propose a self-supervised algorithm, including data augmentation and loss functions, together with an improved semi-supervised learning method, tailored to the specific downstream task of scene text recognition. By using unlabeled data in both semi- and self-supervised methods, we improve scene text recognition.
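As a rough illustration of the Siamese pretraining idea mentioned in the abstract, the sketch below computes a symmetric negative cosine similarity between the embeddings of two augmented views of the same unlabeled image. It is a generic objective of the kind used in simple Siamese representation learning, not the loss proposed in this chapter; the function names and the plain-list vector representation are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def siamese_loss(p1, z1, p2, z2):
    """Symmetric negative cosine similarity for two views of one image.

    p1, p2: predictor outputs for view 1 and view 2 (hypothetical names).
    z1, z2: encoder outputs for view 1 and view 2; in a real training
    loop these targets would be treated as constants (stop-gradient).
    The loss is minimal (-1) when each prediction aligns with the
    other view's embedding.
    """
    return -0.5 * (cosine_similarity(p1, z2) + cosine_similarity(p2, z1))
```

In an actual pipeline the vectors would come from an encoder applied to two random augmentations of the same text image, and the encoder trained this way would then be reused for the recognition task.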

Acknowledgement

We thank our colleagues at the Kingsoft Office AI R&D Department for their help, in particular Dong Yao, Cheng Du, Ronghua Chen, Juntao Cheng, Junyu Huang, and Yushun Zhou, for useful discussions and help with GPU resources.

Author information

Authors and Affiliations

Kingsoft Office AI R&D Department, ZhuHai, GuangDong, China
Cheng Sun, Juntao Cheng & Cheng Du

Corresponding author

Correspondence to Cheng Du.

Editor information

Editors and Affiliations

Southern University of Science and Technology, Shenzhen, China
Shiqi Yu
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Zhaoxiang Zhang
Hong Kong Baptist University, Hong Kong, China
Pong C. Yuen
Northwestern Polytechnical University, Xi'an, China
Junwei Han
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Tieniu Tan
Hong Kong Baptist University, Hong Kong, China
Yike Guo
Sun Yat-sen University, Guangzhou, China
Jianhuang Lai
Southern University of Science and Technology, Shenzhen, China
Jianguo Zhang

Cite this paper

Sun, C., Cheng, J., Du, C. (2022). Semi- and Self-supervised Learning for Scene Text Recognition with Fewer Labels. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_23

DOI: https://doi.org/10.1007/978-3-031-18913-5_23
Published: 27 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18912-8
Online ISBN: 978-3-031-18913-5
eBook Packages: Computer Science, Computer Science (R0)
