Abstract
Chinese text line recognition has been applied in a wide variety of scenarios. As an ideographic writing system, Chinese characters carry rich semantic information and are composed of basic components. However, previous methods mainly convert each Chinese character into a discrete label to facilitate the computation of cross-entropy loss, leaving fine-grained glyph information (e.g., strokes and radicals) and semantic information unexploited. Glyph information is crucial for recognizing Chinese characters with similar appearances, since such characters differ only slightly in local strokes; it reflects these differences and guides the model to learn fine-grained local features. Moreover, compared to discrete category labels, character semantic information introduces diverse visual concepts that enrich the final character representation. This paper presents a Chinese text recognition method that exploits glyph and character semantic information to acquire effective text representations. Specifically, we propose a Glyph-Aware Decoder that identifies characters by dynamically fusing global visual features with local stroke and radical features, and we introduce a Contrastive Visual–Textual Learning module that enhances the visual features of Chinese characters with their semantic information. Experiments show that the proposed model achieves state-of-the-art results on Chinese text recognition benchmarks.
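The Contrastive Visual–Textual Learning module described above pairs each character's visual features with a semantic (textual) embedding. A minimal NumPy sketch of a CLIP-style symmetric contrastive (InfoNCE) objective of this general kind is given below; the function name, batch layout, and temperature value are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def contrastive_vt_loss(visual, textual, temperature=0.07):
    """Symmetric InfoNCE loss between visual and textual (semantic)
    character embeddings. Matching pairs share the same row index;
    all other rows in the batch act as negatives."""
    # L2-normalize both embedding batches so dot products are cosines
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = textual / np.linalg.norm(textual, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B) pairwise similarity matrix
    labels = np.arange(len(v))      # correct match sits on the diagonal

    def cross_entropy(lg):
        # numerically stable log-softmax over each row
        lg = lg - lg.max(axis=1, keepdims=True)
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # average the visual-to-textual and textual-to-visual directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each character's visual feature toward its own semantic embedding and pushes it away from the embeddings of other characters in the batch, which is how semantic information can enrich the visual representation.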
Acknowledgements
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDC08020400).
About this article
Cite this article
Wu, S., Li, Y. & Wang, Z. Chinese text recognition enhanced by glyph and character semantic information. IJDAR 27, 45–56 (2024). https://doi.org/10.1007/s10032-023-00444-9