Abstract
Mongolian handwritten text recognition is challenging due to the unique characteristics of the Mongolian script, its large vocabulary, and the presence of out-of-vocabulary (OOV) words. This paper proposes a model that uses a local aggregation BiLSTM for sequence modeling of visual features and a Transformer for word prediction. Specifically, we introduce a local aggregation operation into the BiLSTM (Bidirectional Long Short-Term Memory) that improves contextual understanding by aggregating information from adjacent time steps. The improved BiLSTM can capture the context-dependent changes in letter shape that occur in different positions, effectively addressing the difficulty of accurately recognizing variable letters and generating OOV words without relying on a predefined word list during training. The contextual features extracted by the BiLSTM are passed through multiple Transformer encoder and decoder layers. Each layer can access the representations of the previous layer, allowing the layered representations to be progressively refined. These hierarchical representations enable accurate predictions even in large-vocabulary text recognition tasks. Our proposed model achieves state-of-the-art performance on two commonly used Mongolian handwritten text recognition datasets.
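The local aggregation operation described above can be illustrated with a minimal sketch. Here it is assumed to be a windowed averaging of each time step with its immediate neighbours before the sequence enters the BiLSTM; the paper's actual operation (e.g. learned aggregation weights) may differ, and the function name and window size are illustrative only.

```python
import numpy as np

def local_aggregation(x, window=3):
    """Replace each time step's feature vector with the mean of a
    local window of adjacent time steps (zero-padded at the edges).

    x: array of shape (T, D) - T time steps of D-dim visual features.
    Returns an array of the same shape, where each step now carries
    aggregated neighbourhood information, as in the abstract's
    description of aggregating adjacent information at each step.
    """
    T, D = x.shape
    pad = window // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad along time axis
    return np.stack([xp[t:t + window].mean(axis=0) for t in range(T)])

# Example: a sequence of 4 steps with 2-dim features keeps its shape,
# but each step now mixes in its neighbours.
feats = np.ones((4, 2))
agg = local_aggregation(feats)
```

In a full pipeline, `agg` would then be fed to a BiLSTM and the resulting contextual features to the Transformer encoder-decoder; the sketch only isolates the aggregation step itself.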
Acknowledgement
This study is supported by the Project for Science and Technology of Inner Mongolia Autonomous Region under Grant 2019GG281, the Natural Science Foundation of Inner Mongolia Autonomous Region under Grant 2019ZD14, the Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region under Grant NJYT-20-A05, the Fund for Supporting the Reform and Development of Local Universities (Disciplinary Construction), and the construction project of the "Inner Mongolia Science and Technology Achievement Transfer and Transformation Demonstration Zone, University Collaborative Innovation Base, and University Entrepreneurship Training Base" (Supercomputing Power Project: 21300-231510).
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Li, Y., Wei, H., Sun, S. (2024). LABT: A Sequence-to-Sequence Model for Mongolian Handwritten Text Recognition with Local Aggregation BiLSTM and Transformer. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14805. Springer, Cham. https://doi.org/10.1007/978-3-031-70536-6_21
Print ISBN: 978-3-031-70535-9
Online ISBN: 978-3-031-70536-6