
LABT: A Sequence-to-Sequence Model for Mongolian Handwritten Text Recognition with Local Aggregation BiLSTM and Transformer

  • Conference paper
Document Analysis and Recognition - ICDAR 2024 (ICDAR 2024)

Abstract

Mongolian handwritten text recognition is challenging due to the unique characteristics of the Mongolian script, its large vocabulary, and the presence of out-of-vocabulary (OOV) words. This paper proposes a model that uses a local aggregation BiLSTM for sequence modeling of visual features and a Transformer for word prediction. Specifically, we introduce a local aggregation operation into the BiLSTM (Bidirectional Long Short-Term Memory) that aggregates adjacent information at each time step to improve contextual understanding. The improved BiLSTM is able to capture context-dependent letter-shape variations that occur in different contexts. It effectively addresses the difficulty of accurately identifying variable letters and can generate OOV words without relying on a predefined word list during training. The contextual features extracted by the BiLSTM are passed through multiple layers of the Transformer's encoder and decoder. At each layer, the representations of the previous layers remain accessible, allowing the layered representations to be progressively refined. By using these hierarchical representations, accurate predictions can be made even in large-vocabulary text recognition tasks. Our proposed model achieves state-of-the-art performance on two commonly used Mongolian handwritten text recognition datasets.
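The local-aggregation idea described above can be sketched as follows. This is a minimal illustrative reading, not the authors' implementation: here a depthwise 1D convolution pools a small window of adjacent time steps into each position before a bidirectional LSTM models the sequence; the module name, window size, and residual combination are all assumptions.

```python
import torch
import torch.nn as nn

class LocalAggregationBiLSTM(nn.Module):
    """Hypothetical sketch: aggregate adjacent frames, then run a BiLSTM."""

    def __init__(self, feat_dim: int, hidden_dim: int, window: int = 3):
        super().__init__()
        # Depthwise 1D convolution pools a local window around each time
        # step; padding keeps the sequence length unchanged.
        self.aggregate = nn.Conv1d(
            feat_dim, feat_dim, kernel_size=window,
            padding=window // 2, groups=feat_dim)
        self.bilstm = nn.LSTM(
            feat_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) visual features from a CNN backbone
        local = self.aggregate(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.bilstm(x + local)  # inject local context residually
        return out  # (batch, time, 2 * hidden_dim), fed to the Transformer

feats = torch.randn(2, 40, 128)           # e.g. 40 frames of 128-d features
ctx = LocalAggregationBiLSTM(128, 256)(feats)
print(tuple(ctx.shape))                   # (2, 40, 512)
```

The bidirectional output (concatenated forward and backward states) would then serve as the memory for the Transformer encoder-decoder described in the abstract.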



Acknowledgement

This study is supported by the Project for Science and Technology of Inner Mongolia Autonomous Region under Grant 2019GG281, the Natural Science Foundation of Inner Mongolia Autonomous Region under Grant 2019ZD14, the Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region under Grant NJYT-20-A05, the fund of supporting the reform and development of local universities (Disciplinary Construction) and construction project of “Inner Mongolia Science and Technology Achievement Transfer and Transformation Demonstration Zone, University Collaborative Innovation Base, and University Entrepreneurship Training Base” (Supercomputing Power Project: 21300-231510).

Author information


Corresponding author

Correspondence to Hongxi Wei.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, Y., Wei, H., Sun, S. (2024). LABT: A Sequence-to-Sequence Model for Mongolian Handwritten Text Recognition with Local Aggregation BiLSTM and Transformer. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14805. Springer, Cham. https://doi.org/10.1007/978-3-031-70536-6_21


  • DOI: https://doi.org/10.1007/978-3-031-70536-6_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70535-9

  • Online ISBN: 978-3-031-70536-6

  • eBook Packages: Computer Science, Computer Science (R0)
