Arbitrary-Shaped Scene Text Recognition with Deformable Ensemble Attention

  • Conference paper
Pattern Recognition (ICPR 2024)

Abstract

Scene text recognition (STR) is a challenging task that aims to automatically localize and recognize text in varied natural scenes. Although the performance of STR methods has improved significantly, the problem is far from solved, especially for text with complex shapes and intricate backgrounds. To increase accuracy on arbitrary-shaped text and robustness to interference such as noise and adjacent objects, we propose a novel deformable ensemble attention model and a scene text recognition network, DEATRN, built on it. The attention model combines the flexibility of an ensemble of deformable 2D local attentions, which retrieve discriminative character features, with constraints on the regularity of the overall text shape described by a parametric centerline, effectively enhancing the recognition performance of DEATRN. We also propose effective text geometry-based loss terms that improve the accuracy of the attention. Experimental results show the superiority of DEATRN in recognizing arbitrary-shaped text in real scenarios.
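The following is a minimal, self-contained sketch of the idea described above: an ensemble of deformable 2D local attentions whose sampling locations are anchored to reference points on a parametric text centerline. It is an illustrative assumption rather than the authors' DEATRN implementation; the class name `DeformableEnsembleAttention`, the quadratic centerline parameterization, the PyTorch API choices, and hyperparameters such as the number of heads, sampling points, and the 0.1 offset scale are all hypothetical.

```python
# Hypothetical sketch of deformable ensemble attention anchored to a text centerline.
# Not the authors' DEATRN code; names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableEnsembleAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, num_points: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.num_points = num_points
        # Each head predicts local sampling offsets and attention weights from the query.
        self.offset_proj = nn.Linear(dim, num_heads * num_points * 2)
        self.weight_proj = nn.Linear(dim, num_heads * num_points)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query: torch.Tensor, feat: torch.Tensor, ref_pts: torch.Tensor):
        """
        query:   (B, T, C)    per-character query vectors
        feat:    (B, C, H, W) 2D feature map
        ref_pts: (B, T, 2)    reference points on the text centerline, in [-1, 1]
        """
        B, T, _ = query.shape
        # Predicted offsets are added to centerline reference points, so the
        # local sampling stays anchored to the overall text shape.
        offsets = self.offset_proj(query).view(B, T, self.num_heads * self.num_points, 2)
        weights = self.weight_proj(query).view(B, T, self.num_heads * self.num_points)
        weights = weights.softmax(dim=-1)

        sample_pts = ref_pts.unsqueeze(2) + 0.1 * offsets.tanh()        # (B, T, HP, 2)
        sampled = F.grid_sample(feat, sample_pts, align_corners=False)  # (B, C, T, HP)
        sampled = sampled.permute(0, 2, 3, 1)                           # (B, T, HP, C)
        out = (weights.unsqueeze(-1) * sampled).sum(dim=2)              # (B, T, C)
        return self.out_proj(out)


def quadratic_centerline(coeffs: torch.Tensor, num_chars: int) -> torch.Tensor:
    """Place reference points on a quadratic centerline y = a*x^2 + b*x + c,
    with x spaced uniformly in [-1, 1]. coeffs: (B, 3) -> (B, num_chars, 2)."""
    x = torch.linspace(-1, 1, num_chars, device=coeffs.device)
    x = x.unsqueeze(0).expand(coeffs.size(0), -1)
    a, b, c = coeffs[:, 0:1], coeffs[:, 1:2], coeffs[:, 2:3]
    y = a * x ** 2 + b * x + c
    return torch.stack([x, y], dim=-1).clamp(-1, 1)


# Example: 8 character queries attending to a 64-channel feature map.
att = DeformableEnsembleAttention(dim=64)
q = torch.randn(2, 8, 64)
fm = torch.randn(2, 64, 16, 48)
ref = quadratic_centerline(torch.tensor([[0.3, 0.0, -0.2], [0.0, 0.1, 0.0]]), num_chars=8)
out = att(q, fm, ref)  # (2, 8, 64)
```

Anchoring the predicted offsets to centerline reference points is one way to combine local deformability with global shape regularity as the abstract describes; the paper's actual centerline model and geometry-based loss terms are not reproduced here.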

Author information

Corresponding author

Correspondence to Feng Su.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, S., Zhuang, Z., Li, M., Su, F. (2025). Arbitrary-Shaped Scene Text Recognition with Deformable Ensemble Attention. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15331. Springer, Cham. https://doi.org/10.1007/978-3-031-78119-3_17

  • DOI: https://doi.org/10.1007/978-3-031-78119-3_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78118-6

  • Online ISBN: 978-3-031-78119-3

  • eBook Packages: Computer Science, Computer Science (R0)
