Abstract
Although considerable progress has been made in recent years, recognizing irregular text in natural scenes remains a challenging problem due to distortion and background interference. Prior works use either a spatial transformer network (STN) or a 2D attention mechanism to improve recognition accuracy. However, STN-based methods are not robust because of their limited network capacity, while 2D attention-based methods are heavily disturbed by blur, distortion, and background clutter. In this paper, we propose a text recognition model, CarveNet, which consists of three substructures: a feature extractor, a feature filter, and a decoder. The feature extractor uses a Feature Pyramid Network (FPN) to aggregate multi-scale hierarchical feature maps and obtain a larger receptive field. The feature filter, composed of stacked residual channel attention blocks, then separates text features from background interference. Finally, the 2D self-attention-based decoder generates the text sequence from the output of the feature filter and the previously generated symbols. Extensive evaluation shows that CarveNet achieves state-of-the-art results on both regular and irregular scene text recognition benchmarks. Compared with previous work based on 2D self-attention, CarveNet improves accuracy by 2.3% and 4.6% on the irregular datasets SVTP and CT80, respectively.
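The feature filter's core idea is channel-wise attention: each channel of the feature map is re-weighted by a learned gate so that text-related channels are amplified and background channels are suppressed. The following is a minimal NumPy sketch of a squeeze-and-excitation-style channel attention step (in the spirit of the residual channel attention block of RCAN); the weight shapes, reduction ratio, and function name are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def channel_attention(feature_map, w1, w2):
    """Sketch of squeeze-and-excitation channel attention (assumed layout).

    feature_map: array of shape (C, H, W)
    w1: bottleneck weights, shape (C // r, C)  -- toy stand-ins for learned params
    w2: expansion weights, shape (C, C // r)
    """
    # Squeeze: global average pooling yields one descriptor per channel
    squeezed = feature_map.mean(axis=(1, 2))            # shape (C,)
    # Excitation: bottleneck MLP with ReLU, then a sigmoid gate in (0, 1)
    hidden = np.maximum(w1 @ squeezed, 0.0)             # shape (C // r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))         # shape (C,)
    # Scale each channel by its learned importance weight
    return feature_map * gate[:, None, None]
```

In a residual channel attention block, this gated output would be added back to the block's input (`x + channel_attention(body(x), ...)`), so the filter only needs to learn the residual correction that suppresses background channels.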
Cite this article
Wu, G., Zhang, Z. & Xiong, Y. CarveNet: a channel-wise attention-based network for irregular scene text recognition. IJDAR 25, 177–186 (2022). https://doi.org/10.1007/s10032-022-00398-4