Abstract
An autoregressive (AR) decoder for scene text recognition (STR) decodes a text image character by character, which requires many generation steps but yields high recognition accuracy. A non-autoregressive (NAR) decoder, in contrast, generates all characters in a single step but loses accuracy because, unlike an AR decoder, it assumes the predicted characters are conditionally independent. This paper presents a Partially Autoregressive Scene Text Recognition (PARSTR) method that unifies AR and NAR decoding within a single model. To reduce decoding steps while maintaining recognition accuracy, we devise two decoding strategies, b-first and b-ahead, which reduce the number of decoding steps to approximately b and by a factor of b, respectively. Experimental results demonstrate that PARSTR models using these strategies strike a balanced compromise between efficiency and recognition accuracy relative to fully AR and fully NAR decoding. Specifically, on public benchmark STR datasets, the b-first and b-ahead schemes reduce decoding to at most five steps and by a factor of five, respectively, while total word recognition accuracy drops by at most 0.5%.
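The step-count trade-off described above can be made concrete with a small sketch. This is only our reading of the abstract's claims (b-ahead emits roughly b characters per step, so a length-L word takes about ceil(L/b) steps; b-first caps decoding at roughly b steps), not the paper's exact algorithm:

```python
import math

def decoding_steps(length: int, b: int) -> dict:
    """Illustrative decoding-step counts for a word of `length` characters.

    Fully AR decodes one character per step; fully NAR uses a single step.
    The b-ahead and b-first formulas below are an assumption based on the
    abstract, not the paper's exact decoding procedure.
    """
    return {
        "AR": length,                        # one character per step
        "NAR": 1,                            # everything in one step
        "b_ahead": math.ceil(length / b),    # ~b characters per step
        "b_first": min(length, b),           # capped at ~b steps
    }

# A 12-character word with b = 5: AR needs 12 steps, b-ahead needs 3,
# b-first needs at most 5, NAR needs 1.
print(decoding_steps(12, 5))
```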









Acknowledgements
This work was supported by JSPS Kakenhi Grant Number 22H00540 and RUPP-OMU/HEIP.
Cite this article
Buoy, R., Iwamura, M., Srun, S. et al. Parstr: partially autoregressive scene text recognition. IJDAR 27, 303–316 (2024). https://doi.org/10.1007/s10032-024-00470-1