A Transformer-Based Decoupled Attention Network for Text Recognition in Shopping Receipt Images

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1449)

Abstract

Optical character recognition (OCR) of shopping receipts plays an important role in smart business and personal financial management. Current OCR systems still face many challenges when recognizing text in shopping receipts captured by mobile phones. This research constructs a multi-task model that integrates salient object detection as a branch task, which enables us to filter out irrelevant text instances by detecting the outline of a shopping receipt. Moreover, the developed model uses deformable convolution to learn visual information more effectively. To deal with attention drift in text recognition, we propose a transformer-based decoupled attention network, which decouples the attention and prediction processes of the attention mechanism. This mechanism not only improves prediction accuracy but also increases inference speed. Extensive experimental results on a large-scale real-life dataset demonstrate the effectiveness of our proposed method.
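
The decoupling idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes one common reading of "decoupled attention": the attention maps are computed from the visual features alone, for all output positions at once, and the prediction step only consumes the resulting glimpse vectors. The linear scoring, the random weights, and all function names below are hypothetical placeholders, not the transformer-based architecture described in the paper.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def alignment_maps(features, align_weights):
    # Attention step (assumed form): one map per output position, computed
    # from the visual features only, with no previous decoder state involved.
    maps = []
    for w in align_weights:
        scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in features]
        maps.append(softmax(scores))
    return maps

def glimpses(features, maps):
    # Each glimpse is the attention-weighted sum of the feature vectors.
    dim = len(features[0])
    return [[sum(a * f[d] for a, f in zip(m, features)) for d in range(dim)]
            for m in maps]

def predict(glimpse_vectors, class_weights):
    # Prediction step: classify every glimpse independently of the attention step.
    labels = []
    for g in glimpse_vectors:
        logits = [sum(wi * gi for wi, gi in zip(w, g)) for w in class_weights]
        labels.append(max(range(len(logits)), key=logits.__getitem__))
    return labels

# Toy data: 6 feature positions of dimension 4, 3 output characters, 5 classes.
random.seed(0)
features = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
align_weights = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
class_weights = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(5)]

maps = alignment_maps(features, align_weights)
glimpse_vectors = glimpses(features, maps)
labels = predict(glimpse_vectors, class_weights)
```

Because the maps do not depend on earlier predictions in this sketch, a wrong character at one position cannot shift the attention at later positions, and all positions can be processed in parallel. This is the intuition behind the accuracy and speed claims, under the assumptions stated above.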

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant no. 61972112 and no. 61832004, the Guangdong Basic and Applied Basic Research Foundation under Grant no. 2021B1515020088, and the HITSZ-J&A Joint Laboratory of Digital Design and Intelligent Fabrication under Grant no. HITSZ-J&A-2021A01.

Author information

Corresponding author

Correspondence to Haijun Zhang.

Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Ren, L., Zhou, H., Chen, J., Shao, L., Wu, Y., Zhang, H. (2021). A Transformer-Based Decoupled Attention Network for Text Recognition in Shopping Receipt Images. In: Zhang, H., Yang, Z., Zhang, Z., Wu, Z., Hao, T. (eds) Neural Computing for Advanced Applications. NCAA 2021. Communications in Computer and Information Science, vol 1449. Springer, Singapore. https://doi.org/10.1007/978-981-16-5188-5_40

  • DOI: https://doi.org/10.1007/978-981-16-5188-5_40

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-5187-8

  • Online ISBN: 978-981-16-5188-5

  • eBook Packages: Computer Science, Computer Science (R0)
