Improving Document Image Understanding with Reinforcement Finetuning

  • Conference paper
Neural Information Processing (ICONIP 2022)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1794)

Abstract

Successful Artificial Intelligence systems often require large amounts of labeled data to extract information from document images. In this paper, we investigate the problem of improving the performance of Artificial Intelligence systems in understanding document images, especially in cases where training data is limited. We address the problem by proposing a novel finetuning method using reinforcement learning. Our approach treats the Information Extraction model as a policy network and uses policy gradient training to update the model to maximize combined reward functions that complement the traditional cross-entropy losses. Our experiments on four datasets using labels and expert feedback demonstrate that our finetuning mechanism consistently improves the performance of a state-of-the-art information extractor, especially in the small training data regime.
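
The idea of treating an extractor as a policy and mixing a policy gradient term with cross-entropy can be illustrated with a small sketch. This is not the authors' implementation. It assumes a per-token softmax classifier as the policy, a plain REINFORCE estimator with a mean-reward baseline, a hypothetical reward equal to the fraction of sampled labels that match the gold labels, and a hypothetical mixing weight `rl_weight`. The function name `combined_grad` and all parameters are illustrative only, and the reward functions used in the paper may differ.

```python
import math
import random


def softmax(row):
    # Numerically stable softmax over one row of logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def combined_grad(logits, gold, rng, n_samples=8, rl_weight=0.5):
    """Gradient w.r.t. logits of cross-entropy + rl_weight * REINFORCE loss.

    logits: list of per-token score rows (the toy "policy").
    gold:   list of gold label indices, one per token.
    The reward (fraction of sampled labels equal to gold) is an
    assumption made for illustration.
    """
    n_tokens, n_labels = len(logits), len(logits[0])
    probs = [softmax(row) for row in logits]

    # Sample label sequences from the policy and score each of them.
    samples = [[rng.choices(range(n_labels), weights=p)[0] for p in probs]
               for _ in range(n_samples)]
    rewards = [sum(a == g for a, g in zip(s, gold)) / n_tokens
               for s in samples]
    baseline = sum(rewards) / n_samples  # mean-reward baseline

    grad = [[0.0] * n_labels for _ in range(n_tokens)]
    for t in range(n_tokens):
        for k in range(n_labels):
            # Cross-entropy term: (p - one_hot(gold)) averaged over tokens.
            ce = (probs[t][k] - (k == gold[t])) / n_tokens
            # REINFORCE term: gradient of -(R - b) * log p(sample),
            # averaged over the sampled sequences.
            pg = -sum((r - baseline) * ((s[t] == k) - probs[t][k])
                      for s, r in zip(samples, rewards)) / n_samples
            grad[t][k] = ce + rl_weight * pg
    return grad, baseline


# Toy usage: gradient descent directly on the logits of five tokens.
rng = random.Random(0)
gold = [0, 2, 1, 3, 2]
logits = [[0.0] * 4 for _ in gold]
for _ in range(200):
    grad, mean_reward = combined_grad(logits, gold, rng)
    logits = [[x - 1.0 * g for x, g in zip(row, grow)]
              for row, grow in zip(logits, grad)]
print("mean sampled reward:", mean_reward)
print("greedy labels:", [row.index(max(row)) for row in logits])
```

In a real extractor the gradient would flow into the network parameters rather than into free logits, and the reward could come from labels or from expert feedback as the abstract describes. The sketch only shows how the two gradient terms are combined.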

Notes

  1. Using Python slicing notation.

  2. https://huggingface.co/taprosoft/layoutxlm-no-visual


Author information

Corresponding author

Correspondence to Bao-Sinh Nguyen.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Nguyen, BS., Le, D.T., Vu, H.M., Nguyen, TA.D., Nguyen, MT., Le, H. (2023). Improving Document Image Understanding with Reinforcement Finetuning. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1794. Springer, Singapore. https://doi.org/10.1007/978-981-99-1648-1_5

  • DOI: https://doi.org/10.1007/978-981-99-1648-1_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-1647-4

  • Online ISBN: 978-981-99-1648-1

  • eBook Packages: Computer Science, Computer Science (R0)
