Improving Document Image Understanding with Reinforcement Finetuning

Nguyen, Bao-Sinh; Le, Dung Tien; Vu, Hieu M.; Nguyen, Tuan-Anh D.; Nguyen, Minh-Tien; Le, Hung

doi:10.1007/978-981-99-1648-1_5


Part of the book series: Communications in Computer and Information Science (CCIS, volume 1794)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

Successful artificial intelligence (AI) systems for extracting information from document images often require large amounts of labeled data. In this paper, we investigate how to improve the performance of AI systems in understanding document images, especially when training data is limited. We address this problem by proposing a novel finetuning method based on reinforcement learning. Our approach treats the information extraction (IE) model as a policy network and uses policy gradient training to update the model so that it maximizes combined reward functions that complement the traditional cross-entropy losses. Experiments on four datasets, using both labels and expert feedback, demonstrate that our finetuning mechanism consistently improves the performance of a state-of-the-art information extractor, especially in the small training data regime.
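The abstract frames the extractor as a policy and adds a policy-gradient term to the usual cross-entropy loss. The Python sketch below illustrates that combination on a toy problem; it is not the authors' implementation. Everything in it is an assumption for illustration: a linear token classifier stands in for the information extractor, the hypothetical `reward` function (token accuracy) stands in for the paper's combined reward functions, and a plain REINFORCE update with a mean-reward baseline stands in for whatever policy-gradient variant the paper actually uses. The names `finetune_step`, `reward`, `predict` and `rl_weight` are hypothetical.

```python
# Toy sketch (assumption): cross-entropy plus a REINFORCE term, not the paper's exact method.
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W, x):
    """Linear token classifier (hypothetical stand-in for the extractor): one weight vector per label."""
    return [sum(w_d * x_d for w_d, x_d in zip(w, x)) for w in W]


def reward(pred, gold):
    """Hypothetical reward: fraction of tokens whose sampled label matches the gold label."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def finetune_step(W, doc, gold, rng, lr=0.1, rl_weight=0.5, n_samples=4):
    """One gradient-descent step on CE - rl_weight * mean_s[(R_s - b) * log pi(a_s)]."""
    n_labels, dim = len(W), len(W[0])
    probs = [softmax(logits(W, x)) for x in doc]
    # Treat the classifier as a policy: sample one label per token, n_samples times.
    samples = [[rng.choices(range(n_labels), weights=p)[0] for p in probs]
               for _ in range(n_samples)]
    rewards = [reward(s, gold) for s in samples]
    baseline = sum(rewards) / n_samples  # mean sampled reward, used to reduce variance
    grad = [[0.0] * dim for _ in range(n_labels)]
    for t, (x, p) in enumerate(zip(doc, probs)):
        for k in range(n_labels):
            # Cross-entropy gradient w.r.t. logit k of token t: p_k - 1[k == y_t].
            g = p[k] - (1.0 if k == gold[t] else 0.0)
            # REINFORCE term, using d log pi(a_t) / d z_k = 1[k == a_t] - p_k.
            for s, r in zip(samples, rewards):
                g -= rl_weight * (r - baseline) * ((1.0 if k == s[t] else 0.0) - p[k]) / n_samples
            for d in range(dim):
                grad[k][d] += g * x[d]
    for k in range(n_labels):
        for d in range(dim):
            W[k][d] -= lr * grad[k][d]
    return baseline


def predict(W, doc):
    """Greedy decoding: highest-scoring label per token."""
    return [max(range(len(W)), key=lambda k: logits(W, x)[k]) for x in doc]


if __name__ == "__main__":
    rng = random.Random(0)
    doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-d features for 3 tokens
    gold = [0, 1, 1]                             # toy gold labels
    W = [[0.0, 0.0], [0.0, 0.0]]                 # 2 labels x 2 features
    for _ in range(100):
        avg_reward = finetune_step(W, doc, gold, rng)
    print(predict(W, doc), avg_reward)
```

Setting `rl_weight=0.0` recovers ordinary cross-entropy training, which makes it easy to compare the two objectives on the same data. The baseline subtracts the mean reward of the sampled label sequences, so only better-than-average samples have their probability pushed up.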

Notes

1. Using Python slicing notation.
2. https://huggingface.co/taprosoft/layoutxlm-no-visual.

Author information

Authors and Affiliations

Cinnamon AI, 10th floor, Geleximco building, 36 Hoang Cau, Dong Da, Hanoi, Vietnam
Bao-Sinh Nguyen, Dung Tien Le, Hieu M. Vu, Tuan-Anh D. Nguyen & Minh-Tien Nguyen
Deakin University, Geelong, Australia
Hung Le
Hung Yen University of Technology and Education, Hung Yen, Vietnam
Minh-Tien Nguyen

Corresponding author

Correspondence to Bao-Sinh Nguyen.

Editor information

Editors and Affiliations

Indian Institute of Technology Indore, Indore, India
Mohammad Tanveer
Indian Institute of Information Technology - Allahabad, Prayagraj, India
Sonali Agarwal
Kobe University, Kobe, Japan
Seiichi Ozawa
Indian Institute of Technology Patna, Patna, India
Asif Ekbal
University of Innsbruck, Innsbruck, Austria
Adam Jatowt

About this paper

Cite this paper

Nguyen, BS., Le, D.T., Vu, H.M., Nguyen, TA.D., Nguyen, MT., Le, H. (2023). Improving Document Image Understanding with Reinforcement Finetuning. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1794. Springer, Singapore. https://doi.org/10.1007/978-981-99-1648-1_5

DOI: https://doi.org/10.1007/978-981-99-1648-1_5
Published: 15 April 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-1647-4
Online ISBN: 978-981-99-1648-1
eBook Packages: Computer Science, Computer Science (R0)