Abstract
Successful Artificial Intelligence systems often require large amounts of labeled data to extract information from document images. In this paper, we investigate how to improve the performance of such systems in understanding document images, especially when training data is limited. We address the problem by proposing a novel finetuning method based on reinforcement learning. Our approach treats the Information Extraction model as a policy network and uses policy gradient training to update the model so that it maximizes combined reward functions, which complement the traditional cross-entropy losses. Our experiments on four datasets, using labels and expert feedback, demonstrate that our finetuning mechanism consistently improves the performance of a state-of-the-art information extractor, especially in the small training data regime.
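The idea of combining a cross-entropy loss with a policy-gradient term can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain REINFORCE-style estimator, a per-token softmax policy over labels, and a toy reward. The names combined_loss, token_accuracy, rl_weight, and baseline are hypothetical and chosen only for this example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over one token's label scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_accuracy(pred, gold):
    """Toy reward (an assumption): fraction of tokens labeled correctly."""
    return sum(int(p == g) for p, g in zip(pred, gold)) / len(gold)


def combined_loss(token_logits, gold_labels, reward_fn,
                  rl_weight=0.5, baseline=0.0, rng=random):
    """Cross-entropy plus a REINFORCE-style policy term for one document.

    The extractor's label distribution is treated as the policy: labels are
    sampled from it, scored by reward_fn, and the sampled log-probability is
    weighted by (reward - baseline).
    """
    probs = [softmax(scores) for scores in token_logits]

    # Supervised term: mean negative log-likelihood of the gold labels.
    ce = -sum(math.log(p[y]) for p, y in zip(probs, gold_labels)) / len(gold_labels)

    # Policy term: sample one label per token from the model's own distribution.
    sampled = [rng.choices(range(len(p)), weights=p)[0] for p in probs]
    log_prob = sum(math.log(p[a]) for p, a in zip(probs, sampled))
    reward = reward_fn(sampled, gold_labels)
    pg = -(reward - baseline) * log_prob

    return ce + rl_weight * pg, sampled, reward
```

Minimizing the second term raises the probability of sampled label sequences whose reward exceeds the baseline and lowers it otherwise. Setting rl_weight to 0 recovers ordinary supervised finetuning. In a real system the logits would come from the extractor and gradients would be computed by an autodiff framework.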
Notes
- 1. Using Python slicing notation.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Nguyen, BS., Le, D.T., Vu, H.M., Nguyen, TA.D., Nguyen, MT., Le, H. (2023). Improving Document Image Understanding with Reinforcement Finetuning. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1794. Springer, Singapore. https://doi.org/10.1007/978-981-99-1648-1_5
DOI: https://doi.org/10.1007/978-981-99-1648-1_5
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-1647-4
Online ISBN: 978-981-99-1648-1
eBook Packages: Computer Science, Computer Science (R0)