Instruction Makes a Difference

  • Conference paper
Document Analysis Systems (DAS 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14994)

Abstract

We introduce the Instruction Document Visual Question Answering (iDocVQA) dataset, for training Language-Vision (LV) models on document analysis, and the Large Language Document (LLaDoc) model, for making predictions on document images. Deep neural networks for the DocVQA task are usually trained on datasets that lack instructions. We show that using instruction-following datasets improves performance. We compare performance across document-related datasets using the recent state-of-the-art (SotA) Large Language and Vision Assistant (LLaVA) 1.5 as the base model. We also evaluate the derived models for object hallucination using the Polling-based Object Probing Evaluation (POPE) dataset. The results show that instruction tuning reaches 11x to 32x the zero-shot performance and improves on non-instruction (traditional task) fine-tuning by 0.1% to 4.2%. Despite the gains, these results still fall short of human performance (94.36%), implying there is considerable room for improvement.
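
As a concrete illustration of the difference, the sketch below converts a conventional DocVQA-style record (image, question, answer) into an instruction-following sample in the LLaVA-style "conversations" format used for visual instruction tuning. This is a minimal sketch under stated assumptions: the field names, file path, and instruction template are hypothetical, not the published iDocVQA schema (see the repository in the notes below).

    # Hypothetical sketch: wrapping a plain DocVQA-style record in an
    # instruction-following prompt, in the LLaVA-style "conversations" format.
    # Field names and the instruction template are illustrative assumptions,
    # not the published iDocVQA schema.

    def to_instruction_sample(record: dict) -> dict:
        """Turn an (image, question, answer) triple into an instruction sample."""
        instruction = (
            "You are given an image of a document. "
            "Answer the question using only information visible in the image.\n"
            f"Question: {record['question']}"
        )
        return {
            "image": record["image_path"],  # path to the document image
            "conversations": [
                {"from": "human", "value": f"<image>\n{instruction}"},
                {"from": "gpt", "value": record["answer"]},
            ],
        }

    if __name__ == "__main__":
        sample = {
            "image_path": "docs/invoice_0001.png",   # hypothetical example
            "question": "What is the invoice date?",
            "answer": "12 March 2019",
        }
        print(to_instruction_sample(sample))

Dropping the instruction preamble and keeping only the bare question recovers the non-instruction (traditional task) fine-tuning setup that the abstract uses as its point of comparison.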

Notes

  1. github.com/LTU-Machine-Learning/iDocVQA.

  2. huggingface.co/tosin/LLaDoc (see the loading sketch after this list).

  3. Due to resource constraints.

  4. platform.openai.com/docs/guides/prompt-engineering.
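
For completeness, here is a minimal sketch of querying the released model from note 2 with Hugging Face transformers. It assumes, without confirmation from this page, that tosin/LLaDoc ships in the transformers-compatible LLaVA-1.5 layout; the image URL and question are placeholders.

    # Minimal sketch: querying a LLaVA-1.5-style checkpoint with Hugging Face
    # transformers. That tosin/LLaDoc (note 2) follows this layout is an
    # unverified assumption; adjust the classes if the repo ships differently.
    import requests
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "tosin/LLaDoc"  # repository from note 2
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

    # Document-VQA style query in the LLaVA-1.5 chat template.
    prompt = "USER: <image>\nWhat is the total amount on this receipt? ASSISTANT:"
    image = Image.open(requests.get("https://example.com/receipt.png", stream=True).raw)

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    print(processor.decode(output_ids[0], skip_special_tokens=True))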

Acknowledgment

This work is partly supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation, with counterpart funding from Luleå University of Technology (LTU).

Author information

Corresponding author

Correspondence to Tosin Adewumi.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Adewumi, T., Habib, N., Alkhaled, L., Barney, E. (2024). Instruction Makes a Difference. In: Sfikas, G., Retsinas, G. (eds) Document Analysis Systems. DAS 2024. Lecture Notes in Computer Science, vol 14994. Springer, Cham. https://doi.org/10.1007/978-3-031-70442-0_5

  • DOI: https://doi.org/10.1007/978-3-031-70442-0_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70441-3

  • Online ISBN: 978-3-031-70442-0

  • eBook Packages: Computer Science (R0)
