Instruction Makes a Difference

  • Conference paper
Document Analysis Systems (DAS 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14994)

Abstract

We introduce the Instruction Document Visual Question Answering (iDocVQA) dataset, for training Language-Vision (LV) models on document analysis, and the Large Language Document (LLaDoc) model, for making predictions on document images. Deep neural networks for the DocVQA task are usually trained on datasets that lack instructions. We show that using instruction-following datasets improves performance. We compare performance across document-related datasets using the recent state-of-the-art (SotA) Large Language and Vision Assistant (LLaVA) 1.5 as the base model. We also evaluate the derived models for object hallucination using the Polling-based Object Probing Evaluation (POPE) dataset. The results show that instruction tuning reaches 11x to 32x the zero-shot performance and improves on non-instruction (traditional task) fine-tuning by 0.1% to 4.2%. Despite the gains, these results still fall short of human performance (94.36%), implying there is considerable room for improvement.
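
As a concrete illustration of the difference, the sketch below converts a conventional DocVQA-style record (image, question, answer) into an instruction-following sample in the LLaVA-style "conversations" format used for visual instruction tuning. This is a minimal sketch under stated assumptions: the field names, file path, and instruction template are hypothetical, not the published iDocVQA schema (see the repository in the notes below).

    # Hypothetical sketch: wrapping a plain DocVQA-style record in an
    # instruction-following prompt, in the LLaVA-style "conversations" format.
    # Field names and the instruction template are illustrative assumptions,
    # not the published iDocVQA schema.

    def to_instruction_sample(record: dict) -> dict:
        """Turn an (image, question, answer) triple into an instruction sample."""
        instruction = (
            "You are given an image of a document. "
            "Answer the question using only information visible in the image.\n"
            f"Question: {record['question']}"
        )
        return {
            "image": record["image_path"],  # path to the document image
            "conversations": [
                {"from": "human", "value": f"<image>\n{instruction}"},
                {"from": "gpt", "value": record["answer"]},
            ],
        }

    if __name__ == "__main__":
        sample = {
            "image_path": "docs/invoice_0001.png",   # hypothetical example
            "question": "What is the invoice date?",
            "answer": "12 March 2019",
        }
        print(to_instruction_sample(sample))

Dropping the instruction preamble and keeping only the bare question recovers the non-instruction (traditional task) fine-tuning setup that the abstract uses as its point of comparison.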

Notes

  1. github.com/LTU-Machine-Learning/iDocVQA.

  2. huggingface.co/tosin/LLaDoc (see the loading sketch after this list).

  3. Due to resource constraints.

  4. platform.openai.com/docs/guides/prompt-engineering.
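
For completeness, here is a minimal sketch of querying the released model from note 2 with Hugging Face transformers. It assumes, without confirmation from this page, that tosin/LLaDoc ships in the transformers-compatible LLaVA-1.5 layout; the image URL and question are placeholders.

    # Minimal sketch: querying a LLaVA-1.5-style checkpoint with Hugging Face
    # transformers. That tosin/LLaDoc (note 2) follows this layout is an
    # unverified assumption; adjust the classes if the repo ships differently.
    import requests
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "tosin/LLaDoc"  # repository from note 2
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

    # Document-VQA style query in the LLaVA-1.5 chat template.
    prompt = "USER: <image>\nWhat is the total amount on this receipt? ASSISTANT:"
    image = Image.open(requests.get("https://example.com/receipt.png", stream=True).raw)

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    print(processor.decode(output_ids[0], skip_special_tokens=True))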

Acknowledgment

This work is partly supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation, with counterpart funding from Luleå University of Technology (LTU).

Author information

Corresponding author

Correspondence to Tosin Adewumi.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Adewumi, T., Habib, N., Alkhaled, L., Barney, E. (2024). Instruction Makes a Difference. In: Sfikas, G., Retsinas, G. (eds) Document Analysis Systems. DAS 2024. Lecture Notes in Computer Science, vol 14994. Springer, Cham. https://doi.org/10.1007/978-3-031-70442-0_5

  • DOI: https://doi.org/10.1007/978-3-031-70442-0_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70441-3

  • Online ISBN: 978-3-031-70442-0

  • eBook Packages: Computer Science (R0)
