Learning to Answer Complex Visual Questions from Multi-View Analysis

Zhu, Minjun; Weng, Yixuan; He, Shizhu; Liu, Kang; Zhao, Jun

doi:10.1007/978-981-19-8300-9_17

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1711)

Included in the following conference series:

China Conference on Knowledge Graph and Semantic Computing

Abstract

Visual Question Answering (VQA) has received increasing attention in NLP research. Most VQA images depict natural scenes. However, some images widely used in textbooks, such as diagrams, often contain complicated and abstract information (e.g., constructed graphs with logic and concepts). Diagram Question Answering (DQA) is therefore a challenging but significant task, and it also helps machines understand human cognitive behaviors and learning habits. For the DQA task, we propose a visual question-answering method based on multi-perspective understanding, which constructs a variety of self-monitoring tasks in the form of prompts to help the model learn deeper information. We also propose, for the first time, a decoding method called “Cross Entropy constraint Decoding”, which effectively constrains the generated text when performing multiple-selection tasks. This method achieved state-of-the-art (SOTA) results in the CCKS-2022 evaluation task, demonstrating its effectiveness.
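
The abstract does not describe how “Cross Entropy constraint Decoding” works internally. As a rough illustration only, one plausible reading is that the answer is restricted to the given options, each scored by the cross-entropy of its tokens under the model. The sketch below assumes that reading; the function names and the per-token log-probabilities are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def cross_entropy(token_log_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of one candidate's tokens."""
    if not token_log_probs:
        raise ValueError("candidate has no tokens")
    return -sum(token_log_probs) / len(token_log_probs)


def constrained_choice(option_log_probs: Dict[str, List[float]]) -> str:
    """Return the option label whose tokens have the lowest cross-entropy.

    Assumption: the decoder may only output one of the listed options,
    so free-form text outside the option set is never produced.
    """
    return min(
        option_log_probs,
        key=lambda label: cross_entropy(option_log_probs[label]),
    )


# Hypothetical per-token log-probabilities that a model might assign
# to each answer option for a single diagram question.
scores = {
    "A": [math.log(0.5), math.log(0.4)],
    "B": [math.log(0.9), math.log(0.8)],
    "C": [math.log(0.2)],
}
best = constrained_choice(scores)
print(best)
```

Averaging over tokens, rather than summing, keeps longer options from being penalized only for their length; whether the paper normalizes this way is not stated in the abstract.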

M. Zhu and Y. Weng contributed equally to this work.

Author information

Authors and Affiliations

National Laboratory of Pattern Recognition, Institute of Automation, CAS, Beijing, China
Minjun Zhu, Yixuan Weng, Shizhu He, Kang Liu & Jun Zhao
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Minjun Zhu, Shizhu He, Kang Liu & Jun Zhao

Corresponding author

Correspondence to Shizhu He.

Editor information

Editors and Affiliations

Zhejiang University, Hangzhou, China
Ningyu Zhang
Southeast University, Nanjing, China
Meng Wang
Southeast University, Nanjing, China
Tianxing Wu
Nanjing University, Nanjing, China
Wei Hu
National University of Singapore, Singapore, Singapore
Shumin Deng

About this paper

Cite this paper

Zhu, M., Weng, Y., He, S., Liu, K., Zhao, J. (2022). Learning to Answer Complex Visual Questions from Multi-View Analysis. In: Zhang, N., Wang, M., Wu, T., Hu, W., Deng, S. (eds) CCKS 2022 - Evaluation Track. CCKS 2022. Communications in Computer and Information Science, vol 1711. Springer, Singapore. https://doi.org/10.1007/978-981-19-8300-9_17

DOI: https://doi.org/10.1007/978-981-19-8300-9_17
Published: 02 December 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8299-6
Online ISBN: 978-981-19-8300-9
eBook Packages: Computer Science, Computer Science (R0)
