Abstract
The predominant approach to visual question answering (VQA) encodes the image and question with a "black box" neural encoder and decodes a single token as the answer, such as "yes" or "no". Despite its strong quantitative results, this approach struggles to produce human-readable justifications for its predictions. To address this shortcoming, we propose LRRA (Look, Read, Reasoning, Answer), a transparent neural-symbolic framework for visual question answering that solves complicated real-world problems step by step, as humans do, and provides a human-readable justification at each step. Specifically, LRRA first learns to convert an image into a scene graph and to parse a question into a sequence of reasoning instructions. It then executes the reasoning instructions one at a time by traversing the scene graph with a recurrent neural-symbolic execution module. Finally, it generates an answer to the given question and makes corresponding marks on the image. Furthermore, we believe that the relations between objects mentioned in the question are of great significance for obtaining the correct answer, so we create a perturbed GQA test set by removing linguistic cues (attributes and relations) from the questions to analyze which parts of a question contribute most to the answer. Our experiments on the GQA dataset show that LRRA is significantly better than a representative existing model (57.12% vs. 56.39%). Our experiments on the perturbed GQA test set show that the relations between objects are more important for answering complicated questions than the attributes of objects.
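The step-by-step execution described in the abstract can be sketched in miniature. The following is an illustrative toy, not the authors' implementation: the scene graph, instruction set (`select`, `relate`, `query`), and data layout are all simplifying assumptions, chosen only to show how a program of reasoning instructions can traverse a symbolic scene graph one step at a time, with every intermediate state inspectable.

```python
# Toy symbolic executor over a scene graph, in the spirit of LRRA's
# one-instruction-at-a-time reasoning. Scene graph: object id -> name,
# attributes, and outgoing relation edges.
SCENE_GRAPH = {
    "dog1":  {"name": "dog",  "attrs": {"color": "brown"},
              "rels": {"left of": ["ball1"]}},
    "ball1": {"name": "ball", "attrs": {"color": "red"}, "rels": {}},
}

def execute(instructions, graph):
    """Run a list of (op, arg) instructions; each step narrows the
    currently selected set of objects, so the trace is human-readable."""
    selected = set(graph)               # start from all objects
    for op, arg in instructions:
        if op == "select":              # keep objects whose name matches
            selected = {o for o in selected if graph[o]["name"] == arg}
        elif op == "relate":            # follow the named relation edge
            selected = {t for o in selected
                        for t in graph[o]["rels"].get(arg, [])}
        elif op == "query":             # read an attribute off survivors
            return [graph[o]["attrs"].get(arg) for o in sorted(selected)]
    return sorted(selected)

# "What color is the thing the dog is to the left of?"
program = [("select", "dog"), ("relate", "left of"), ("query", "color")]
print(execute(program, SCENE_GRAPH))  # -> ['red']
```

Because each instruction maps to one graph operation, the intermediate selected sets form exactly the kind of per-step justification the abstract argues black-box decoders cannot provide.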
Acknowledgements
The research work described in this paper has been supported by the National Natural Science Foundation of China (Contracts 61876198, 61976015, and 61976016). The authors would like to thank the anonymous reviewers for their valuable comments and suggestions for improving this paper.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Wan, Z., Chen, K., Zhang, Y., Xu, J., Chen, Y. (2021). LRRA: A Transparent Neural-Symbolic Reasoning Framework for Real-World Visual Question Answering. In: Li, S., et al. (eds.) Chinese Computational Linguistics. CCL 2021. Lecture Notes in Computer Science, vol. 12869. Springer, Cham. https://doi.org/10.1007/978-3-030-84186-7_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-84185-0
Online ISBN: 978-3-030-84186-7
eBook Packages: Computer Science (R0)