
Efficient and self-adaptive rationale knowledge base for visual commonsense reasoning

  • Special Issue Paper
Multimedia Systems

Abstract

The visual commonsense reasoning (VCR) task demands cognition-level understanding that spans the visual and linguistic domains. Its three sub-tasks, \(Q \rightarrow A\), \(QA \rightarrow R\), and \(Q \rightarrow AR\), require a model to predict the correct answer and the rationale that explains it, given an image and a question. Unlike other visual reasoning tasks such as VQA and GQA, VCR focuses on the facts that clarify the causes, context, and consequences of the image and question, which amounts to acquiring knowledge and reaching a thorough understanding. In this paper, we propose a rationale knowledge base (RKB) that incorporates a convolution fusion mechanism to import VCR-related knowledge. We emphasize that (1) the RKB is extracted from and then trained on the VCR dataset (VCR-set) itself, and (2) the convolution fusion mechanism is designed to be self-adaptive and computationally efficient. Experiments on the large-scale VCR-set demonstrate the effectiveness of the proposed method on all three sub-tasks.
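
As a rough illustration of how the three sub-tasks relate at evaluation time, the sketch below assumes the usual convention of the VCR benchmark: a \(Q \rightarrow AR\) prediction counts as correct only when both the chosen answer and the chosen rationale are correct. The function name `vcr_accuracies` and the toy index lists are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def vcr_accuracies(
    answer_pred: Sequence[int],
    answer_gold: Sequence[int],
    rationale_pred: Sequence[int],
    rationale_gold: Sequence[int],
) -> Dict[str, float]:
    """Compute accuracy for the three VCR sub-tasks.

    Each sequence holds one choice index per example. The rationale
    predictions are assumed to come from the QA -> R setting, where the
    model is given the correct answer.
    """
    n = len(answer_gold)
    if n == 0:
        raise ValueError("at least one example is required")
    if not (len(answer_pred) == len(rationale_pred) == len(rationale_gold) == n):
        raise ValueError("all inputs must have the same length")

    # Per-example correctness for Q -> A and QA -> R.
    answer_hits = [p == g for p, g in zip(answer_pred, answer_gold)]
    rationale_hits = [p == g for p, g in zip(rationale_pred, rationale_gold)]

    # Q -> AR (assumed convention): both choices must be correct.
    joint_hits = [a and r for a, r in zip(answer_hits, rationale_hits)]

    return {
        "Q->A": sum(answer_hits) / n,
        "QA->R": sum(rationale_hits) / n,
        "Q->AR": sum(joint_hits) / n,
    }


if __name__ == "__main__":
    # Toy predictions for four examples (made-up numbers, for illustration only).
    scores = vcr_accuracies(
        answer_pred=[0, 1, 2, 3],
        answer_gold=[0, 1, 2, 0],
        rationale_pred=[1, 1, 0, 2],
        rationale_gold=[1, 0, 0, 2],
    )
    print(scores)
```

Because the joint score requires both choices to be right on the same example, it can never exceed the smaller of the two individual accuracies.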

Acknowledgements

This work was partially supported by the NSFC under Grant Nos. 62172138 and 61932009. This work was also supported by the Fundamental Research Funds for the Central Universities under Grant JZ2021 HGTB0082.

Author information

Corresponding author

Correspondence to Zhenzhen Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Song, Z., Hu, Z. & Hong, R. Efficient and self-adaptive rationale knowledge base for visual commonsense reasoning. Multimedia Systems 29, 3017–3026 (2023). https://doi.org/10.1007/s00530-021-00867-6

  • DOI: https://doi.org/10.1007/s00530-021-00867-6