A Neuro-Symbolic AI System for Visual Question Answering in Pedestrian Video Sequences

  • Conference paper
Hybrid Artificial Intelligent Systems (HAIS 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13469)

Abstract

With the rapid growth in the amount of video data, efficient object recognition is essential for any system that performs question answering automatically. In particular, real-world video environments contain numerous types of objects with complex relationships, so they require extensive knowledge representation and inference algorithms that operate on the properties and relations of objects. In this paper, we propose a hybrid neuro-symbolic AI system that operates on scene graphs of real-world video data. The method combines neural networks, which generate scene graphs that account for the relationships between objects on real roads, with symbol-based inference algorithms that answer questions. We define the object properties, relationships, and question coverage needed to represent real-world objects in pedestrian video, and we traverse the scene graph to perform complex visual question answering. We demonstrate the effectiveness of the proposed method by confirming that it answers five types of questions in a pedestrian video environment with 99.71% accuracy.
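
The symbolic half of such a pipeline can be illustrated with a minimal Python sketch. It assumes the neural stage has already produced a scene graph, and answers a question by running a short program of operations over that graph. The `SceneGraph` class, the operation names (`filter`, `relate`, `count`, `exist`, `query`), and the example objects and relations are all hypothetical, chosen for illustration only; the abstract does not specify the paper's actual object properties, relationships, or five question types.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # Hypothetical structure: node id -> attribute dict.
    nodes: dict = field(default_factory=dict)
    # Directed edges stored as (subject id, predicate, object id) triples.
    edges: list = field(default_factory=list)

    def filter(self, ids, key, value):
        """Keep only the nodes whose attribute `key` equals `value`."""
        return [i for i in ids if self.nodes[i].get(key) == value]

    def relate(self, ids, predicate):
        """Follow `predicate` edges from the given subjects to their objects."""
        sources = set(ids)
        return [o for s, p, o in self.edges if s in sources and p == predicate]


def run_program(graph, program):
    """Execute a list of (operation, *args) steps over the scene graph."""
    current = list(graph.nodes)  # start from every object in the scene
    for op, *args in program:
        if op == "filter":
            current = graph.filter(current, *args)
        elif op == "relate":
            current = graph.relate(current, *args)
        elif op == "count":
            return len(current)
        elif op == "exist":
            return len(current) > 0
        elif op == "query":
            return [graph.nodes[i].get(args[0]) for i in current]
        else:
            raise ValueError(f"unknown operation: {op}")
    return current


# Illustrative scene; the objects and relations are made up for this sketch.
graph = SceneGraph(
    nodes={
        "p1": {"category": "pedestrian", "color": "red"},
        "p2": {"category": "pedestrian", "color": "blue"},
        "c1": {"category": "car", "color": "white"},
        "x1": {"category": "crosswalk"},
    },
    edges=[("p1", "on", "x1"), ("p2", "next_to", "c1")],
)

# "How many pedestrians are there?"
how_many = run_program(graph, [("filter", "category", "pedestrian"), ("count",)])
print(how_many)  # 2

# "What color is the object next to the blue pedestrian?"
what_color = run_program(graph, [
    ("filter", "category", "pedestrian"),
    ("filter", "color", "blue"),
    ("relate", "next_to"),
    ("query", "color"),
])
print(what_color)  # ['white']
```

Because every step is an explicit operation on nodes and edges, the chain of intermediate results can be inspected to see how an answer was reached.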

Acknowledgment

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-01361, Artificial Intelligence Graduate School Program (Yonsei University); No. 2021-0-02068, Artificial Intelligence Innovation Hub).

Author information

Corresponding author

Correspondence to Sung-Bae Cho.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Park, J., Bu, SJ., Cho, SB. (2022). A Neuro-Symbolic AI System for Visual Question Answering in Pedestrian Video Sequences. In: García Bringas, P., et al. Hybrid Artificial Intelligent Systems. HAIS 2022. Lecture Notes in Computer Science, vol 13469. Springer, Cham. https://doi.org/10.1007/978-3-031-15471-3_38

  • DOI: https://doi.org/10.1007/978-3-031-15471-3_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15470-6

  • Online ISBN: 978-3-031-15471-3

  • eBook Packages: Computer Science, Computer Science (R0)
