Artificial Visual Intelligence

Perceptual Commonsense for Human-Centred Cognitive Technologies

Chapter in: Human-Centered Artificial Intelligence (ACAI 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13500)

Abstract

We address computational cognitive vision and perception at the interface of language, logic, cognition, and artificial intelligence. The chapter presents general methods for the processing and semantic interpretation of dynamic visuospatial imagery, with a particular emphasis on the ability to abstract, learn, and reason with cognitively rooted, structured characterisations of commonsense knowledge pertaining to space and motion. The presented work constitutes a systematic model and methodology integrating diverse, multi-faceted AI methods pertaining to Knowledge Representation and Reasoning, Computer Vision, and Machine Learning towards realising practical, human-centred artificial visual intelligence.


Notes

  1. Multi-domain refers to more than one aspect of space, e.g., topology, orientation, direction, distance, and shape; this requires a mixed-domain ontology involving points, line segments, polygons, and regions of space, time, and space-time [21, 35, 48]. A minimal illustrative sketch follows these notes.

  2. Select publications relevant to the chosen examples include: visuospatial question-answering [37, 39, 40, 41], visuospatial abduction [43, 45, 47, 49], and integration of learning and reasoning [42, 46].

  3. A summary is available in [10].

  4. Select readings are indicated in Appendix A.
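To make the notion of "multi-domain" in Note 1 concrete, the following is a minimal sketch (not from the chapter) of spatial characterisation over a mixed ontology of points, line segments, and polygons, combining relations from several spatial domains (topology, direction, distance). It assumes the shapely geometry library; all function names are illustrative.

```python
# Minimal sketch: multi-domain qualitative spatial characterisation over a
# mixed ontology (points, line segments, polygons). Illustrative only.
import math
from shapely.geometry import Point, LineString, Polygon

def topology(a, b):
    """Coarse, RCC-style topological relation between two regions."""
    if not a.intersects(b):
        return "disconnected"
    if a.touches(b):
        return "externally_connected"
    if a.within(b) or b.within(a):
        return "part_of"
    return "partially_overlapping"

def orientation(a, b):
    """Qualitative direction from a's centroid to b's centroid."""
    dx = b.centroid.x - a.centroid.x
    dy = b.centroid.y - a.centroid.y
    angle = math.degrees(math.atan2(dy, dx)) % 360
    sectors = ["east", "north-east", "north", "north-west",
               "west", "south-west", "south", "south-east"]
    return sectors[int(((angle + 22.5) % 360) // 45)]

# A mixed-domain scene: a point, a line segment, and two polygonal regions.
person = Point(1.0, 1.0)
path   = LineString([(0, 0.5), (6, 0.5)])
room_a = Polygon([(0, 0), (3, 0), (3, 3), (0, 3)])
room_b = Polygon([(3, 0), (6, 0), (6, 3), (3, 3)])

print(topology(room_a, room_b))     # externally_connected (shared wall)
print(orientation(room_a, room_b))  # east (direction domain)
print(room_a.contains(person))      # True (topology over point/region)
print(path.crosses(room_b))         # True (motion path crossing a region)
print(person.distance(room_b))      # 2.0 (metric distance, another domain)
```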

References

  1. Baltrusaitis, T., Zadeh, A., Lim, Y.C., Morency, L.: OpenFace 2.0: facial behavior analysis toolkit. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 59–66, May 2018. https://doi.org/10.1109/FG.2018.00019

  2. Bergmann, P., Meinhardt, T., Leal-Taixé, L.: Tracking without bells and whistles. In: The IEEE International Conference on Computer Vision (ICCV), October 2019

  3. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468 (2016). https://doi.org/10.1109/ICIP.2016.7533003

  4. Bhatt, M.: Reasoning about space, actions and change: a paradigm for applications of spatial reasoning. In: Qualitative Spatial Representation and Reasoning: Trends and Future Directions. IGI Global, USA (2012)

  5. Bhatt, M., Guesgen, H.W., Wölfl, S., Hazarika, S.M.: Qualitative spatial and temporal reasoning: emerging applications, trends, and directions. Spatial Cogn. Comput. 11(1), 1–14 (2011). https://doi.org/10.1080/13875868.2010.548568

  6. Bhatt, M., Kersting, K.: Semantic interpretation of multi-modal human-behaviour data - making sense of events, activities, processes. KI/Artif. Intell. 31(4), 317–320 (2017)

  7. Bhatt, M., Lee, J.H., Schultz, C.: CLP(QS): a declarative spatial reasoning framework. In: Egenhofer, M., Giudice, N., Moratz, R., Worboys, M. (eds.) COSIT 2011. LNCS, vol. 6899, pp. 210–230. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23196-4_12

  8. Bhatt, M., Loke, S.W.: Modelling dynamic spatial systems in the situation calculus. Spatial Cogn. Comput. 8(1–2), 86–130 (2008). https://doi.org/10.1080/13875860801926884

  9. Bhatt, M., Schultz, C., Freksa, C.: The ‘space’ in spatial assistance systems: conception, formalisation and computation. In: Tenbrink, T., Wiener, J., Claramunt, C. (eds.) Representing Space in Cognition: Interrelations of Behavior, Language, and Formal Models. Series: Explorations in Language and Space. Oxford University Press (2013). ISBN 978-0-19-967991-1

  10. Bhatt, M., Suchan, J.: Cognitive vision and perception. In: Giacomo, G.D., Catalá, A., Dilkina, B., Milano, M., Barro, S., Bugarín, A., Lang, J. (eds.) 24th European Conference on Artificial Intelligence, ECAI 2020, Santiago de Compostela, Spain, 29 August–8 September 2020 - Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020). Frontiers in Artificial Intelligence and Applications, vol. 325, pp. 2881–2882. IOS Press (2020). https://doi.org/10.3233/FAIA200434

  11. Bochkovskiy, A., Wang, C., Liao, H.M.: YOLOv4: optimal speed and accuracy of object detection. CoRR abs/2004.10934 (2020). https://arxiv.org/abs/2004.10934

  12. Brewka, G., Eiter, T., Truszczyński, M.: Answer set programming at a glance. Commun. ACM 54(12), 92–103 (2011). https://doi.org/10.1145/2043174.2043195

  13. Cao, Z., Hidalgo Martinez, G., Simon, T., Wei, S., Sheikh, Y.A.: OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 43, 172–186 (2019)

  14. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv:1802.02611 (2018)

  15. Davis, E.: Pouring liquids: a study in commonsense physical reasoning. Artif. Intell. 172(12–13), 1540–1578 (2008)

  16. Davis, E.: How does a box work? A study in the qualitative dynamics of solid objects. Artif. Intell. 175(1), 299–345 (2011)

  17. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR 2009 (2009)

  18. Deng, J., Guo, J., Ververas, E., Kotsia, I., Zafeiriou, S.: RetinaFace: single-shot multi-level face localisation in the wild. In: CVPR (2020)

  19. Dubba, K.S.R., Cohn, A.G., Hogg, D.C., Bhatt, M., Dylla, F.: Learning relational event models from video. J. Artif. Intell. Res. (JAIR) 53, 41–90 (2015). https://doi.org/10.1613/jair.4395

  20. Hampe, B., Grady, J.E.: From Perception to Meaning. De Gruyter Mouton, Berlin (2008). https://www.degruyter.com/view/title/17429

  21. Hazarika, S.M.: Qualitative spatial change : space-time histories and continuity. Ph.D. thesis, The University of Leeds, School of Computing (2005). Supervisor - Anthony Cohn

  22. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 42(02), 386–397 (2020). https://doi.org/10.1109/TPAMI.2018.2844175

  23. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016, pp. 770–778. IEEE Computer Society (2016). https://doi.org/10.1109/CVPR.2016.90

  24. Hu, P., Ramanan, D.: Finding tiny faces. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017

  25. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: evolution of optical flow estimation with deep networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017). http://lmb.informatik.uni-freiburg.de/Publications/2017/IMSKDB17

  26. Jaffar, J., Maher, M.J.: Constraint logic programming: a survey. J. Logic Program. 19, 503–581 (1994)

  27. Kowalski, R., Sergot, M.: A logic-based calculus of events. In: Schmidt, J.W., Thanos, C. (eds.) Foundations of Knowledge Base Management, pp. 23–51. Springer, Heidelberg (1989). https://doi.org/10.1007/978-3-642-83397-7_2

  28. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Bartlett, P.L., Pereira, F.C.N., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a Meeting Held at Lake Tahoe, Nevada, United States, 3–6 December 2012, pp. 1106–1114 (2012). https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html

  29. Mani, I., Pustejovsky, J.: Interpreting Motion - Grounded Representations for Spatial Language, Explorations in Language and Space, vol. 5. Oxford University Press, Oxford (2012)

  30. Muggleton, S., Raedt, L.D.: Inductive logic programming: theory and methods. J. Log. Program. 19(20), 629–679 (1994)

  31. Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016, pp. 779–788. IEEE Computer Society (2016). https://doi.org/10.1109/CVPR.2016.91

  32. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. CoRR abs/1804.02767 (2018). http://arxiv.org/abs/1804.02767

  33. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017). https://doi.org/10.1109/TPAMI.2016.2577031

  34. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y

  35. Schultz, C., Bhatt, M., Suchan, J., Wałęga, P.A.: Answer set programming modulo ‘space-time’. In: Benzmüller, C., Ricca, F., Parent, X., Roman, D. (eds.) RuleML+RR 2018. LNCS, vol. 11092, pp. 318–326. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99906-7_24

  36. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)

  37. Spranger, M., Suchan, J., Bhatt, M.: Robust natural language processing - combining reasoning, cognitive semantics and construction grammar for spatial language. In: 25th International Joint Conference on Artificial Intelligence, IJCAI 2016. AAAI Press, July 2016

  38. Srinivasan, A.: The Aleph Manual (2001). http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/

  39. Suchan, J., Bhatt, M.: The geometry of a scene: on deep semantics for visual perception driven cognitive film studies. In: 2016 IEEE Winter Conference on Applications of Computer Vision, WACV 2016, Lake Placid, NY, USA, 7–10 March 2016, pp. 1–9. IEEE Computer Society (2016). https://doi.org/10.1109/WACV.2016.7477712

  40. Suchan, J., Bhatt, M.: Semantic question-answering with video and eye-tracking data: AI foundations for human visual perception driven cognitive film studies. In: Kambhampati, S. (ed.) Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9–15 July 2016, pp. 2633–2639. IJCAI/AAAI Press (2016). http://www.ijcai.org/Abstract/16/374

  41. Suchan, J., Bhatt, M.: Deep semantic abstractions of everyday human activities: on commonsense representations of human interactions. In: ROBOT 2017: Third Iberian Robotics Conference, Advances in Intelligent Systems and Computing 693 (2017)

  42. Suchan, J., Bhatt, M., Schultz, C.P.L.: Deeply semantic inductive spatio-temporal learning. In: Cussens, J., Russo, A. (eds.) Proceedings of the 26th International Conference on Inductive Logic Programming (Short Papers), London, UK, vol. 1865, pp. 73–80. CEUR-WS.org (2016)

  43. Suchan, J., Bhatt, M., Varadarajan, S.: Out of sight but not out of mind: an answer set programming based online abduction framework for visual sensemaking in autonomous driving. In: Kraus, S. (ed.) Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, 10–16 August 2019, pp. 1879–1885. ijcai.org (2019). https://doi.org/10.24963/ijcai.2019/260

  44. Suchan, J., Bhatt, M., Varadarajan, S.: Driven by commonsense. In: Giacomo, G.D., et al. (eds.) ECAI 2020–24th European Conference on Artificial Intelligence, Santiago de Compostela, Spain, 29 August–8 September 2020 - Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020). Frontiers in Artificial Intelligence and Applications, vol. 325, pp. 2939–2940. IOS Press (2020). https://doi.org/10.3233/FAIA200463

  45. Suchan, J., Bhatt, M., Varadarajan, S.: Commonsense visual sensemaking for autonomous driving - on generalised neurosymbolic online abduction integrating vision and semantics. Artif. Intell. 299, 103522 (2021). https://doi.org/10.1016/j.artint.2021.103522

  46. Suchan, J., Bhatt, M., Varadarajan, S., Amirshahi, S.A., Yu, S.: Semantic analysis of (reflectional) visual symmetry: a human-centred computational model for declarative explainability. Adv. Cogn. Syst. 6, 65–84 (2018). http://www.cogsys.org/journal

  47. Suchan, J., Bhatt, M., Walega, P.A., Schultz, C.P.L.: Visual explanation by high-level abduction: on answer-set programming driven reasoning about moving objects. In: 32nd AAAI Conference on Artificial Intelligence (AAAI-2018), USA, pp. 1965–1972. AAAI Press (2018)

  48. Wałęga, P.A., Bhatt, M., Schultz, C.: ASPMT(QS): non-monotonic spatial reasoning with answer set programming modulo theories. In: Calimeri, F., Ianni, G., Truszczynski, M. (eds.) LPNMR 2015. LNCS (LNAI), vol. 9345, pp. 488–501. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23264-5_41

  49. Walega, P.A., Schultz, C.P.L., Bhatt, M.: Non-monotonic spatial reasoning with answer set programming modulo theories. Theory Pract. Log. Program. 17(2), 205–225 (2017). https://doi.org/10.1017/S1471068416000193

Author information

Correspondence to Mehul Bhatt.

Appendices

A Select Further Readings

Select readings pertaining to cognitive vision and perception are as follows:

  • Visuospatial Question-Answering: [40, 39, 41, 37]

  • Visuospatial Abduction: [43, 45, 47, 49]

  • Relational Visuospatial Learning: [42, 46, 19]

Select readings pertaining to foundational aspects of commonsense spatial reasoning (within a KR setting) are as follows:

  • Theory (Space, Action, Change): [4, 5, 8, 9]

  • Declarative Spatial Reasoning (CLP, ASP, ILP): [7, 35, 42, 48]

B Visual Computing Foundations

A robust low-level visual computing foundation, driven by state-of-the-art computer vision techniques (e.g., for visual feature detection and tracking), is necessary for realising explainable visual intelligence in the manner described in this chapter. The examples of this chapter (in Sect. 4), for instance, require extracting and analysing scene elements (i.e., people, body structure, and objects in the scene) and motion (i.e., object motion and scene motion), encompassing methods for the following (a minimal pipeline sketch follows the list):

  • Image Classification and Feature Learning – based on Big Data (e.g., ImageNet [17, 34]), using neural network architectures such as AlexNet [28], VGG [36], or ResNet [23].

  • Detection, i.e., of people and objects [11, 31, 32, 33], and faces [18, 24].

  • Pose Estimation, i.e., of body pose [13] (including fine-grained hand pose), and face and gaze analysis [1].

  • Segmentation, i.e., semantic segmentation [14] and instance segmentation [22].

  • Motion Analysis, i.e., optical flow based motion estimation [25] and movement tracking [2, 3].
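As a concrete illustration of such a foundation, the following is a minimal sketch (not the chapter's implementation) of the detection step feeding a symbolic abstraction layer: an off-the-shelf torchvision Faster R-CNN [33] produces (label, bounding-box) detections, which are abstracted into simple facts of the kind a declarative reasoner could consume. The function names (e.g., extract_scene_facts, left_of) and the score threshold are illustrative assumptions.

```python
# Minimal sketch: object/people detection feeding symbolic abstraction.
# Assumes torchvision >= 0.13 (for the `weights="DEFAULT"` argument).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor
from PIL import Image

COCO_PERSON = 1  # COCO class id for "person" in torchvision detection models

def extract_scene_facts(image_path, score_threshold=0.8):
    """Detect scene elements and abstract them into (label, box) facts."""
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        out = model([image])[0]  # dict with "boxes", "labels", "scores"
    facts = []
    for box, label, score in zip(out["boxes"], out["labels"], out["scores"]):
        if score >= score_threshold:
            facts.append((int(label), [round(v, 1) for v in box.tolist()]))
    return facts

def left_of(box_a, box_b):
    """A coarse qualitative spatial relation over detected boxes."""
    return box_a[2] < box_b[0]  # a's right edge is left of b's left edge

facts = extract_scene_facts("scene.jpg")
people = [f for f in facts if f[0] == COCO_PERSON]
print(people)  # e.g. [(1, [12.0, 30.5, 88.2, 200.1]), ...]
```

Such (label, box) facts are exactly the kind of low-level input that the declarative layers discussed in this chapter (e.g., ASP- or CLP-based reasoning [7, 35]) would further abstract into relational, commonsense characterisations of space and motion.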

Copyright information

© 2023 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Bhatt, M., Suchan, J. (2023). Artificial Visual Intelligence. In: Chetouani, M., Dignum, V., Lukowicz, P., Sierra, C. (eds.) Human-Centered Artificial Intelligence. ACAI 2021. Lecture Notes in Computer Science, vol. 13500. Springer, Cham. https://doi.org/10.1007/978-3-031-24349-3_12

  • DOI: https://doi.org/10.1007/978-3-031-24349-3_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-24348-6

  • Online ISBN: 978-3-031-24349-3

  • eBook Packages: Computer Science, Computer Science (R0)
