
Egocentric visual scene description based on human-object interaction and deep spatial relations among objects

Published in Multimedia Tools and Applications

Abstract

Visual scene interpretation has been a major area of research in recent years. Recognizing human-object interaction is a fundamental step towards understanding visual scenes. Videos can be described through a variety of human-object interaction scenarios: both the human and the object are static (static-static), one is static while the other is dynamic (static-dynamic), or both are dynamic (dynamic-dynamic). This paper presents a unified framework for describing these interactions between humans and a variety of objects, with deep learning as the pivot methodology. Human-object interaction is extracted through native machine learning techniques, while spatial relations are captured by training a convolutional neural network. We also address the recognition of human posture in detail to provide an egocentric visual description. After extracting visual features, sequential minimal optimization is employed to train our model. The extracted interaction, spatial relations and posture information are fed into a natural language generation module along with the interacting object label to produce a scene description. The proposed framework is evaluated on two state-of-the-art datasets, MSCOCO and the MSR3D Daily Activity dataset, achieving accuracies of 78% and 91.16%, respectively.
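The paper does not include code; as a rough illustration only (not the authors' implementation), the pipeline sketched in the abstract, visual features classified with an SMO-trained SVM and the result passed to a template-based natural language generation step, could look as follows. The feature extractor, labels, and function names such as describe_scene are hypothetical placeholders.

```python
# Illustrative sketch only: an SMO-trained SVM over visual features, followed by
# simple template-based sentence generation, loosely mirroring the pipeline the
# abstract describes. Training data and labels below are placeholders.
import numpy as np
from sklearn.svm import SVC  # scikit-learn's SVC uses a libsvm/SMO-style solver

# Hypothetical training data: rows are visual feature vectors, labels are
# interaction classes such as "holding", "drinking", "walking".
X_train = np.random.rand(100, 128)
y_train = np.random.choice(["holding", "drinking", "walking"], size=100)

clf = SVC(kernel="rbf")  # sequential minimal optimization under the hood
clf.fit(X_train, y_train)

def describe_scene(features, object_label, spatial_relation, posture):
    """Template-based natural language generation from the predicted
    interaction, the detected object label, its spatial relation, and posture."""
    interaction = clf.predict(features.reshape(1, -1))[0]
    return (f"A {posture} person is {interaction} the {object_label} "
            f"{spatial_relation} them.")

# Example usage with made-up inputs.
print(describe_scene(np.random.rand(128), "cup", "in front of", "seated"))
```

The real system would replace the random features with descriptors extracted from the video (object detections, joint positions, spatial-relation scores) and use a richer sentence template, but the classify-then-verbalize structure is the same.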




Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIP) (No. 2016R1A2B4011712) and by IGNITE, National Technology Fund, Pakistan, for the project entitled “Automatic Surveillance System for Video Sequences”.

Author information


Corresponding author

Correspondence to Irfan Mehmood.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Khan, G., Ghani, M.U., Siddiqi, A. et al. Egocentric visual scene description based on human-object interaction and deep spatial relations among objects. Multimed Tools Appl 79, 15859–15880 (2020). https://doi.org/10.1007/s11042-018-6286-9


