Main objects interaction activity recognition in real images

Original Article
Neural Computing and Applications

Abstract

Automatically generating image captions is a challenging task in computer vision. The difficulty mainly lies in capturing the interesting objects and recognizing their interaction activity. In this paper, we introduce the “centerpiece interaction,” a complex visual composite, to represent the interaction activity of the main objects. We propose a centerpiece interaction recognition framework that treats the detection of the interesting objects and the recognition of their interaction activity as a single integrated task. In our framework, a graph-based model is first proposed to learn the 2.5D spatial co-occurrence context among objects, which strongly facilitates the detection of the interesting objects. Second, with the help of the learned 2.5D spatial co-occurrence context, we propose a hierarchical model that learns the relational features of the interesting objects in a hierarchy of stages by integrating their individual features, which significantly improves the recognition of the centerpiece interaction. Experiments on a joint dataset show that our framework outperforms the state of the art in spatial co-occurrence context analysis, interesting object detection, and centerpiece interaction recognition.
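To make the two-stage structure concrete, the following is a minimal Python sketch of the data flow described above, under the assumption of per-object detections annotated with a coarse 2.5D depth layer. All names (DetectedObject, spatial_cooccurrence_context, select_interesting_objects, recognize_centerpiece_interaction) are hypothetical and introduced only for illustration; the paper's stage 1 is a learned graph-based model and its stage 2 a learned hierarchical model, for which simple hand-written placeholders are substituted here.

```python
# A minimal, hypothetical sketch of the two-stage data flow described above.
# All names here are illustrative assumptions, not the paper's actual
# implementation: the paper learns stage 1 with a graph-based model and
# stage 2 with a hierarchical model; this sketch uses simple placeholders.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in image coordinates


@dataclass
class DetectedObject:
    label: str
    box: Box
    depth_layer: int  # coarse 2.5D layer: 0 = nearest, larger = farther


def spatial_cooccurrence_context(
    objects: List[DetectedObject],
) -> List[Tuple[str, str, str]]:
    """Stage 1 (placeholder): derive pairwise 2.5D spatial relations.

    The paper learns this context with a graph-based model; here we only
    read off a crude relation from relative depth layers to show the flow.
    """
    relations = []
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            if a.depth_layer < b.depth_layer:
                rel = "in_front_of"
            elif a.depth_layer > b.depth_layer:
                rel = "behind"
            else:
                rel = "same_layer"
            relations.append((a.label, rel, b.label))
    return relations


def select_interesting_objects(objects: List[DetectedObject]) -> List[DetectedObject]:
    """Stage 1 output (placeholder): treat objects in the nearest 2.5D layer
    as the interesting (main) objects; the paper instead exploits the learned
    spatial co-occurrence context for this selection."""
    if not objects:
        return []
    nearest = min(o.depth_layer for o in objects)
    return [o for o in objects if o.depth_layer == nearest]


def recognize_centerpiece_interaction(
    interesting: List[DetectedObject],
    relations: List[Tuple[str, str, str]],
) -> str:
    """Stage 2 (placeholder): turn relational features into an interaction
    description; the paper uses a hierarchical model over the integrated
    features of the interesting objects."""
    names = {o.label for o in interesting}
    kept = [r for r in relations if r[0] in names and r[2] in names]
    return f"centerpiece interaction among {sorted(names)} via {kept}"


if __name__ == "__main__":
    scene = [
        DetectedObject("person", (40, 30, 120, 260), depth_layer=0),
        DetectedObject("horse", (90, 60, 220, 240), depth_layer=0),
        DetectedObject("tree", (300, 10, 80, 300), depth_layer=2),
    ]
    relations = spatial_cooccurrence_context(scene)
    main_objects = select_interesting_objects(scene)
    print(recognize_centerpiece_interaction(main_objects, relations))
```

On this toy scene, the sketch selects the person and the horse as the main objects and reports their same-layer relation; it only mirrors the intended interfaces between the two stages, not the learned models themselves.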

Acknowledgments

The research was supported in part by the Natural Science Foundation of China (No. 60903071), the National Basic Research Program of China (973 Program, No. 2013CB329605), the Specialized Research Fund for the Doctoral Program of Higher Education of China, and the Training Program of the Major Project of BIT.

Author information

Corresponding author

Correspondence to Lin Bai.

Cite this article

Bai, L., Li, K., Pei, J. et al. Main objects interaction activity recognition in real images. Neural Comput & Applic 27, 335–348 (2016). https://doi.org/10.1007/s00521-015-1846-7
