
A general description generator for human activity images based on deep understanding framework

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

Image description generation is of great application value in online image search. Inspired by recent findings in neocortex research, we design a deep image-understanding framework that implements a description generator for general images involving human activities. Unlike existing work on image description, which treats the task as a retrieval problem rather than attempting to understand the image, our framework recognizes the human–object interaction (HOI) activity in an image through co-occurrence analysis of its 3-D spatial layout and generates a natural-language description of what is actually happening in the image. We propose a deep hierarchical model for image recognition and a syntactic tree-based model for natural language generation. To support online image search, the two models are designed to extract features uniformly from humans and different object classes and to produce well-formed sentences describing exactly what is happening in the image. Through experiments on a dataset combining images from the phrasal recognition dataset, the six-class sports dataset, and the UIUC Pascal sentence dataset, we demonstrate that our framework outperforms state-of-the-art methods at recognizing HOI activities and generating image descriptions.
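To make the pipeline concrete, the sketch below illustrates the final step the abstract describes: turning a recognized human–object interaction triple into a well-formed sentence. This is a minimal toy in Python, not the authors' syntactic tree-based generator; the names HOITriple, gerund, and realize, and the simple inflection rules, are hypothetical stand-ins for illustration only.

```python
# Toy sketch of HOI-to-sentence realization (hypothetical; the paper's
# generator builds full syntactic trees rather than filling a template).
from dataclasses import dataclass

@dataclass
class HOITriple:
    human: str  # detected human category, e.g. "person"
    verb: str   # recognized interaction, e.g. "ride"
    obj: str    # interacting object, e.g. "bicycle"

def gerund(verb: str) -> str:
    """Very rough English -ing inflection, enough for common HOI verbs."""
    if verb.endswith("e") and not verb.endswith("ee"):
        return verb[:-1] + "ing"   # ride -> riding
    return verb + "ing"            # throw -> throwing

def realize(t: HOITriple) -> str:
    """Linearize a (human, verb, object) triple as a present-progressive
    sentence, the typical surface form of image descriptions."""
    article = "an" if t.obj[0].lower() in "aeiou" else "a"
    return f"A {t.human} is {gerund(t.verb)} {article} {t.obj}."

if __name__ == "__main__":
    print(realize(HOITriple("person", "ride", "bicycle")))
    # -> A person is riding a bicycle.
```

A real generator of this kind would select the verb and object from detector scores and choose among several tree templates; the fixed subject-verb-object pattern here only shows where the recognized triple enters the sentence.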


Acknowledgments

This work was supported by the National Basic Research Program of China (973 Program, No. 2013CB329605), the International Graduate Exchange Program of BIT, and the Training Program of the Major Project of BIT.

Author information

Corresponding author

Correspondence to Kan Li.


About this article

Cite this article

Zhou, Z., Li, K. & Bai, L. A general description generator for human activity images based on deep understanding framework. Neural Comput & Applic 28, 2147–2163 (2017). https://doi.org/10.1007/s00521-015-2171-x

