
A general description generator for human activity images based on deep understanding framework

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

Image description generation is of great application value in online image search. Inspired by recent findings in neocortex research, we design a deep image-understanding framework that implements a description generator for general images involving human activities. Unlike existing work on image description, which treats the task as a retrieval problem rather than attempting to understand the image, our framework recognizes the human–object interaction (HOI) activity in an image through co-occurrence analysis of its 3-D spatial layout and generates a natural-language description of what is actually happening in the image. We propose a deep hierarchical model for image recognition and a syntactic tree-based model for natural language generation. To support online image search, the two models are designed to extract features uniformly from humans and different object classes and to produce well-formed sentences describing exactly what is happening in the image. Through experiments on a dataset combining images from the phrasal recognition dataset, the six-class sports dataset, and the UIUC Pascal sentence dataset, we demonstrate that our framework outperforms state-of-the-art methods at recognizing HOI activities and generating image descriptions.
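To make the pipeline concrete, the sketch below illustrates the final step the abstract describes: turning a recognized human–object interaction triple into a well-formed sentence. This is a minimal toy in Python, not the authors' syntactic tree-based generator; the names HOITriple, gerund, and realize, and the simple inflection rules, are hypothetical stand-ins for illustration only.

```python
# Toy sketch of HOI-to-sentence realization (hypothetical; the paper's
# generator builds full syntactic trees rather than filling a template).
from dataclasses import dataclass

@dataclass
class HOITriple:
    human: str  # detected human category, e.g. "person"
    verb: str   # recognized interaction, e.g. "ride"
    obj: str    # interacting object, e.g. "bicycle"

def gerund(verb: str) -> str:
    """Very rough English -ing inflection, enough for common HOI verbs."""
    if verb.endswith("e") and not verb.endswith("ee"):
        return verb[:-1] + "ing"   # ride -> riding
    return verb + "ing"            # throw -> throwing

def realize(t: HOITriple) -> str:
    """Linearize a (human, verb, object) triple as a present-progressive
    sentence, the typical surface form of image descriptions."""
    article = "an" if t.obj[0].lower() in "aeiou" else "a"
    return f"A {t.human} is {gerund(t.verb)} {article} {t.obj}."

if __name__ == "__main__":
    print(realize(HOITriple("person", "ride", "bicycle")))
    # -> A person is riding a bicycle.
```

A real generator of this kind would select the verb and object from detector scores and choose among several tree templates; the fixed subject-verb-object pattern here only shows where the recognized triple enters the sentence.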


Acknowledgments

This work was supported by the National Basic Research Program of China (973 Program, No. 2013CB329605), the International Graduate Exchange Program of BIT, and the Training Program of the Major Project of BIT.

Author information

Corresponding author

Correspondence to Kan Li.


About this article

Cite this article

Zhou, Z., Li, K. & Bai, L. A general description generator for human activity images based on deep understanding framework. Neural Comput & Applic 28, 2147–2163 (2017). https://doi.org/10.1007/s00521-015-2171-x

