LSTM and multiple CNNs based event image classification

Li, Peian; Tang, Huadong; Yu, Jing; Song, Wei

doi:10.1007/s11042-020-10165-4

1162: Machine learning for big multimedia analytics
Published: 23 November 2020

Volume 80, pages 30743–30760, (2021)
Multimedia Tools and Applications

Peian Li^1,2, Huadong Tang^2, Jing Yu^2 & Wei Song^1,3 (ORCID: orcid.org/0000-0002-2324-4302)

Abstract

Previous studies have shown that the complexity and variation of event images are the major challenges in event classification. We address the problem with an integrated method that uses a Long Short-Term Memory (LSTM) network to fuse multiple Convolutional Neural Networks (CNNs). To handle complexity, we use three dedicated CNNs to extract scene, object and human visual cues, respectively. To reduce the semantic gap and exploit the complementarity of features at different levels, we adopt AlexNet and VGG-16 as the base architectures and concatenate the outputs of their first and second fully-connected layers. To capture the contextual correlations between visual cues, we arrange the concatenated features of the three CNNs in the order scene, object, human and feed them as a single sequence into the LSTM network. To add context, we crop each image into five blocks as input, so that, owing to the temporal nature of the LSTM, each image is supplemented with contextual features. We evaluate our method on the Web Image Dataset for Event Recognition (WIDER), and the results demonstrate the effectiveness of each of these design choices. Compared with state-of-the-art methods, the proposed method offers a promising way to improve event classification performance.
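The data flow described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the crop layout, the feature dimensions, or how the five blocks map onto LSTM time steps. The sketch assumes a four-corners-plus-centre crop layout, and the names `five_crop_boxes`, `fuse_fc_layers`, `build_cue_sequence` and `CUE_ORDER` are hypothetical. Only the concatenation of the two fully-connected outputs and the scene, object, human ordering come from the abstract.

```python
from typing import Dict, List, Sequence, Tuple

# Order in which the cue features are presented to the LSTM (from the abstract).
CUE_ORDER = ("scene", "object", "human")

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def five_crop_boxes(width: int, height: int, crop_w: int, crop_h: int) -> List[Box]:
    """Return five crop boxes: four corners plus the centre.

    ASSUMPTION: the abstract only says the image is cropped into five
    blocks; the corner-plus-centre layout is a common choice, not a
    detail confirmed by the source.
    """
    if crop_w > width or crop_h > height:
        raise ValueError("crop size must not exceed image size")
    right, bottom = width - crop_w, height - crop_h
    origins = [
        (0, 0),                      # top-left
        (right, 0),                  # top-right
        (0, bottom),                 # bottom-left
        (right, bottom),             # bottom-right
        (right // 2, bottom // 2),   # centre
    ]
    return [(x, y, x + crop_w, y + crop_h) for x, y in origins]


def fuse_fc_layers(fc1: Sequence[float], fc2: Sequence[float]) -> List[float]:
    """Concatenate the first and second fully-connected outputs of one CNN."""
    return list(fc1) + list(fc2)


def build_cue_sequence(
    features: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> List[List[float]]:
    """Arrange the fused per-cue features as a scene, object, human sequence.

    `features` maps a cue name to its (fc1, fc2) output pair. The returned
    list has one fused vector per cue, ready to be fed step by step into a
    recurrent model (the LSTM itself is omitted here).
    """
    return [fuse_fc_layers(*features[cue]) for cue in CUE_ORDER]


if __name__ == "__main__":
    # Toy two-element vectors stand in for real CNN activations.
    toy = {
        "human": ([5.0], [6.0]),
        "scene": ([1.0], [2.0]),
        "object": ([3.0], [4.0]),
    }
    print(build_cue_sequence(toy))
    print(five_crop_boxes(256, 256, 224, 224))
```

In a full pipeline each crop would presumably be passed through the three CNNs before fusion; how the per-crop sequences are combined inside the LSTM is left open here because the abstract does not specify it.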

Acknowledgments

This work was supported in part by the National Science Foundation Project of P. R. China under Grants No. 52071349 and No. 61701554, the cross-discipline research project of Minzu University of China (2020MDJC08), the State Language Commission Key Project (ZDl135-39), the Promotion Plan for Young Teachers’ Scientific Research Ability of Minzu University of China, the MUC 111 Project, and First Class Courses (Digital Image Processing, KC2066). We gratefully acknowledge the assistance of Dr. Lizhi Zhao in providing part of the revised manuscript and valuable discussion.

Author information

Authors and Affiliations

School of Information Engineering, Minzu University of China, Beijing, China
Peian Li & Wei Song
School of Electronic Information and Engineering, Beijing Jiaotong University, Beijing, China
Peian Li, Huadong Tang & Jing Yu
National Language Resource Monitoring and Research Center of Minority Languages, Minzu University of China, Beijing, China
Wei Song

Corresponding author

Correspondence to Wei Song.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, P., Tang, H., Yu, J. et al. LSTM and multiple CNNs based event image classification. Multimed Tools Appl 80, 30743–30760 (2021). https://doi.org/10.1007/s11042-020-10165-4

Received: 24 March 2020
Revised: 26 September 2020
Accepted: 10 November 2020
Published: 23 November 2020
Issue Date: August 2021
DOI: https://doi.org/10.1007/s11042-020-10165-4
