Abstract
Previous studies have shown that the complexity and variation of event images are the major challenges in event classification. We address the problem with an integrated method that uses a Long Short-Term Memory (LSTM) network to fuse multiple Convolutional Neural Networks (CNNs). To handle complexity, we use three dedicated CNNs to extract scene, object, and human visual cues, respectively. To reduce the semantic gap and exploit the complementarity of features at different levels, we adopt AlexNet and VGG-16 as the base architectures and concatenate the outputs of their first and second fully-connected layers. To capture the contextual correlations between visual cues, we arrange the concatenated features of the three CNNs in the order scene, object, human and feed them as a whole into the LSTM network. To add further context, we crop each image into five blocks as input, so that, owing to the temporal characteristics of the LSTM, each individual image is supplemented with contextual features. We evaluate our method on the Web Image Dataset for Event Recognition (WIDER), and the results demonstrate the effectiveness of each of these design choices. Compared with state-of-the-art methods, the proposed method offers a promising way to improve event classification performance.
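The data flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say exactly how crops and cues map onto LSTM time steps, so the sketch assumes one time step per image block, with each step holding the scene, object, and human features concatenated in that order. The helper names (`fake_cnn_features`, `build_sequence`, `encode_image`), the toy feature size, the hidden size, and the random weights are all hypothetical stand-ins; real fully-connected outputs of AlexNet or VGG-16 would be 4096-dimensional.

```python
import math
import random

CUES = ("scene", "object", "human")  # cue order stated in the abstract
NUM_CROPS = 5                         # five image blocks per image
FC_DIM = 4                            # toy stand-in for a 4096-d fc layer


def fake_cnn_features(cue, crop_idx, rng):
    """Hypothetical stand-in for one cue-specific CNN: returns (fc6, fc7)."""
    fc6 = [rng.uniform(-1, 1) for _ in range(FC_DIM)]
    fc7 = [rng.uniform(-1, 1) for _ in range(FC_DIM)]
    return fc6, fc7


def build_sequence(rng):
    """Assumed layout: one step per crop, each step = scene|object|human."""
    sequence = []
    for crop_idx in range(NUM_CROPS):
        step = []
        for cue in CUES:
            fc6, fc7 = fake_cnn_features(cue, crop_idx, rng)
            step.extend(fc6 + fc7)  # concatenate first and second fc outputs
        sequence.append(step)
    return sequence


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_lstm(input_dim, hidden_dim, rng):
    """Random weights for the input, forget, output, and candidate gates."""
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
                for _ in range(hidden_dim)]
    return {gate: matrix() for gate in ("i", "f", "o", "g")}


def lstm_step(params, x, h, c):
    """Standard LSTM cell update (biases omitted for brevity)."""
    z = x + h  # list concatenation of input and previous hidden state

    def affine(gate):
        return [sum(w * v for w, v in zip(row, z)) for row in params[gate]]

    i = [sigmoid(a) for a in affine("i")]
    f = [sigmoid(a) for a in affine("f")]
    o = [sigmoid(a) for a in affine("o")]
    g = [math.tanh(a) for a in affine("g")]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def encode_image(seed=0, hidden_dim=8):
    """Run the untrained LSTM over the crop sequence; return final state."""
    rng = random.Random(seed)
    sequence = build_sequence(rng)
    params = init_lstm(len(sequence[0]), hidden_dim, rng)
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
    return sequence, h
```

In a trained system the final hidden state `h` would feed a classifier over the event categories. Here the weights are random, so the sketch only shows how the fused features are shaped and ordered.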
Acknowledgments
This work was supported in part by the National Science Foundation Project of P. R. China under Grants No. 52071349 and No. 61701554, the cross-discipline research project of Minzu University of China (2020MDJC08), the State Language Commission Key Project (ZDl135-39), the Promotion Plan for Young Teachers’ Scientific Research Ability of Minzu University of China, the MUC 111 Project, and First Class Courses (Digital Image Processing KC2066). We gratefully acknowledge Dr. Lizhi Zhao for providing part of the revised manuscript and for valuable discussion.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Li, P., Tang, H., Yu, J. et al. LSTM and multiple CNNs based event image classification. Multimed Tools Appl 80, 30743–30760 (2021). https://doi.org/10.1007/s11042-020-10165-4
DOI: https://doi.org/10.1007/s11042-020-10165-4