Abstract
Scene recognition is an important branch of computer vision and a common deep learning task. Different scenes are characterized by different “key objects”. A neural network for scene recognition therefore needs to extract the features of these key objects, and sometimes must also integrate the positional relations between objects, to determine the class to which a scene belongs. In some cases the key objects are very small, and their features become extremely inconspicuous or even disappear in the deep layers of the network. We call such scenes “small object-supported scenes”. In this paper, we propose the Multi-Level Ensemble Network (MLEN), a convolutional neural network designed to improve recognition accuracy on these scenes. Features from multiple levels of the network are used to make separate predictions, and ensemble learning is then performed within the network to produce the final prediction. In addition, a “Feature Transfer Path” is added and feature fusion methods are adopted to make full use of both low-level and high-level features. We also design a class-weight loss function to address non-uniform class distribution; this function can further improve accuracy on most scene recognition datasets. Experiments are conducted on the Urban Management Case (UMC) dataset, which we collated from two smart urban management system databases, and on the Places-mini dataset, a subset of the well-known Places dataset [36]. The results show that MLEN achieves much higher accuracy than state-of-the-art scene recognition networks on both datasets.
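
The two mechanisms named in the abstract, combining per-level predictions and weighting the loss by class, can be illustrated with a minimal sketch. The abstract does not give the ensemble rule or the weight formula, so the code below assumes simple averaging of per-level softmax outputs and inverse-frequency class weights normalised to mean 1. All function names are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(level_logits):
    """Average class probabilities predicted from several network levels.

    Assumption: plain averaging; the paper's in-network ensemble may differ.
    """
    probs = [softmax(logits) for logits in level_logits]
    n_levels = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_levels for c in range(n_classes)]

def inverse_frequency_weights(class_counts):
    """Class weights proportional to 1/count, rescaled to have mean 1.

    Assumption: inverse frequency is one common choice, not the paper's formula.
    """
    inv = [1.0 / c for c in class_counts]
    scale = len(inv) / sum(inv)
    return [w * scale for w in inv]

def class_weighted_nll(probs, label, weights):
    """Negative log-likelihood of the true class, scaled by its class weight."""
    return -weights[label] * math.log(probs[label])

# Hypothetical logits from three levels (shallow, middle, deep), three classes.
level_logits = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.1, 2.2, 0.4]]
probs = ensemble_predict(level_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

# Hypothetical, strongly imbalanced training-set class counts.
weights = inverse_frequency_weights([900, 90, 10])
loss = class_weighted_nll(probs, 1, weights)
```

In this sketch the rarest class receives the largest weight, so errors on under-represented scene classes contribute more to the loss than errors on frequent ones.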

References
Bertinetto L, Valmadre J, Henriques JF, et al. (2016) Fully-Convolutional Siamese Networks for Object Tracking[C]// European Conference on Computer Vision. Springer International Publishing, 850–865
Chen Y, Li J, Xiao H, et al. (2017) Dual Path Networks[J]
Cheng Z, Shen J (2016) On very large scale test collection for landmark image search benchmarking[J]. Signal Processing, 124:13–26
Cheng Z, Chang X, et al. (2018) MMALFM: Explainable Recommendation by Leveraging Reviews and Images[J]. ACM Transactions on Information Systems
Danelljan M, Bhat G, Khan FS, et al. (2016) ECO: Efficient Convolution Operators for Tracking[J]. 6931–6939
Ding G, Chen W et al (2018) Real-Time Scalable Visual Tracking via Quadrangle Kernelized Correlation Filters[J]. IEEE Trans Intell Transp Syst 19(1):140–150
Fan H, Ling H (2017) SANet: Structure-Aware Network for Visual Tracking[C]// Computer Vision and Pattern Recognition Workshops. IEEE, 2217–2224
George M, Dixit M, Zogg G, et al. (2016) Semantic Clustering for Robust Fine-Grained Scene Recognition[M]// Computer Vision – ECCV 2016. Springer International Publishing, 783–798.
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks[J]. J Mach Learn Res 9:249–256
Hariharan B, Arbeláez P, Girshick R et al (2014) Simultaneous Detection and Segmentation[C]// European Conference on Computer Vision. Springer, Cham, pp 297–312
He K, Zhang X, Ren S, et al. (2016) Deep Residual Learning for Image Recognition[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 770–778
He K, Gkioxari G, Dollar P et al. (2017) Mask R-CNN[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, PP(99):1–1
Herranz L, Jiang S, Li X (2016) Scene Recognition with CNNs: Objects, Scales and Dataset Bias[C]// Computer Vision and Pattern Recognition. IEEE, 571–579
Hu J, Shen L, Sun G (2017) Squeeze-and-Excitation Networks[J]
Huang G, Liu Z, van der Maaten L, et al. (2016) Densely Connected Convolutional Networks[J]. 2261–2269
Ioffe S, Szegedy C (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift[J]. 448–456
Jia Y, Shelhamer E, Donahue J et al. (2014) Caffe: Convolutional Architecture for Fast Feature Embedding[J].
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks[C]// International Conference on Neural Information Processing Systems. Curran Associates Inc. 1097–1105.
Lécun Y, Bottou L, Bengio Y et al (1998) Gradient-based learning applied to document recognition[J]. Proc IEEE 86(11):2278–2324
Li Y, Qi H, Dai J, et al. (2016) Fully Convolutional Instance-aware Semantic Segmentation[J]. 4438–4446
Romera-Paredes B, Torr PHS (2016) Recurrent Instance Segmentation[C]// European Conference on Computer Vision. Springer International Publishing, 312–329
Shen L, Lin Z, Huang Q (2016) Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks[C]// European Conference on Computer Vision. Springer International Publishing, 467–482
Szegedy C, Liu W, Jia Y, et al. (2015) Going deeper with convolutions[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–9
Szegedy C, Vanhoucke V, Ioffe S, et al. (2016) Rethinking the Inception Architecture for Computer Vision[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2818–2826
Szegedy C, Ioffe S, Vanhoucke V, et al. (2016) Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning[J]
Wang L, Ouyang W, Wang X et al. (2016) Visual Tracking with Fully Convolutional Networks[C]// IEEE International Conference on Computer Vision. IEEE, 3119–3127
Xie S, Girshick R, Dollar P, et al. (2017) Aggregated Residual Transformations for Deep Neural Networks[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 5987–5995
Yan C, Tu Y, Wang X, et al. (2019) STAT: Spatial-Temporal Attention Mechanism for Video Captioning, IEEE Transactions on Multimedia
Yan C, Li L, Zhang C, et al. (2019) Cross-modality Bridging and Knowledge Transferring for Image Understanding, IEEE Transactions on Multimedia
Zagoruyko S, Komodakis N (2016) Wide Residual Networks[J]
Zeiler MD, Fergus R (2014) Visualizing and Understanding Convolutional Networks[J]. 8689:818–833
Zhao S, Yao H et al. (2016) Continuous Probability Distribution Prediction of Image Emotions via Multi-Task Shared Sparse Regression[J]. IEEE Transactions on Multimedia, PP(99):1–1
Zhao S, Yao H, et al. (2016) Predicting Personalized Image Emotion Perceptions in Social Networks[J]. IEEE Transactions on Affective Computing, 1–1
Zhao S, Gao Y, et al. (2017) Real-Time Multimedia Social Event Detection in Microblog[J]. IEEE Transactions on Cybernetics, 1–14
Zhou B, Lapedriza A, Xiao J, et al. (2014) Learning deep features for scene recognition using places database[C]// International Conference on Neural Information Processing Systems. MIT Press, 487–495
Zhou B, Lapedriza A, Khosla A, et al. (2018) Places: A 10 million Image Database for Scene Recognition[J]. IEEE Trans Pattern Anal Mach Intell, PP(99):1–1
Ethics declarations
Conflict of Interest
The authors declared that they have no conflicts of interest.
Cite this article
Zhang, L., Li, L., Pan, X. et al. Multi-Level Ensemble Network for Scene Recognition. Multimed Tools Appl 78, 28209–28230 (2019). https://doi.org/10.1007/s11042-019-07933-2
DOI: https://doi.org/10.1007/s11042-019-07933-2