Abstract
Multi-label classification is one of the most challenging tasks in the computer vision community, owing to different composition and interaction (e.g. partial visibility or occlusion) between objects in multi-label images. Intuitively, some objects usually co-occur with some specific scenes, e.g. the sofa often appears in a living room. Therefore, the scene of a given image may provides informative cues for identifying those embedded objects. In this paper, we propose a novel scene-aware deep framework for addressing the challenging multi-label classification task. In particular, we incorporate two sub-networks that are pre-trained for different tasks (i.e. object classification and scene classification) into a unified framework, so that informative scene-aware cues can be leveraged for benefiting multi-label object classification. In addition, we also present a novel one vs. all multiple-cross-entropy (MCE) loss for optimizing the proposed scene-aware deep framework by independently penalizing the classification error for each label. The proposed method can be learned in an end-to-end manner and extensive experimental results on Pascal VOC 2007 and MS COCO demonstrate that our approach is able to make a noticeable improvement for the multi-label classification task.
Similar content being viewed by others
References
Chang CC, Lin CJ (2011) Libsvm: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):27
Chatfield K, Simonyan K, Vedaldi A, Zisserman A (2014) Return of the devil in the details: delving deep into convolutional nets. In: British machine vision conference
Cheng MM, Zhang Z, Lin WY, Torr PHS (2014) BING: binarized normed gradients for objectness estimation at 300fps. In: Computer vision and pattern recognition
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: Computer vision and pattern recognition, pp 248–255
Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2013) Decaf: a deep convolutional activation feature for generic visual recognition. arXiv:1310.1531
Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A The PASCAL Visual Object Classes Challenge 2007 (VOC2007) results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
Girshick R, Donahue J, Darrell T, Malik J (2013) Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524
Gong Y, Jia Y, leung TK, Toshev A, Ioffe S (2014) Deep convolutional ranking for multi label image annotation. In: International conference on learning representations
He K, Zhang X, Ren S, Sun J (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. In: European conference on computer vision, pp 346–361
Jia Y (2013) Caffe: an open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/
Johnson J, Ballan L, Fei-Fei L (2015) Love thy neighbors: image annotation by exploiting image metadata. In: Proceedings of the IEEE international conference on computer vision, pp 4624–4632
Kordumova S, Mensink T, Snoek CG (2016) Pooling objects for recognizing scenes without examples. In: Proceedings of the 2016 ACM on international conference on multimedia retrieval, pp 143–150
Krizhevsky A, Sutskever I, Hinton G (2012) Imagenet classification with deep convolutional neural networks. In: Neural information processing systems, pp 1106–1114
Lai H, Yan P, Shu X, Wei Y, Yan S (2016) Instance-aware hashing for multi-label image retrieval. IEEE Trans Image Process 25(6):2469–2479
Li X, Uricchio T, Ballan L, Bertini M, Snoek CG, Bimbo AD (2016) Socializing the semantic gap: a comparative survey on image tag assignment, refinement, and retrieval. ACM Comput Surv 49(1):14
Liang X, Liu S, Wei Y, Liu L, Lin L, Yan S (2014) Computational baby learning. arXiv:1411.2861
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: European conference on computer vision, pp 740–755
Long J, Shelhamer E, Darrell T (2014) Fully convolutional networks for semantic segmentation. arXiv:1411.4038
Oquab M, Bottou L, Laptev I, Sivic J (2014) Learning and transferring mid-level image representations using convolutional neural networks. In: Computer vision and pattern recognition, pp 1717–1724
Oquab M, Bottou L, Laptev I, Sivic J (2014) Weakly supervised object recognition with convolutional neural networks. Tech. Rep. HAL-01015140, INRIA
Quattoni A, Torralba A (2009) Recognizing indoor scenes. In: IEEE conference on computer vision and pattern recognition, pp 413–420
Razavian AS, Azizpour H, Sullivan J, Carlsson S (2014) Cnn features off-the-shelf: an astounding baseline for recognition. arXiv:1403.6382
Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y (2013) Overfeat: integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2014) Going deeper with convolutions. arXiv:1409.4842
Verma Y, Jawahar C (2017) A support vector approach for cross-modal search of images and texts. Comput Vis Image Underst 154:48–63
Wang L, Wang Z, Du W, Qiao Y (2015) Object-scene convolutional neural networks for event recognition in images. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 30–35
Wei Y, Liang X, Chen Y, Shen X, Cheng MM, Zhao Y, Yan S (2015) Stc: a simple to complex framework for weakly-supervised semantic segmentation. arXiv:1509.03150
Wei Y, Xia W, Lin M, Huang J, Ni B, Dong J, Zhao Y, Yan S (2016) Hcp: a flexible cnn framework for multi-label image classification. IEEE Trans Pattern Recognit Mach Intell 38(9):1901–1907
Wei Y, Zhao Y, Lu C, Wei S, Liu L, Zhu Z, Yan S (2016) Cross-modal retrieval with cnn visual features: a new baseline. IEEE Trans Cybern 47 (2):449–460
Xiao J, Hays J, Ehinger K, Oliva A, Torralba A et al (2010) Sun database: large-scale scene recognition from abbey to zoo. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 3485–3492
Zhou B, Lapedriza A, Xiao J, Torralba A, Oliva A (2014) Learning deep features for scene recognition using places database. In: Advances in neural information processing systems, pp 487–495
Zitnick CL, Dollár P (2014) Edge boxes: locating object proposals from edges. In: European conference on computer vision. Springer, pp 391–405
Acknowledgments
This work was supported in part by National Natural Science Foundation of China (No.61272353, 61370128, 61428201), Program for New Century Excellent Talents in University (NCET-13-0659), Scientic and Technological Research of Shandong, China (NO.2016GGX101029).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Li, Z., Lu, W., Sun, Z. et al. Improving multi-label classification using scene cues. Multimed Tools Appl 77, 6079–6094 (2018). https://doi.org/10.1007/s11042-017-4517-0
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-017-4517-0