Abstract
Training object detectors with only image-level annotations is an important problem with a variety of applications. However, due to the deformable nature of objects, a target object delineated by a bounding box always includes irrelevant context and occlusions, which causes large intra-class object variations and ambiguity in object-background distinction. For this reason, identifying the object of interest against a substantial amount of cluttered background is very challenging. In this paper, we propose a decoupled attention-based deep model to optimize region-based object representation. Unlike existing approaches that model object representation in a single tower, our proposed network decouples object representation into two separate modules, i.e., image representation and attention localization. The image representation module captures content-based semantic representation, while the attention localization module regresses an attention map that simultaneously highlights the locations of the discriminative object parts and down-weights the irrelevant background present in the image. The combined representation alleviates the impact of noisy context and occlusions inside an object bounding box. As a result, object-background ambiguity can be largely reduced and background regions can be suppressed effectively. In addition, the proposed object representation model can be seamlessly integrated into a state-of-the-art weakly supervised detection framework, and the entire model can be trained end-to-end. We extensively evaluate the detection performance on the PASCAL VOC 2007, VOC 2010 and VOC 2012 datasets. Experimental results demonstrate that our approach effectively improves weakly supervised object detection.
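The decoupling described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the two modules are combined, so the softmax normalization inside the box, the function name `attention_pool`, and the toy arrays are all assumptions made for illustration. The idea shown is only that a region feature pooled with attention weights is dominated by the highlighted object cells rather than by the background cells inside the same box.

```python
import numpy as np


def attention_pool(features, attention_logits, box):
    """Pool a region feature as an attention-weighted average (illustrative only).

    features: (C, H, W) array, standing in for the image representation module.
    attention_logits: (H, W) array, standing in for the attention localization module.
    box: (x0, y0, x1, y1) in feature-map cells, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    region = features[:, y0:y1, x0:x1]        # (C, h, w)
    logits = attention_logits[y0:y1, x0:x1]   # (h, w)
    # Assumed combination rule: softmax over the cells inside the box,
    # so the weights are positive and sum to 1.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return (region * weights).sum(axis=(1, 2))  # (C,)


# Toy input: channel 0 fires on "object" cells, channel 1 on "background" cells.
features = np.zeros((2, 4, 4))
features[0, 1:3, 1:3] = 1.0
features[1] = 1.0 - features[0]

# Hypothetical attention map: high logits on the object cells only.
attention_logits = np.zeros((4, 4))
attention_logits[1:3, 1:3] = 4.0

box = (0, 0, 4, 4)
plain = features[:, 0:4, 0:4].mean(axis=(1, 2))   # unweighted average pooling
attended = attention_pool(features, attention_logits, box)
```

In this toy case the box covers 4 object cells and 12 background cells, so plain average pooling gives the object channel a weight of 0.25, while the attention-weighted feature puts most of its mass on the object channel.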







Acknowledgements
This work is supported by the Chinese National Natural Science Foundation under Grants 61471049, 61372169 and 61532018.
Cite this article
Jiang, W., Zhao, Z. & Su, F. Weakly supervised detection with decoupled attention-based deep representation. Multimed Tools Appl 77, 3261–3277 (2018). https://doi.org/10.1007/s11042-017-5087-x
DOI: https://doi.org/10.1007/s11042-017-5087-x