Abstract
Learning object detectors from weak image annotations is an important yet challenging problem. Many weakly supervised approaches formulate the task as a multiple instance learning (MIL) problem, where each image is represented as a bag of instances. To predict the score of each object class that occurs in an image, existing MIL-based approaches tend to select the instance that responds most strongly to a specific class, which overlooks contextual information. Moreover, objects often exhibit dramatic variations, such as changes in scale and other transformations, which make them hard to detect. In this paper, we propose the weakly supervised group mask network (WSGMN), which has two distinctive properties: (i) it exploits the relations among regions to generate community instances, which contain context information and are robust to object variations; and (ii) it generates a mask for each label group and uses these masks to dynamically select the feature information of the most useful community instances for recognizing specific objects. Extensive experiments on several benchmark datasets demonstrate the effectiveness of WSGMN on the task of weakly supervised object detection.
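To make the baseline concrete, the following is a minimal sketch of the max-selection step that the abstract attributes to existing MIL-based approaches. It is not WSGMN itself. The function name `mil_max_pool`, the toy scores, and the two-class setup are assumptions made purely for illustration.

```python
from typing import List, Tuple


def mil_max_pool(instance_scores: List[List[float]]) -> List[Tuple[float, int]]:
    """Return, per class, (image-level score, index of the selected instance).

    instance_scores[i][c] is the score of instance (region) i for class c.
    Illustrative baseline only: one instance per class decides the image score.
    """
    if not instance_scores:
        raise ValueError("a bag must contain at least one instance")
    num_classes = len(instance_scores[0])
    results = []
    for c in range(num_classes):
        # Collect the scores of every instance in the bag for class c.
        column = [row[c] for row in instance_scores]
        # Select the single instance that responds most strongly to class c.
        best = max(range(len(column)), key=column.__getitem__)
        results.append((column[best], best))
    return results


# Hypothetical bag: 3 instances (rows) scored for 2 classes (columns).
bag = [
    [0.10, 0.70],
    [0.85, 0.20],
    [0.30, 0.65],
]
image_scores = mil_max_pool(bag)
```

In this toy bag, each class score is decided by one instance alone, so the scores of neighbouring regions never enter the prediction. This is the loss of contextual information that the abstract points out.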
Notes
In our experiments, we set \(Z = 10\) for the PASCAL VOC datasets, \(Z = 24\) for MS-COCO, and \(Z = 26\) for the ImageNet detection dataset.
Additional information
Communicated by Antonio Torralba.
The research was supported in part by the National Key Research and Development Program of China under Grant No. 2018YFB1004500, the National Natural Science Foundation of China under Grant Nos. 61772426, 61672419, 61672418, 61532004, 61502377, 61532015, 61721002, the Joint Funds of the National Natural Science Foundation of China under Grant No. U1811262, the Innovation Research Team of the Ministry of Education under Grant No. IRT_17R86, the Fundamental Research Funds for the Central Universities under Grant No. D5000200146, and the China Postdoctoral Science Foundation under Grant No. 2020M673487.
Cite this article
Song, L., Liu, J., Sun, M. et al. Weakly Supervised Group Mask Network for Object Detection. Int J Comput Vis 129, 681–702 (2021). https://doi.org/10.1007/s11263-020-01397-w
DOI: https://doi.org/10.1007/s11263-020-01397-w