Abstract
Weakly Supervised Object Localization (WSOL), which aims to localize objects using only image-level labels, has attracted much attention because of its low annotation cost in real applications. Current studies focus on the Class Activation Map (CAM) of CNNs and the self-attention map of transformers to identify object regions. However, neither CAM nor self-attention maps can learn pixel-level fine-grained information about the foreground objects, which hinders further progress in WSOL. To address this problem, we leverage the zero-shot generalization and fine-grained segmentation capabilities of the Segment Anything Model (SAM) to boost the activation of integral object regions. Further, to alleviate the semantic ambiguity that arises with single-point prompts to SAM, we propose an innovative mask prompt to SAM (Pro2SAM) network with grid points for the WSOL task. First, we devise a Global Token Transformer (GTFormer) to generate a coarse-grained foreground map as a flexible mask prompt, where GTFormer jointly embeds patch tokens and novel global tokens to learn foreground semantics. Second, we deliver grid points as dense prompts into SAM to maximize the probability of obtaining the foreground mask, which avoids missing objects as can happen with a single point/box prompt. Finally, we propose a pixel-level similarity metric to match the mask prompt against SAM's candidate masks, where the mask with the highest score is taken as the final localization map. Experiments show that the proposed Pro2SAM achieves state-of-the-art performance on both CUB-200-2011 and ILSVRC, with 84.03% and 66.85% Top-1 Loc, respectively.
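The abstract describes a mask-matching step in which SAM's grid-point masks are scored against the GTFormer foreground prompt and the highest-scoring mask becomes the localization map. The sketch below illustrates only that selection step; it is a minimal illustration rather than the paper's implementation, and it assumes the foreground map is an H×W probability array, the SAM outputs are boolean masks, and that IoU after thresholding stands in for the paper's pixel-level similarity metric (the function select_best_mask and its threshold are hypothetical names introduced here).

# Minimal sketch of the mask-selection step outlined in the abstract.
# Assumptions (not taken from the paper): the GTFormer foreground map is an
# [H, W] probability map, SAM has been queried with a grid of point prompts,
# and pixel-level similarity is approximated by IoU after thresholding.
import numpy as np

def select_best_mask(foreground_map, sam_masks, threshold=0.5):
    """Return the SAM mask most similar to the coarse foreground prompt."""
    prompt = foreground_map >= threshold          # binarize the GTFormer map
    best_score, best_mask = -1.0, None
    for mask in sam_masks:                        # candidate masks from grid-point prompts
        mask = mask.astype(bool)
        inter = np.logical_and(prompt, mask).sum()
        union = np.logical_or(prompt, mask).sum() + 1e-6
        score = inter / union                     # IoU as a stand-in similarity score
        if score > best_score:
            best_score, best_mask = score, mask
    return best_mask                              # used as the final localization map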
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 62372348, Grant 62441601, Grant U22A2096 and Grant 62221005; in part by the Key Research and Development Program of Shaanxi under Grant 2024GX-ZDCYL-02-10; in part by Shaanxi Outstanding Youth Science Fund Project under Grant 2023-JC-JQ-53; in part by the Innovation Collaboration Special Project of Science, Technology and Innovation Bureau of Shenzhen Municipality under Project CJGJZD20210408092603008; in part by the Fundamental Research Funds for the Central Universities under Grant QTZX24080 and Grant QTZX23042.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, X., Duan, S., Wang, N., Gao, X. (2025). Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15127. Springer, Cham. https://doi.org/10.1007/978-3-031-72890-7_24
DOI: https://doi.org/10.1007/978-3-031-72890-7_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72889-1
Online ISBN: 978-3-031-72890-7
eBook Packages: Computer Science; Computer Science (R0)