
Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization

Conference paper in Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Weakly Supervised Object Localization (WSOL), which aims to localize objects using only image-level labels, has attracted much attention because of its low annotation cost in real applications. Current studies focus on the Class Activation Map (CAM) of CNNs and the self-attention map of transformers to identify object regions. However, neither CAM nor self-attention maps can learn pixel-level fine-grained information about the foreground objects, which hinders further progress in WSOL. To address this problem, we are the first to leverage the zero-shot generalization and fine-grained segmentation capabilities of the Segment Anything Model (SAM) to boost the activation of integral object regions. Further, to alleviate the semantic ambiguity that arises with single-point prompts to SAM, we propose an innovative mask-prompt-to-SAM (Pro2SAM) network with grid points for the WSOL task. First, we devise a Global Token Transformer (GTFormer) to generate a coarse-grained foreground map as a flexible mask prompt, where GTFormer jointly embeds patch tokens and novel global tokens to learn foreground semantics. Second, we feed grid points as dense prompts into SAM to maximize the probability of covering the foreground, which avoids missing objects as can happen with a single point/box prompt. Finally, we propose a pixel-level similarity metric to perform mask matching between the mask prompt and SAM's candidate masks, where the mask with the highest score is taken as the final localization map. Experiments show that the proposed Pro2SAM achieves state-of-the-art performance on both CUB-200-2011 and ILSVRC, with 84.03% and 66.85% Top-1 Loc, respectively.
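The mask-selection step described above can be illustrated with a minimal sketch (not the authors' released code): SAM's automatic mask generator is driven by a regular grid of point prompts, and the candidate mask most similar to the coarse foreground map from GTFormer is kept as the localization map. The function name select_mask, the checkpoint path, and the use of plain IoU as a stand-in for the paper's pixel-level similarity metric are illustrative assumptions.

```python
# A minimal sketch (not the authors' implementation) of grid-prompted mask selection:
# SAM is prompted with a regular grid of points, and the candidate mask most similar
# to the coarse foreground map is kept. Plain IoU is used here as a stand-in for the
# paper's pixel-level similarity metric.
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator


def select_mask(image, foreground_map, checkpoint="sam_vit_h.pth", points_per_side=32):
    """image: HxWx3 uint8 array; foreground_map: HxW array in [0, 1] from the mask-prompt branch."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    generator = SamAutomaticMaskGenerator(sam, points_per_side=points_per_side)
    candidates = generator.generate(image)       # grid-point prompts -> candidate masks

    prompt = foreground_map > 0.5                # binarize the coarse mask prompt
    best_mask, best_score = None, -1.0
    for cand in candidates:
        seg = cand["segmentation"]               # boolean HxW mask from SAM
        inter = np.logical_and(seg, prompt).sum()
        union = np.logical_or(seg, prompt).sum() + 1e-6
        score = inter / union                    # IoU as a proxy similarity score
        if score > best_score:
            best_mask, best_score = seg, score
    return best_mask, best_score                 # highest-scoring mask = localization map
```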



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 62372348, Grant 62441601, Grant U22A2096, and Grant 62221005; in part by the Key Research and Development Program of Shaanxi under Grant 2024GX-ZDCYL-02-10; in part by the Shaanxi Outstanding Youth Science Fund Project under Grant 2023-JC-JQ-53; in part by the Innovation Collaboration Special Project of the Science, Technology and Innovation Bureau of Shenzhen Municipality under Project CJGJZD20210408092603008; and in part by the Fundamental Research Funds for the Central Universities under Grant QTZX24080 and Grant QTZX23042.

Author information

Corresponding author: Songsong Duan.

Electronic supplementary material

Supplementary material 1 (pdf 835 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yang, X., Duan, S., Wang, N., Gao, X. (2025). Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15127. Springer, Cham. https://doi.org/10.1007/978-3-031-72890-7_24


  • DOI: https://doi.org/10.1007/978-3-031-72890-7_24


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72889-1

  • Online ISBN: 978-3-031-72890-7

