DOI: 10.1145/3581783.3612165
research-article

LocLoc: Low-level Cues and Local-area Guides for Weakly Supervised Object Localization

Published: 27 October 2023

Abstract

Weakly Supervised Object Localization (WSOL) aims to localize objects using only image-level labels while maintaining competitive classification performance. However, previous efforts have prioritized localization over classification accuracy and have focused on discriminative features while neglecting low-level information. We argue that low-level image representations, such as edges, color, texture, and motion, are crucial for accurate detection: such information yields more refined localization, which in turn can be used to improve classification accuracy. In this paper, we propose LocLoc (Low-level Cues and Local-area Guides), a unified framework that simultaneously improves localization and classification accuracy. It leverages low-level image cues to explore global and local representations for accurate localization and classification. Specifically, we introduce a GrabCut-Enhanced Generator (GEG) that learns global semantic representations for localization; it uses graph cuts to enhance low-level information on top of the long-range dependencies captured by the transformer. We further design a Local Feature Digging Module (LFDM) that uses low-level cues to guide the learning of local feature representations for accurate classification. Extensive experiments demonstrate the effectiveness of LocLoc, with 84.4% (↑5.2%) Top-1 Loc. and 85.8% Top-1 Cls. on CUB-200-2011, and 57.6% (↑1.5%) Top-1 Loc. and 78.6% Top-1 Cls. on ILSVRC 2012, indicating that our method achieves competitive performance, with a large margin over previous approaches. Code and models are available at https://github.com/Cliffia123/LocLoc.
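
The abstract reports Top-1 Loc. and Top-1 Cls. without defining them. The sketch below follows the convention commonly used in WSOL evaluation, which is assumed here rather than taken from this page: Top-1 Cls. counts an image as correct when the top predicted class matches the label, and Top-1 Loc. additionally requires the predicted box to overlap the ground-truth box with IoU of at least 0.5. The function names, the one-box-per-image setup, and the threshold default are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence, Tuple

# Axis-aligned box as (x_min, y_min, x_max, y_max); this layout is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def top1_metrics(
    pred_labels: Sequence[int],
    pred_boxes: Sequence[Box],
    gt_labels: Sequence[int],
    gt_boxes: Sequence[Box],
    iou_threshold: float = 0.5,  # assumed conventional threshold
) -> Tuple[float, float]:
    """Return (top1_cls, top1_loc) as fractions in [0, 1].

    Assumes one predicted box and one ground-truth box per image.
    """
    n = len(gt_labels)
    if n == 0:
        raise ValueError("need at least one image")
    cls_hits = 0
    loc_hits = 0
    for pl, pb, gl, gb in zip(pred_labels, pred_boxes, gt_labels, gt_boxes):
        cls_ok = pl == gl
        if cls_ok:
            cls_hits += 1
            # A localization hit needs the right class AND enough overlap.
            if iou(pb, gb) >= iou_threshold:
                loc_hits += 1
    return cls_hits / n, loc_hits / n


if __name__ == "__main__":
    preds = [3, 7, 1, 2]
    labels = [3, 7, 4, 2]
    p_boxes = [(0, 0, 10, 10)] * 4
    g_boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (0, 0, 10, 10), (0, 0, 10, 8)]
    print(top1_metrics(preds, p_boxes, labels, g_boxes))
```

Under this definition Top-1 Loc. can never exceed Top-1 Cls., which is consistent with the reported pairs (84.4% vs. 85.8% and 57.6% vs. 78.6%).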


Cited By

  • (2024) Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization. Computer Vision – ECCV 2024, 387-403. https://doi.org/10.1007/978-3-031-72890-7_24. Online publication date: 7-Dec-2024

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. low-level cues
    2. transformer
    3. weakly supervised object localization

    Qualifiers

    • Research-article

    Funding Sources

    • Key-Area Research and Development Program of Guangdong Province
    • The Major Program of Guangdong Basic and Applied Research
    • The Major Key Project of PCL

    Conference

    MM '23
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (Last 12 months): 74
    • Downloads (Last 6 weeks): 8
    Reflects downloads up to 05 Mar 2025
