DOI: 10.1145/3674225.3674394
research-article

Semi-AVS: Segmenting the Sounding Objects via Semi-supervised Learning

Published: 31 July 2024

Abstract

Audio-visual segmentation (AVS) is a challenging task that requires accurately segmenting the sounding objects in visual frames. The existing method introduces audio semantics and a regularization loss to guide visual segmentation. However, it overlooks the one-shot annotation, full-shot prediction setting of the single-source dataset (i.e., only the ground truth of the first sampled frame in a video is given). In this work, we propose a semi-supervised audio-visual segmentation framework, called Semi-AVS, that propagates the mask of the first frame to the later frames, since all frames in a video share the same sound source. Furthermore, an audio-visual interaction module is designed both to locate the object in the visual frame via audio and to make the audio perceive the visual context. With semi-supervised learning, our method outperforms the state-of-the-art models on the single-source AVS task. The audio-visual interaction module is also validated on the fully supervised multi-source and semantic AVS tasks.
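
The idea of propagating a first-frame mask to later frames can be illustrated with a toy sketch. The abstract does not say how Semi-AVS performs the propagation, so the nearest-neighbour feature matching below is only a stand-in technique, not the paper's method. The function names (`cosine`, `propagate_mask`), the flattened per-pixel feature lists, and the toy values are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propagate_mask(first_feats, first_mask, later_feats):
    """Give each later-frame pixel the label of its most similar first-frame pixel.

    first_feats: per-pixel feature vectors of the annotated first frame
    first_mask:  ground-truth labels (1 = sounding object, 0 = background)
    later_feats: per-pixel feature vectors of an unannotated later frame
    Hypothetical stand-in for mask propagation, not the Semi-AVS procedure.
    """
    pseudo_mask = []
    for feat in later_feats:
        best = max(range(len(first_feats)),
                   key=lambda i: cosine(feat, first_feats[i]))
        pseudo_mask.append(first_mask[best])
    return pseudo_mask


# Toy data: two "object" pixels and two "background" pixels in the first frame.
first_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
first_mask = [1, 1, 0, 0]
later_feats = [[0.0, 2.0], [0.8, 0.2], [0.2, 0.7], [3.0, 0.1]]

pseudo_mask = propagate_mask(first_feats, first_mask, later_feats)
print(pseudo_mask)  # → [0, 1, 0, 1]
```

In a semi-supervised setup of this kind, such pseudo-masks would typically serve as training targets for the unannotated frames; whether Semi-AVS uses them this way is not stated in the abstract.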

    Published In

    PEAI '24: Proceedings of the 2024 International Conference on Power Electronics and Artificial Intelligence
    January 2024
    969 pages
    ISBN: 9798400716638
    DOI: 10.1145/3674225
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    PEAI 2024
