Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

Wang, Yaoting; Sun, Peiwen; Li, Yuanchao; Zhang, Honggang; Hu, Di

doi:10.1007/978-3-031-72904-1_20

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15132)

Included in the following conference series:

European Conference on Computer Vision

Abstract

The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues. In this work, however, we observe that previous AVS methods rely heavily on detrimental segmentation preferences for audible objects rather than on precise audio guidance. We argue that the primary reason is that audio lacks robust semantics compared to vision, especially in multi-source sounding scenes, which results in weak audio guidance over the visual space. Motivated by the fact that the text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance with the semantics inherent in text. Our approach begins by obtaining scene descriptions through an off-the-shelf image captioner and prompting a frozen large language model to deduce potential sounding objects as text cues. Subsequently, we introduce a novel semantics-driven audio modeling module with a dynamic mask to integrate audio features with text cues, yielding representative sounding object features. These features not only encompass audio cues but also carry vivid semantics, providing clearer guidance in the visual space. Experimental results on AVS benchmarks validate that our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets. Project page: https://github.com/GeWu-Lab/Sounding-Object-Segmentation-Preference.
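
To make the idea of integrating audio features with text cues under a mask more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the text cues have already been embedded as fixed-length vectors, that the fusion is a single-head residual cross-attention in which the audio feature queries the cue embeddings, and that the mask simply marks which cue slots are valid for the current frame. The names `masked_softmax`, `fuse_audio_with_text`, and `cue_mask` are hypothetical and do not come from the paper or its repository.

```python
import math
from typing import List

Vector = List[float]


def masked_softmax(scores: List[float], mask: List[bool]) -> List[float]:
    """Softmax restricted to positions where mask is True.

    Masked positions receive weight 0. If every position is masked,
    all weights are 0 so the caller falls back to the raw audio feature.
    """
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        return [0.0] * len(scores)
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_audio_with_text(
    audio: Vector, text_cues: List[Vector], cue_mask: List[bool]
) -> Vector:
    """Hypothetical fusion step (an assumption, not the paper's module).

    The audio vector acts as the query and the text cue vectors act as
    keys and values. The attended text semantics are added back to the
    audio vector through a residual connection.
    """
    scale = math.sqrt(len(audio))
    scores = [sum(a * t for a, t in zip(audio, cue)) / scale for cue in text_cues]
    weights = masked_softmax(scores, cue_mask)
    attended = [
        sum(w * cue[i] for w, cue in zip(weights, text_cues))
        for i in range(len(audio))
    ]
    return [a + x for a, x in zip(audio, attended)]


if __name__ == "__main__":
    # Toy 2-D features: two valid cues and one padded slot that is masked out.
    audio_feature = [1.0, 0.0]
    cue_embeddings = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    valid = [True, True, False]
    print(fuse_audio_with_text(audio_feature, cue_embeddings, valid))
```

In this sketch the mask only handles a variable number of cues per frame; how the dynamic mask described in the abstract is actually computed is not stated here, so that part is left as an assumption.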

Y. Wang and P. Sun contributed equally.

Notes

1. To be clear, we define audible objects as objects capable of producing sound, while sounding objects are defined as objects that are currently producing sound.

Acknowledgements

This research was supported by the National Natural Science Foundation of China (No. 62106272) and the Public Computing Cloud, Renmin University of China.

Author information

Authors and Affiliations

Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Yaoting Wang & Di Hu
Beijing University of Posts and Telecommunications, Beijing, China
Peiwen Sun & Honggang Zhang
University of Edinburgh, Edinburgh, Scotland, UK
Yuanchao Li
Engineering Research Center of Next-Generation Search and Recommendation, Beijing, China
Di Hu

Corresponding author

Correspondence to Di Hu.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Hessen, Germany
Stefan Roth
Princeton University, Palo Alto, CA, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3092 KB)

Cite this paper

Wang, Y., Sun, P., Li, Y., Zhang, H., Hu, D. (2025). Can Textual Semantics Mitigate Sounding Object Segmentation Preference?. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15132. Springer, Cham. https://doi.org/10.1007/978-3-031-72904-1_20

DOI: https://doi.org/10.1007/978-3-031-72904-1_20
Published: 21 November 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72903-4
Online ISBN: 978-3-031-72904-1
eBook Packages: Computer Science, Computer Science (R0)
