
Audio–Visual Segmentation

  • Conference paper
  • In: Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13697)

Abstract

We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos. Two settings are studied with this benchmark: 1) semi-supervised audio-visual segmentation with a single sound source and 2) fully-supervised audio-visual segmentation with multiple sound sources. To deal with the AVS problem, we propose a new method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage the audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench compare our approach to several existing methods from related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench.
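
To make the core idea concrete, here is a minimal sketch of how audio semantics can be injected into per-pixel visual features through cross-attention. This is an illustration under stated assumptions, not the authors' TPAVI implementation: the module name, feature shapes, and the residual connection are choices made for the example.

```python
# Minimal sketch (assumption, not the authors' code): pixel-wise
# audio-visual cross-attention. Visual features form a (B, C, H, W) map;
# audio is a sequence of T per-second embeddings of shape (B, T, C).
import torch
import torch.nn as nn


class PixelwiseAudioVisualFusion(nn.Module):
    """Lets every visual pixel attend over the temporal audio embeddings."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)  # visual queries
        self.to_k = nn.Linear(channels, channels)                 # audio keys
        self.to_v = nn.Linear(channels, channels)                 # audio values
        self.scale = channels ** -0.5

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        b, c, h, w = visual.shape
        q = self.to_q(visual).flatten(2).transpose(1, 2)               # (B, HW, C)
        k, v = self.to_k(audio), self.to_v(audio)                      # (B, T, C) each
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, -1)   # (B, HW, T)
        fused = (attn @ v).transpose(1, 2).reshape(b, c, h, w)         # back to a map
        return visual + fused                                          # audio-guided residual


if __name__ == "__main__":
    fusion = PixelwiseAudioVisualFusion(channels=256)
    vis = torch.randn(2, 256, 28, 28)  # visual feature map from a backbone
    aud = torch.randn(2, 5, 256)       # 5 one-second audio embeddings (VGGish-like)
    print(fusion(vis, aud).shape)      # torch.Size([2, 256, 28, 28])
```

A segmentation decoder would then turn the fused feature map into a per-pixel mask for each video frame, which is the output format the AVS task asks for.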

J. Zhou and J. Wang: Equal contribution. This work was done while Jinxing Zhou was an intern at SenseTime Research.

Notes

  1. F-score considers both the precision and recall: \( F_\beta = \frac{(1+\beta^2) \times \textsf{precision} \times \textsf{recall}}{\beta^2 \times \textsf{precision} + \textsf{recall}} \), where \(\beta^2\) is set to 0.3 in our experiments.
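
As a worked example of the formula above, the snippet below evaluates \(F_\beta\) with \(\beta^2 = 0.3\); it is a simple helper written for illustration, not code taken from the paper's repository.

```python
# Worked example of the F-score footnote, with beta^2 fixed to 0.3.
def f_score(precision: float, recall: float, beta_sq: float = 0.3) -> float:
    """F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both terms vanish
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


print(f_score(precision=0.8, recall=0.6))  # 1.3 * 0.48 / 0.84 ≈ 0.743
```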

Acknowledgement

The research of Jinxing Zhou, Dan Guo, and Meng Wang was supported by the National Key Research and Development Program of China (2018YFB0804205) and the National Natural Science Foundation of China (72188101, 61725203). Thanks to SenseTime Research for providing access to the GPUs used in the experiments. The article solely reflects the opinions and conclusions of its authors and not those of the funding agencies.

Author information

Correspondence to Meng Wang or Yiran Zhong.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3301 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Zhou, J. et al. (2022). Audio–Visual Segmentation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_22

  • DOI: https://doi.org/10.1007/978-3-031-19836-6_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19835-9

  • Online ISBN: 978-3-031-19836-6

  • eBook Packages: Computer Science, Computer Science (R0)
