Abstract:
The performance of an audio-visual sound source separation system is determined by its ability to separate audio sources given the images of the sources and the audio mixture. The goal of this study is to investigate the ability to learn the mapping between the sounds and the images of instruments in the self-supervised mix-and-separate training paradigm used by state-of-the-art audio-visual sound source separation methods. Theoretical and empirical analyses illustrate that self-supervised mix-and-separate training does not automatically learn the 1-to-1 correspondence between visual and audio signals, leading to low audio-visual object classification accuracy. Based on this analysis, a weakly-supervised method called Object-Prior is proposed and evaluated on two audio-visual datasets. The experimental results show that the Object-Prior method outperforms state-of-the-art baselines in the audio-visual sound source separation task. It is also more robust against unsynchronized data, where the frame and the audio do not come from the same video, and recognizes musical instruments from their sound with higher accuracy. This indicates that learning the 1-to-1 correspondence between the visual and audio features of an instrument improves the effectiveness of audio-visual sound source separation.
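The mix-and-separate paradigm mentioned above can be illustrated with a minimal sketch: two solo recordings are summed into an artificial mixture, and each original signal becomes the separation target conditioned on its own video frame. The function name and the tuple layout below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def mix_and_separate_pairs(audio_a, audio_b):
    """Build self-supervised training examples (illustrative sketch):
    sum two solo recordings into a mixture; each original recording
    is the separation target when conditioned on its own frame.
    Returns (mixture, visual_index, target) tuples."""
    mixture = audio_a + audio_b
    return [(mixture, 0, audio_a), (mixture, 1, audio_b)]

rng = np.random.default_rng(0)
a = rng.standard_normal(16)  # stand-in for instrument A's audio
b = rng.standard_normal(16)  # stand-in for instrument B's audio
pairs = mix_and_separate_pairs(a, b)
# Removing one target from the mixture recovers the other source,
# which is what makes the supervision signal "free".
assert np.allclose(pairs[0][0] - pairs[0][2], b)
assert np.allclose(pairs[1][0] - pairs[1][2], a)
```

Note that nothing in this objective forces the model to associate each frame with the matching audio source, which is the 1-to-1 correspondence gap the abstract identifies.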
Date of Conference: 10-15 January 2021
Date Added to IEEE Xplore: 05 May 2021
Print on Demand (PoD) ISSN: 1051-4651