Abstract
With the rapid development of computer technology, acquiring audio-visual signals in complex environments is no longer difficult, and combining visual information to assist speech separation shows excellent potential. However, the problem of separating the speech of multiple speakers in audio-visual scenes containing facial information has not been well solved. Because a speaker's lip movements are strongly correlated with the speech signal, this paper proposes a DCNN-U-Net audio-visual fusion speech separation model based on the dilated (atrous) convolutional neural network (DCNN) and U-Net. The model is trained on fused lip and audio signals so that it can better focus on the target speaker's audio signal, thereby achieving visually aided speech separation. Experiments were conducted on the AVSpeech dataset, and the separation performance was evaluated using the PESQ, STOI, and SDR metrics. The experimental results show that the DCNN-U-Net model achieves better audio-visual speech separation than the AV and DCNN-LSTM models.
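For context, PESQ, STOI, and SDR scores of the kind reported above can be computed with the openly available `pesq`, `pystoi`, and `mir_eval` Python packages. The minimal sketch below is illustrative only: the file names and sampling rate are assumptions, and it is not the authors' evaluation code.

```python
# Minimal sketch (not the authors' code) of computing PESQ, STOI, and SDR
# for a separated signal against the clean reference of the target speaker.
import numpy as np
import soundfile as sf
from pesq import pesq                               # ITU-T P.862 perceptual quality
from pystoi import stoi                             # short-time objective intelligibility
from mir_eval.separation import bss_eval_sources    # BSS-Eval SDR

FS = 16000  # assumed sampling rate; speech clips are commonly resampled to 16 kHz

clean, _ = sf.read("clean_target.wav")       # hypothetical reference file
separated, _ = sf.read("separated.wav")      # hypothetical model output
n = min(len(clean), len(separated))          # align lengths before scoring
clean, separated = clean[:n], separated[:n]

pesq_score = pesq(FS, clean, separated, "wb")               # wide-band PESQ
stoi_score = stoi(clean, separated, FS, extended=False)     # STOI in [0, 1]
sdr, _, _, _ = bss_eval_sources(clean[np.newaxis, :],
                                separated[np.newaxis, :])   # SDR in dB

print(f"PESQ={pesq_score:.2f}  STOI={stoi_score:.3f}  SDR={sdr[0]:.2f} dB")
```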







Availability of data and materials
All the data included in this study are available upon request by contacting the corresponding author.
Funding
This research was supported by the Key Project of the "Outstanding Young Teachers Basic Research Support Program" of Heilongjiang Province (No. YQJH2024064), the Natural Science Foundation of Heilongjiang Province (No. LH2020F033), the National Natural Science Youth Foundation of China (No. 11804068), and the Research Project of the Heilongjiang Province Health Commission (No. 20221111001069).
Author information
Authors and Affiliations
Contributions
Chaofeng Lan contributed to the conception of the study and contributed significantly to the analysis and manuscript preparation; Lei Zhang and Rui Guo made important contributions to adjusting the structure, revising the paper, and the English editing of this manuscript; Shunbo Wang performed the experiments and data analyses and wrote the original manuscript; Meng Zhang made important contributions to proofreading the English.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A
See Table 3.
Appendix B
The detailed data of the U-Net up-sampling and down-sampling blocks are shown in Tables 3, 4, and 5.
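The paper itself gives these blocks only as tables of layer parameters. The PyTorch sketch below merely illustrates the generic structure of a U-Net down-sampling block (strided convolution) and up-sampling block (transposed convolution with a skip connection); the channel counts, kernel sizes, and tensor shapes are hypothetical placeholders, not the values listed in Tables 3-5.

```python
# Generic U-Net encoder/decoder block sketch; all hyperparameters are
# illustrative assumptions, not taken from the paper's appendix tables.
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Conv -> BatchNorm -> LeakyReLU encoder block with stride-2 down-sampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        return self.block(x)

class UpBlock(nn.Module):
    """Transposed-conv decoder block that concatenates an encoder skip feature."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x, skip):
        x = self.act(self.bn(self.up(x)))
        return torch.cat([x, skip], dim=1)  # channel-wise skip connection

# One encoder step and the matching decoder step on a spectrogram-like input
x = torch.randn(1, 1, 256, 256)   # (batch, channels, freq, time) - hypothetical shape
down = DownBlock(1, 16)
up = UpBlock(16, 16)
feat = down(x)                    # -> (1, 16, 128, 128)
out = up(feat, x)                 # up-sampled back to 256x256, concatenated with input skip
print(feat.shape, out.shape)
```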
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lan, C., Guo, R., Zhang, L. et al. Research on DCNN-U-Net speech separation method based on Audio-Visual multimodal fusion. SIViP 19, 269 (2025). https://doi.org/10.1007/s11760-025-03836-y
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11760-025-03836-y