
Research on DCNN-U-Net speech separation method based on Audio-Visual multimodal fusion

  • Original Paper
  • Published:
Signal, Image and Video Processing

Abstract

With the rapid development of computer technology, acquiring audio-visual signals in complex environments is no longer difficult, and combining visual information to assist speech separation shows excellent potential. However, the problem of separating the speech of multiple speakers in audio-visual scenes that contain facial information has not been well solved. Exploiting the strong correlation between a speaker's lip movements and the corresponding speech signal, this paper proposes a DCNN-U-Net speech separation model for audio-visual fusion, built on a dilated (atrous) convolutional neural network (DCNN) and U-Net. The model is trained on fused lip and audio signals so that it can better focus on the target speaker's speech, thereby achieving visually aided speech separation. Experiments were conducted on the AVSpeech dataset, and the separation quality was evaluated with the PESQ, STOI, and SDR metrics. The results show that the DCNN-U-Net model achieves better audio-visual speech separation than the AV and DCNN-LSTM models.
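The evaluation relies on three standard metrics: PESQ (perceptual quality), STOI (intelligibility), and SDR (signal-to-distortion ratio). As an illustration only, and not code from the paper, the following Python sketch shows how these scores are commonly computed with the `pesq`, `pystoi`, and `mir_eval` packages; the file names and the 16 kHz sampling rate are assumptions.

```python
# Illustrative only: score one separated utterance against its clean reference.
# Assumes 16 kHz mono WAV files; "clean.wav" and "separated.wav" are placeholders.
import numpy as np
import soundfile as sf
from pesq import pesq                               # ITU-T P.862, wide-band mode "wb"
from pystoi import stoi                             # short-time objective intelligibility
from mir_eval.separation import bss_eval_sources    # BSS-Eval SDR/SIR/SAR

fs = 16000
clean, _ = sf.read("clean.wav")       # reference signal of the target speaker
est, _ = sf.read("separated.wav")     # model output
n = min(len(clean), len(est))         # align lengths before scoring
clean, est = clean[:n], est[:n]

pesq_score = pesq(fs, clean, est, "wb")
stoi_score = stoi(clean, est, fs, extended=False)
sdr, sir, sar, _ = bss_eval_sources(clean[np.newaxis, :], est[np.newaxis, :])

print(f"PESQ={pesq_score:.2f}  STOI={stoi_score:.3f}  SDR={sdr[0]:.2f} dB")
```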


Availability of data and materials

All the data included in this study are available upon request by contacting the corresponding author.


Funding

This research was supported by the Key Project of the "Outstanding Young Teachers Basic Research Support Program" of Heilongjiang Province (No. YQJH2024064), the Natural Science Foundation of Heilongjiang Province (No. LH2020F033), the National Natural Science Youth Foundation of China (No. 11804068), and the Research Project of the Heilongjiang Province Health Commission (No. 20221111001069).

Author information


Contributions

Chaofeng Lan contributed to the conception of the study and contributed significantly to the analysis and manuscript preparation; Lei Zhang and Rui Guo made important contributions in adjusting the structure, revising the paper, and performing English editing and revisions of this manuscript; Shunbo Wang performed the experiments and the data analyses and wrote the original manuscript; Meng Zhang made important contributions in proofreading the English.

Corresponding authors

Correspondence to Lei Zhang or Meng Zhang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

See Table 3.

Table 3 DCNN parameter settings for each network layer
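
Table 3 itself is not reproduced here. Purely as a hedged illustration of the kind of dilated (atrous) convolution stack such a per-layer table describes, the PyTorch sketch below builds a small block; the dilation rates (1, 2, 4, 8) and channel widths are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class DilatedConvStack(nn.Module):
    """Illustrative stack of dilated (atrous) 2-D convolutions over a spectrogram.

    Dilation rates and channel counts are placeholders, not the values from Table 3.
    """
    def __init__(self, in_ch: int = 1, ch: int = 64, dilations=(1, 2, 4, 8)):
        super().__init__()
        layers = []
        for i, d in enumerate(dilations):
            layers += [
                # padding equal to the dilation keeps the time-frequency resolution
                nn.Conv2d(in_ch if i == 0 else ch, ch, kernel_size=3,
                          padding=d, dilation=d),
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
            ]
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, freq_bins, time_frames) magnitude spectrogram
        return self.net(x)

features = DilatedConvStack()(torch.randn(2, 1, 257, 100))  # -> (2, 64, 257, 100)
```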

Appendix B

The detailed data of the U-Net up-sampling and down-sampling blocks are shown in Tables 4 and 5.

Table 4 U-Net up-sampling block detailed data

Table 5 U-Net down-sampling block detailed data
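
Tables 4 and 5 hold the exact layer data. As a generic illustration only, the sketch below shows what a U-Net down-sampling/up-sampling block pair with a skip connection typically looks like; the channel counts, kernel sizes, and stride-2 layers are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Encoder block: a strided convolution halves the feature map (placeholder sizes)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x):
        return self.conv(x)

class UpBlock(nn.Module):
    """Decoder block: a transposed convolution doubles the map, then fuses the skip."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        return self.fuse(torch.cat([x, skip], dim=1))

# Toy round trip: a 64x64 map goes down to 32x32 and back up with a skip connection.
down, up = DownBlock(32, 64), UpBlock(64, 32, 32)
x = torch.randn(1, 32, 64, 64)
y = up(down(x), x)      # -> (1, 32, 64, 64)
```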

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lan, C., Guo, R., Zhang, L. et al. Research on DCNN-U-Net speech separation method based on Audio-Visual multimodal fusion. SIViP 19, 269 (2025). https://doi.org/10.1007/s11760-025-03836-y

