Abstract
With the rapid development of computer technology, acquiring audio-visual signals in complex environments is no longer difficult, and combining visual information to assist speech separation shows excellent potential. However, the problem of separating the speech of multiple speakers in audio-visual scenes containing facial information has not been well solved. Because a speaker's lip movements are strongly correlated with the speech signal, this paper proposes a DCNN-U-Net audio-visual fusion speech separation model based on the dilated (atrous) convolutional neural network (DCNN) and U-Net. The model is trained on fused lip and audio signals so that it can better focus on the target speaker's audio signal, thereby achieving visually aided speech separation. Experiments were conducted on the AVSpeech dataset, and the separation performance was evaluated using the PESQ, STOI, and SDR metrics. The experimental results show that the DCNN-U-Net model achieves better audio-visual speech separation than the AV and DCNN-LSTM models.
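For context, PESQ, STOI, and SDR scores of the kind reported above can be computed with the openly available `pesq`, `pystoi`, and `mir_eval` Python packages. The minimal sketch below is illustrative only: the file names and sampling rate are assumptions, and it is not the authors' evaluation code.

```python
# Minimal sketch (not the authors' code) of computing PESQ, STOI, and SDR
# for a separated signal against the clean reference of the target speaker.
import numpy as np
import soundfile as sf
from pesq import pesq                               # ITU-T P.862 perceptual quality
from pystoi import stoi                             # short-time objective intelligibility
from mir_eval.separation import bss_eval_sources    # BSS-Eval SDR

FS = 16000  # assumed sampling rate; speech clips are commonly resampled to 16 kHz

clean, _ = sf.read("clean_target.wav")       # hypothetical reference file
separated, _ = sf.read("separated.wav")      # hypothetical model output
n = min(len(clean), len(separated))          # align lengths before scoring
clean, separated = clean[:n], separated[:n]

pesq_score = pesq(FS, clean, separated, "wb")               # wide-band PESQ
stoi_score = stoi(clean, separated, FS, extended=False)     # STOI in [0, 1]
sdr, _, _, _ = bss_eval_sources(clean[np.newaxis, :],
                                separated[np.newaxis, :])   # SDR in dB

print(f"PESQ={pesq_score:.2f}  STOI={stoi_score:.3f}  SDR={sdr[0]:.2f} dB")
```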







Availability of data and materials
All the data included in this study are available upon request by contacting the corresponding author.
Funding
This research was supported by the Key Project of the "Outstanding Young Teachers Basic Research Support Program" of Heilongjiang Province (No. YQJH2024064), the Natural Science Foundation of Heilongjiang Province (No. LH2020F033), the National Natural Science Youth Foundation of China (No. 11804068), and the Research Project of the Heilongjiang Province Health Commission (No. 20221111001069).
Author information
Authors and Affiliations
Contributions
Chaofeng Lan contributed to the conception of the study and contributed significantly to the analysis and manuscript preparation; Lei Zhang and Rui Guo made important contributions to adjusting the structure, revising the paper, and the English editing of this manuscript; Shunbo Wang performed the experiments and data analyses and wrote the original manuscript; Meng Zhang made important contributions to proofreading the English.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A
See Table 3.
Appendix B
The detailed data of the U-Net up-sampling and down-sampling blocks are shown in Tables 3, 4, and 5.
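The paper itself gives these blocks only as tables of layer parameters. The PyTorch sketch below merely illustrates the generic structure of a U-Net down-sampling block (strided convolution) and up-sampling block (transposed convolution with a skip connection); the channel counts, kernel sizes, and tensor shapes are hypothetical placeholders, not the values listed in Tables 3-5.

```python
# Generic U-Net encoder/decoder block sketch; all hyperparameters are
# illustrative assumptions, not taken from the paper's appendix tables.
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Conv -> BatchNorm -> LeakyReLU encoder block with stride-2 down-sampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        return self.block(x)

class UpBlock(nn.Module):
    """Transposed-conv decoder block that concatenates an encoder skip feature."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x, skip):
        x = self.act(self.bn(self.up(x)))
        return torch.cat([x, skip], dim=1)  # channel-wise skip connection

# One encoder step and the matching decoder step on a spectrogram-like input
x = torch.randn(1, 1, 256, 256)   # (batch, channels, freq, time) - hypothetical shape
down = DownBlock(1, 16)
up = UpBlock(16, 16)
feat = down(x)                    # -> (1, 16, 128, 128)
out = up(feat, x)                 # up-sampled back to 256x256, concatenated with input skip
print(feat.shape, out.shape)
```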
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lan, C., Guo, R., Zhang, L. et al. Research on DCNN-U-Net speech separation method based on Audio-Visual multimodal fusion. SIViP 19, 269 (2025). https://doi.org/10.1007/s11760-025-03836-y
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11760-025-03836-y