Abstract
Speech dialogue systems are now widely used in many fields, allowing users to interact and communicate with a system through natural language. In practical settings, however, real dialogue scenes contain third-person background speech and background noise, which severely degrade the intelligibility of the speech signal and reduce speech recognition performance. To tackle this, we exploit a speech separation method that separates the target speech from complex multi-talker mixtures. We propose a multi-task attention mechanism and adopt TFCN as our audio feature extraction module. Following the multi-task approach, we jointly train the model with an SI-SDR loss and a cross-entropy speaker classification loss, and then apply the attention mechanism to further suppress the background vocals in the mixed speech. We evaluate our results not only on the distortion metrics SI-SDR and SDR but also with a speech recognition system. To train the model and demonstrate its effectiveness, we build a background-vocal-removal dataset on top of a public dataset. Experimental results empirically show that our model significantly improves the performance of the speech separation model.
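To make the joint training objective concrete, the following is a minimal PyTorch sketch, assuming waveform tensors of shape (batch, samples) and a speaker classification head that produces logits; the function names and the `weight` factor balancing the two terms are hypothetical illustrations, not values reported by the paper.

import torch
import torch.nn.functional as F

def si_sdr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SI-SDR between estimated and reference waveforms, shape (batch, samples)."""
    # Remove the mean so the measure is invariant to any DC offset.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to obtain the scale-invariant reference.
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    s_target = dot * target / (torch.sum(target ** 2, dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    si_sdr = 10 * torch.log10(
        torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps) + eps
    )
    # Return the negative value so that minimising the loss maximises SI-SDR.
    return -si_sdr.mean()

def joint_loss(estimate, target, speaker_logits, speaker_ids, weight: float = 0.5):
    """Multi-task objective: separation quality plus speaker classification (hypothetical weighting)."""
    return si_sdr_loss(estimate, target) + weight * F.cross_entropy(speaker_logits, speaker_ids)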
Funding
This work was supported by the National Natural Science Foundation of China (Grants 61876208 and 61873094).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Hao, Y., Wu, J., Huang, X. et al. Speaker extraction network with attention mechanism for speech dialogue system. SOCA 16, 111–119 (2022). https://doi.org/10.1007/s11761-022-00340-w