
Speaker extraction network with attention mechanism for speech dialogue system

  • Special Issue Paper
  • Published in Service Oriented Computing and Applications

Abstract

Speech dialogue systems are now widely used across many fields, letting users interact and communicate with a system through natural language. In practical situations, however, real dialogue scenes contain third-person background speech and background noise. This interference seriously degrades the intelligibility of the speech signal and lowers speech recognition performance. To tackle this, we exploit a speech separation method that separates the target speech from complex multi-speaker mixtures. We propose a multi-task attention mechanism and adopt TFCN as our audio feature extraction module. Following the multi-task approach, we jointly train the network with an SI-SDR loss and a cross-entropy speaker classification loss, and then use the attention mechanism to further suppress background vocals in the mixed speech. We evaluate our results not only on the distortion metrics SI-SDR and SDR, but also with a speech recognition system. To train the model and demonstrate its effectiveness, we build a background-vocal-removal data set on top of a publicly available corpus. Experimental results show that our model significantly improves the performance of the speech separation model.
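The joint objective sketched in the abstract pairs a scale-invariant signal-to-distortion ratio (SI-SDR) reconstruction term with a cross-entropy speaker classification term. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the weighting factor `gamma` and the function names are assumptions for illustration.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    # Project the estimate onto the target to find the optimal scale.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = alpha * target       # scaled target component
    e_noise = estimate - s_target   # residual distortion
    return 10 * np.log10(
        (np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps)
    )

def joint_loss(estimate, target, class_logits, speaker_id, gamma=0.5):
    """Multi-task loss: negative SI-SDR plus weighted speaker cross-entropy.

    `gamma` is a hypothetical trade-off weight between the two tasks.
    """
    # Log-softmax over the speaker logits, then the usual cross-entropy.
    log_probs = class_logits - np.log(np.sum(np.exp(class_logits)))
    ce = -log_probs[speaker_id]
    return -si_sdr(estimate, target) + gamma * ce
```

Because SI-SDR is scale-invariant, rescaling the estimate leaves the reconstruction term unchanged; only the match in signal shape matters, which is why it is a common training criterion for time-domain separation networks.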



Funding

This work was supported by the National Natural Science Foundation of China (Grants 61876208 and 61873094).

Author information


Corresponding authors

Correspondence to Fei Liu or Qingyao Wu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Hao, Y., Wu, J., Huang, X. et al. Speaker extraction network with attention mechanism for speech dialogue system. SOCA 16, 111–119 (2022). https://doi.org/10.1007/s11761-022-00340-w

