ABSTRACT
The purpose of single-channel target speaker extraction is to mimic human selective auditory attention by extracting the voice of a target speaker from a multi-speaker mixture. For this scenario, we propose a time-domain target speaker extraction model. It transforms the mixed speech directly into embedding coefficients, so the speech signal does not need to be decomposed into a magnitude spectrum and a phase spectrum. The network consists of four components: a speaker encoder, a speech encoder, a speaker extractor, and a speech decoder. Specifically, the speech encoder transforms the mixed speech into embedding coefficients, while the speaker encoder learns a speaker embedding that represents the target speaker. The speaker extractor takes the embedding coefficients and the target speaker's embedding as input and estimates a receptive mask. Finally, the speech decoder reconstructs the target speaker's speech from the masked embedding coefficients. Experimental results show that, under open evaluation conditions, this method outperforms the best baseline by 45.6% and 47.5% in terms of signal-to-distortion ratio (SDR) and scale-invariant signal-to-distortion ratio (SI-SDR), respectively.
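The encoder–mask–decoder pipeline described above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation: the learned basis, the gating matrix `W`, the frame length, and the speaker embedding are all random stand-ins, and the "extractor" is reduced to a single sigmoid gate conditioned on the speaker embedding. It only shows the data flow: mixed waveform → embedding coefficients → speaker-conditioned mask → masked coefficients → overlap-add reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameters (not taken from the paper)
L = 40        # frame length in samples
N = 64        # number of basis filters / embedding dimension
hop = L // 2  # 50% frame overlap

def encode(x, basis):
    """Frame the waveform and project each frame onto the basis,
    yielding a (num_frames, N) matrix of embedding coefficients."""
    num_frames = (len(x) - L) // hop + 1
    frames = np.stack([x[i * hop : i * hop + L] for i in range(num_frames)])
    return frames @ basis.T                     # (num_frames, N)

def estimate_mask(mix_coeffs, spk_emb, W):
    """Toy extractor: gate the mixture coefficients with a projection
    of the target speaker embedding, squashed to (0, 1) by a sigmoid."""
    logits = mix_coeffs * (W @ spk_emb)         # broadcast per basis channel
    return 1.0 / (1.0 + np.exp(-logits))

def decode(coeffs, basis, out_len):
    """Overlap-add reconstruction from (masked) embedding coefficients."""
    frames = coeffs @ basis                     # (num_frames, L)
    x = np.zeros(out_len)
    for i, f in enumerate(frames):
        x[i * hop : i * hop + L] += f
    return x

# Random stand-ins for what would be learned parameters
basis   = rng.standard_normal((N, L)) / np.sqrt(L)
W       = rng.standard_normal((N, N)) / np.sqrt(N)
spk_emb = rng.standard_normal(N)                # from the speaker encoder

mixture = rng.standard_normal(16000)            # 1 s of "mixed speech" at 16 kHz
coeffs  = encode(mixture, basis)                # speech encoder
mask    = estimate_mask(coeffs, spk_emb, W)     # speaker extractor
target  = decode(mask * coeffs, basis, len(mixture))  # speech decoder
```

In a trained model the mask, applied elementwise to the embedding coefficients, suppresses the interfering speakers before the decoder inverts the representation; operating on these coefficients rather than on a spectrogram is what lets the model avoid separate magnitude and phase processing.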
Index Terms
- Single Channel Target Speaker Extraction Based on Deep Learning