DOI: 10.1145/3452940.3453037

Single Channel Target Speaker Extraction Based on Deep Learning

Published: 17 May 2021

Abstract

The purpose of single-channel target speaker extraction is to simulate human selective auditory attention by extracting the voice of the target speaker from a multi-speaker environment. For this scenario, we propose a time-domain target speaker extraction model. It transforms the mixed speech into embedding coefficients, so the speech signal does not need to be decomposed into a magnitude spectrum and a phase spectrum. The network consists of four components: a speech encoder, a speaker encoder, a speaker extractor, and a speech decoder. Specifically, the speech encoder transforms the mixed speech into embedding coefficients, and the speaker encoder learns a speaker embedding that represents the target speaker. The speaker extractor takes the embedding coefficients and the target speaker's embedding as input and estimates a receptive mask. Finally, the speech decoder reconstructs the target speaker's speech from the masked embedding coefficients. Experimental results show that, under open evaluation conditions, this method outperforms the best baseline by 45.6% in signal-to-distortion ratio (SDR) and 47.5% in scale-invariant signal-to-distortion ratio (SI-SDR).
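
To make the pipeline concrete, here is a minimal, self-contained PyTorch sketch of the four-component design the abstract describes, together with the SI-SDR metric used for evaluation. The layer sizes, module internals, and names (TargetSpeakerExtractor, n_filters, spk_dim) are illustrative assumptions, not the authors' actual configuration.

```python
# A minimal sketch of the four-component architecture described in the
# abstract. All module internals and hyperparameters are assumptions.
import torch
import torch.nn as nn


class TargetSpeakerExtractor(nn.Module):
    def __init__(self, n_filters=256, kernel_size=20, spk_dim=128):
        super().__init__()
        stride = kernel_size // 2
        # Speech encoder: a 1-D convolution maps the raw waveform directly to
        # embedding coefficients, so no magnitude/phase decomposition is needed.
        self.speech_encoder = nn.Conv1d(1, n_filters, kernel_size,
                                        stride=stride, bias=False)
        # Speaker encoder: maps a reference utterance of the target speaker to
        # a fixed-size speaker embedding (mean-pooled over time).
        self.speaker_encoder = nn.Sequential(
            nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(n_filters, spk_dim),
        )
        # Speaker extractor: conditioned on the speaker embedding, estimates a
        # mask over the mixture's embedding coefficients. The paper's extractor
        # uses depthwise separable convolutions (per the author tags); two 1x1
        # convolutions stand in for that stack here.
        self.extractor = nn.Sequential(
            nn.Conv1d(n_filters + spk_dim, n_filters, 1),
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, 1),
            nn.Sigmoid(),
        )
        # Speech decoder: a transposed convolution reconstructs the target
        # speaker's waveform from the masked embedding coefficients.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size,
                                          stride=stride, bias=False)

    def forward(self, mixture, reference):
        # mixture, reference: (batch, 1, samples)
        w = self.speech_encoder(mixture)                 # (B, N, T)
        e = self.speaker_encoder(reference)              # (B, spk_dim)
        e = e.unsqueeze(-1).expand(-1, -1, w.size(-1))   # tile over time
        mask = self.extractor(torch.cat([w, e], dim=1))  # (B, N, T)
        return self.decoder(w * mask)                    # estimated target


def si_sdr(est, ref, eps=1e-8):
    # Scale-invariant SDR (Le Roux et al., 2019), the paper's second metric:
    # project the estimate onto the reference, then compare signal vs. residual.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    return 10 * torch.log10(proj.pow(2).sum(-1) / ((est - proj).pow(2).sum(-1) + eps))


# Example: extract a target from a 1-second, 8 kHz mixture given a reference.
model = TargetSpeakerExtractor()
mix, ref = torch.randn(2, 1, 8000), torch.randn(2, 1, 8000)
est = model(mix, ref)
print(est.shape)  # torch.Size([2, 1, 8000])
```

Operating on learned encoder coefficients rather than STFT frames is what lets the model sidestep phase estimation: the mask is applied in the learned domain and the decoder inverts it end to end.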

Published In

ICITEE '20: Proceedings of the 3rd International Conference on Information Technologies and Electrical Engineering
December 2020
687 pages
ISBN: 9781450388665
DOI: 10.1145/3452940

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Depthwise separable convolution
  2. Single channel
  3. Target speaker extraction
  4. Time domain

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICITEE2020
