ABSTRACT
The purpose of single-channel target speaker extraction is to mimic human selective auditory attention by extracting the voice of a target speaker from a multi-speaker mixture. For this scenario, we propose a time-domain target speaker extraction model. It transforms the mixed speech directly into embedding coefficients, so the speech signal does not need to be decomposed into a magnitude spectrum and a phase spectrum. The network consists of four components: a speaker encoder, a speech encoder, a speaker extractor, and a speech decoder. Specifically, the speech encoder transforms the mixed speech into embedding coefficients, while the speaker encoder learns a speaker embedding that represents the target speaker. The speaker extractor takes the embedding coefficients and the target speaker's embedding as input and estimates a receptive mask. Finally, the speech decoder reconstructs the target speaker's speech from the masked embedding coefficients. Experimental results show that, under open evaluation conditions, this method outperforms the best baseline by 45.6% and 47.5% in terms of signal-to-distortion ratio (SDR) and scale-invariant signal-to-distortion ratio (SI-SDR), respectively.
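The encoder–mask–decoder pipeline described above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation: the learned basis, the gating matrix `W`, the frame length, and the speaker embedding are all random stand-ins, and the "extractor" is reduced to a single sigmoid gate conditioned on the speaker embedding. It only shows the data flow: mixed waveform → embedding coefficients → speaker-conditioned mask → masked coefficients → overlap-add reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameters (not taken from the paper)
L = 40        # frame length in samples
N = 64        # number of basis filters / embedding dimension
hop = L // 2  # 50% frame overlap

def encode(x, basis):
    """Frame the waveform and project each frame onto the basis,
    yielding a (num_frames, N) matrix of embedding coefficients."""
    num_frames = (len(x) - L) // hop + 1
    frames = np.stack([x[i * hop : i * hop + L] for i in range(num_frames)])
    return frames @ basis.T                     # (num_frames, N)

def estimate_mask(mix_coeffs, spk_emb, W):
    """Toy extractor: gate the mixture coefficients with a projection
    of the target speaker embedding, squashed to (0, 1) by a sigmoid."""
    logits = mix_coeffs * (W @ spk_emb)         # broadcast per basis channel
    return 1.0 / (1.0 + np.exp(-logits))

def decode(coeffs, basis, out_len):
    """Overlap-add reconstruction from (masked) embedding coefficients."""
    frames = coeffs @ basis                     # (num_frames, L)
    x = np.zeros(out_len)
    for i, f in enumerate(frames):
        x[i * hop : i * hop + L] += f
    return x

# Random stand-ins for what would be learned parameters
basis   = rng.standard_normal((N, L)) / np.sqrt(L)
W       = rng.standard_normal((N, N)) / np.sqrt(N)
spk_emb = rng.standard_normal(N)                # from the speaker encoder

mixture = rng.standard_normal(16000)            # 1 s of "mixed speech" at 16 kHz
coeffs  = encode(mixture, basis)                # speech encoder
mask    = estimate_mask(coeffs, spk_emb, W)     # speaker extractor
target  = decode(mask * coeffs, basis, len(mixture))  # speech decoder
```

In a trained model the mask, applied elementwise to the embedding coefficients, suppresses the interfering speakers before the decoder inverts the representation; operating on these coefficients rather than on a spectrogram is what lets the model avoid separate magnitude and phase processing.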
Index Terms
- Single Channel Target Speaker Extraction Based on Deep Learning