DOI: 10.1145/3372806.3372818

Multi-Task Learning Based End-to-End Speaker Recognition

Published: 21 January 2020

Abstract

Recently, there has been increasing interest in end-to-end speaker recognition systems that directly take the raw speech waveform as input, without hand-crafted features such as FBANK and MFCC. SincNet is a recently proposed convolutional neural network (CNN) architecture in which the filters of the first convolutional layer are constrained to be band-pass filters (sinc functions). Experiments show that SincNet achieves a significantly lower frame error rate (FER) than traditional CNNs and DNNs.
In this paper we demonstrate how to improve the performance of SincNet using multi-task learning (MTL). In the proposed SincNet architecture, a phoneme recognition task is employed as an auxiliary task alongside the main task of speaker recognition. The network uses sinc layers and convolutional layers as shared layers to improve its generalization, and the outputs of the shared layers are fed into two separate sets of fully connected layers for classification. Our experiments, conducted on the TIMIT corpus, show that the proposed SincNet-MTL architecture outperforms the standard SincNet architecture in both classification error rate (CER) and convergence speed.
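To make the shared-trunk, two-head idea concrete, the following is a minimal NumPy sketch, not the authors' implementation: fixed sinc band-pass filters stand in for SincNet's learnable cutoff frequencies, and two linear heads stand in for the task-specific fully connected stacks. All function names, band choices, and dimensions are illustrative assumptions.

```python
import numpy as np

def sinc_bandpass(f1, f2, length=251, fs=16000):
    """Band-pass FIR filter built as the difference of two windowed
    low-pass sinc filters, as in the SincNet first layer (here with
    fixed rather than learnable cutoffs f1 < f2, in Hz)."""
    n = np.arange(length) - (length - 1) / 2  # symmetric time axis
    def lowpass(fc):
        return 2 * (fc / fs) * np.sinc(2 * (fc / fs) * n)
    return (lowpass(f2) - lowpass(f1)) * np.hamming(length)

def forward(wave, bands, W_spk, W_phn):
    """Shared trunk: filter the raw waveform with each sinc band-pass
    filter and pool to one energy value per band. Two heads: the shared
    representation feeds both the speaker head and the phoneme head."""
    feats = [np.abs(np.convolve(wave, sinc_bandpass(f1, f2), mode="valid")).mean()
             for f1, f2 in bands]
    shared = np.array(feats)              # shared representation (one value per band)
    return shared @ W_spk, shared @ W_phn  # speaker logits, phoneme logits
```

A usage example with four hypothetical bands, a 10-speaker head, and a 5-phoneme head: `forward(wave, [(50, 400), (400, 1000), (1000, 3000), (3000, 7000)], W_spk, W_phn)` with `W_spk` of shape `(4, 10)` and `W_phn` of shape `(4, 5)`. In the real model both heads are trained jointly, so the shared layers receive gradients from both losses; that joint pressure is what MTL contributes.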


Cited By

  • (2024) Identity, Gender, Age, and Emotion Recognition from Speaker Voice with Multi-task Deep Networks for Cognitive Robotics. Cognitive Computation 16:5, 2713-2723. DOI: 10.1007/s12559-023-10241-5. Online publication date: 5 Feb 2024.
  • (2023) Knowledge distillation-enhanced multitask framework for recommendation. Information Sciences 630, 235-251. DOI: 10.1016/j.ins.2023.02.021. Online publication date: June 2023.


Published In
    SPML '19: Proceedings of the 2019 2nd International Conference on Signal Processing and Machine Learning
    November 2019
    135 pages
    ISBN:9781450372213
    DOI:10.1145/3372806
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    In-Cooperation

• Ritsumeikan University

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. convolutional neural networks
    2. multi-task learning
    3. raw samples
    4. speaker recognition

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SPML '19


    Article Metrics

• Downloads (last 12 months): 11
• Downloads (last 6 weeks): 0
Reflects downloads up to 3 March 2025.

