DOI: 10.1145/3372806.3372818

Multi-Task Learning Based End-to-End Speaker Recognition

Published: 21 January 2020

Abstract

Recently, there has been increasing interest in end-to-end speaker recognition systems that directly take the raw speech waveform as input, without hand-crafted features such as FBANK and MFCC. SincNet is a recently proposed convolutional neural network (CNN) architecture in which the filters of the first convolutional layer are constrained to be band-pass filters (sinc functions). Experiments show that SincNet achieves a significantly lower frame error rate (FER) than traditional CNNs and DNNs.
In this paper we demonstrate how to improve the performance of SincNet using multi-task learning (MTL). In the proposed SincNet architecture, a phoneme recognition task is employed as an auxiliary task alongside the main task of speaker recognition. The network uses sinc layers and convolutional layers as shared layers to improve its generalization, and the outputs of the shared layers are fed into two separate sets of fully connected layers for classification. Our experiments, conducted on the TIMIT corpus, show that the proposed SincNet-MTL architecture outperforms the standard SincNet architecture in both classification error rate (CER) and convergence speed.
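To make the shared-trunk, two-head idea concrete, the following is a minimal NumPy sketch, not the authors' implementation: fixed sinc band-pass filters stand in for SincNet's learnable cutoff frequencies, and two linear heads stand in for the task-specific fully connected stacks. All function names, band choices, and dimensions are illustrative assumptions.

```python
import numpy as np

def sinc_bandpass(f1, f2, length=251, fs=16000):
    """Band-pass FIR filter built as the difference of two windowed
    low-pass sinc filters, as in the SincNet first layer (here with
    fixed rather than learnable cutoffs f1 < f2, in Hz)."""
    n = np.arange(length) - (length - 1) / 2  # symmetric time axis
    def lowpass(fc):
        return 2 * (fc / fs) * np.sinc(2 * (fc / fs) * n)
    return (lowpass(f2) - lowpass(f1)) * np.hamming(length)

def forward(wave, bands, W_spk, W_phn):
    """Shared trunk: filter the raw waveform with each sinc band-pass
    filter and pool to one energy value per band. Two heads: the shared
    representation feeds both the speaker head and the phoneme head."""
    feats = [np.abs(np.convolve(wave, sinc_bandpass(f1, f2), mode="valid")).mean()
             for f1, f2 in bands]
    shared = np.array(feats)              # shared representation (one value per band)
    return shared @ W_spk, shared @ W_phn  # speaker logits, phoneme logits
```

A usage example with four hypothetical bands, a 10-speaker head, and a 5-phoneme head: `forward(wave, [(50, 400), (400, 1000), (1000, 3000), (3000, 7000)], W_spk, W_phn)` with `W_spk` of shape `(4, 10)` and `W_phn` of shape `(4, 5)`. In the real model both heads are trained jointly, so the shared layers receive gradients from both losses; that joint pressure is what MTL contributes.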


Cited By

  • (2024) Identity, Gender, Age, and Emotion Recognition from Speaker Voice with Multi-task Deep Networks for Cognitive Robotics. Cognitive Computation 16:5, 2713-2723. DOI: 10.1007/s12559-023-10241-5. Online publication date: 5 Feb 2024.
  • (2023) Knowledge distillation-enhanced multitask framework for recommendation. Information Sciences 630, 235-251. DOI: 10.1016/j.ins.2023.02.021. Online publication date: June 2023.


Published In
    SPML '19: Proceedings of the 2019 2nd International Conference on Signal Processing and Machine Learning
    November 2019
    135 pages
    ISBN:9781450372213
    DOI:10.1145/3372806
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    In-Cooperation

• Ritsumeikan University

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. convolutional neural networks
    2. multi-task learning
    3. raw samples
    4. speaker recognition

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SPML '19


    Article Metrics

• Downloads (last 12 months): 11
• Downloads (last 6 weeks): 0
Reflects downloads up to 3 March 2025.

