Abstract
Extracting discriminative speaker-specific representations from speech signals and transforming them into fixed-length vectors are key steps in speaker identification and verification systems. In this study, we propose a latent discriminative representation learning method for speaker recognition, in which the learned representations are not only discriminative but also capture the relevance among utterances. Specifically, we introduce an additional speaker embedding lookup table to exploit the relevance between different utterances from the same speaker. Moreover, a reconstruction constraint that learns a linear mapping matrix is introduced to make the representations discriminative. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods on the Apollo dataset used in the Fearless Steps Challenge at INTERSPEECH 2019 and on the TIMIT dataset.
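The abstract's two ingredients can be illustrated with a minimal NumPy sketch. This is our own toy construction, not the paper's actual model: a speaker embedding lookup table holds one learned vector per speaker, shared by all of that speaker's utterances (modeling their relevance), and a linear mapping matrix is trained with a reconstruction-style constraint that pulls each mapped utterance toward its speaker's table entry. All dimensions, learning rates, and the synthetic data below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 speakers, 20 utterances each, 40-dim front-end features.
n_speakers, n_utts, feat_dim, emb_dim = 3, 20, 40, 16
labels = np.repeat(np.arange(n_speakers), n_utts)
features = rng.normal(size=(labels.size, feat_dim))
# Give each speaker a distinct offset so the toy data is actually separable.
features += np.repeat(rng.normal(scale=3.0, size=(n_speakers, feat_dim)),
                      n_utts, axis=0)

# Speaker embedding lookup table: one learned vector per speaker, shared by
# all of that speaker's utterances (this is what ties them together).
table = rng.normal(scale=0.1, size=(n_speakers, emb_dim))
# Linear mapping matrix from input features to the latent space.
W = rng.normal(scale=0.1, size=(emb_dim, feat_dim))

lr_w, lr_table = 5e-4, 0.2
for _ in range(1000):
    z = features @ W.T            # latent representations of all utterances
    diff = z - table[labels]      # residual against each speaker's table entry
    # Gradient steps on 0.5 * mean ||W x_i - e_{s(i)}||^2:
    W -= lr_w * diff.T @ features / labels.size
    np.add.at(table, labels, lr_table * diff / n_utts)

# Closed-set identification: assign each utterance to its nearest table entry.
z = features @ W.T
pred = np.argmin(((z[:, None, :] - table[None, :, :]) ** 2).sum(-1), axis=1)
accuracy = (pred == labels).mean()
print(f"closed-set identification accuracy: {accuracy:.2f}")
```

On this easily separable toy data the mapped utterances cluster around their speaker's table entry, so nearest-entry identification is near perfect; the point of the sketch is only to show how the shared lookup table couples utterances of the same speaker while the mapping matrix is fit to reconstruct those entries.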
Change history
22 May 2021
An Erratum to this paper has been published: https://doi.org/10.1631/FITEE.19e0690
Author information
Contributions
Duolin HUANG and Qirong MAO designed the research. Duolin HUANG processed the data. Duolin HUANG and Qirong MAO drafted the manuscript. Zhongchen MA, Zhishen ZHENG, Sidheswar ROUTRAY, and Elias-Nii-Noi OCQUAYE helped organize the manuscript. Duolin HUANG and Qirong MAO revised and finalized the paper.
Ethics declarations
Duolin HUANG, Qirong MAO, Zhongchen MA, Zhishen ZHENG, Sidheswar ROUTRAY, and Elias-Nii-Noi OCQUAYE declare that they have no conflict of interest.
Additional information
Project supported by the National Natural Science Foundation of China (Nos. U1836220 and 61672267), the Qing Lan Talent Program of Jiangsu Province, China, and the Jiangsu Province Key Research and Development Plan (Industry Foresight and Key Core Technology) (No. BE2020036)
Cite this article
Huang, D., Mao, Q., Ma, Z. et al. Latent discriminative representation learning for speaker recognition. Front Inform Technol Electron Eng 22, 697–708 (2021). https://doi.org/10.1631/FITEE.1900690
Key words
- Speaker recognition
- Latent discriminative representation learning
- Speaker embedding lookup table
- Linear mapping matrix