Abstract
Far-field speaker verification is challenging because of the interference caused by varying distances between the speaker and the recorder. In this paper, a distance discriminator, which determines whether two utterances were recorded at the same distance, is used as an auxiliary task to learn distance discrepancy information. Two identical copies of this auxiliary task are used: one is added before the speaker embedding layer to learn distance discrepancy information via multi-task learning, and the other is added after that layer to suppress the learned discrepancy via a gradient reversal layer. In addition, to avoid conflicts among the optimization directions of the tasks, the loss weight of each task is updated dynamically during training. Experiments on AISHELL Wake-up show relative reductions in equal error rate (EER) of 7% on far-far speaker verification and 10.3% on near-far speaker verification, compared with the single-task model, demonstrating the effectiveness of the proposed method.
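The gradient reversal idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' network: it assumes a hypothetical one-parameter encoder (weight `w`), a hypothetical one-parameter distance discriminator (weight `v`), a squared-error discriminator loss, and a reversal strength `lam`, all chosen only so the gradients can be written out by hand. The layer is the identity in the forward pass and multiplies the incoming gradient by `-lam` in the backward pass, so the discriminator is still trained to lower its loss while the encoder is pushed to raise it.

```python
# Minimal scalar sketch of a gradient reversal layer (GRL).
# All names (w, v, lam) and the squared-error loss are illustrative
# assumptions, not the configuration used in the paper.

def grl_forward(h):
    """Forward pass: the GRL is the identity function."""
    return h


def grl_backward(grad_output, lam):
    """Backward pass: flip the sign of the gradient and scale it by lam."""
    return -lam * grad_output


def discriminator_grads(w, v, x, y, lam):
    """Return (loss, grad_w, grad_v) with a GRL between encoder and discriminator."""
    h = w * x                    # encoder output (stand-in for the embedding)
    z = grl_forward(h)           # identity in the forward direction
    pred = v * z                 # discriminator output
    loss = (pred - y) ** 2       # squared-error discriminator loss

    dloss_dpred = 2.0 * (pred - y)
    grad_v = dloss_dpred * z     # discriminator weight gets the usual gradient
    grad_z = dloss_dpred * v     # gradient arriving at the GRL output
    grad_h = grl_backward(grad_z, lam)  # reversed before reaching the encoder
    grad_w = grad_h * x          # encoder gradient now points "uphill"
    return loss, grad_w, grad_v


loss, grad_w, grad_v = discriminator_grads(w=0.5, v=2.0, x=1.0, y=0.0, lam=1.0)
print(loss, grad_w, grad_v)
```

With these toy values, the encoder gradient has the opposite sign to the gradient it would receive without the reversal, while the discriminator gradient is unchanged. A gradient descent step therefore improves the discriminator while making the encoder output less informative about the distance label.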
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant 61573151 and Grant 61976095 and the Science and Technology Planning Project of Guangdong Province under Grant 2018B030323026.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, W. et al. (2021). Jointing Multi-task Learning and Gradient Reversal Layer for Far-Field Speaker Verification. In: Feng, J., Zhang, J., Liu, M., Fang, Y. (eds) Biometric Recognition. CCBR 2021. Lecture Notes in Computer Science, vol 12878. Springer, Cham. https://doi.org/10.1007/978-3-030-86608-2_49
DOI: https://doi.org/10.1007/978-3-030-86608-2_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86607-5
Online ISBN: 978-3-030-86608-2
eBook Packages: Computer Science (R0)