Abstract
In recent years, methods that extract speaker-embedding information directly from raw waveforms have received much attention, with good results achieved by the RawNet3 network. However, RawNet3 relies solely on convolutional neural networks (CNNs) to extract speaker features from the raw waveform, which limits its receptive field and prevents the model from learning speaker features with long-term dependencies. This paper proposes a novel speaker recognition model with global information modelling of raw waveforms, called GIMR-Net, which can extract more speaker features with long-term dependencies. The model uses a transformer structure to extract global information and combines it with a CNN structure that extracts local information, thereby modelling the global information of the raw waveform. Experiments show that the proposed GIMR-Net is effective and outperforms RawNet3 on the Free ST Chinese Mandarin Corpus dataset. Specifically, GIMR-Net achieves an equal error rate of 1.22%, a 12.9% improvement over RawNet3. Finally, experiments in babble and factory noise environments verify that the proposed model retains the noise robustness of RawNet3.
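To make the global-local idea in the abstract concrete, the following PyTorch sketch pairs a 1-D convolutional branch (local patterns) with a multi-head self-attention branch (long-term dependencies) over a sequence of frames. The class name, layer sizes, and additive fusion are illustrative assumptions, not the published GIMR-Net configuration.

# A minimal sketch, assuming a frame sequence produced by some
# raw-waveform front end. GlobalLocalBlock, its layer sizes, and the
# additive fusion are hypothetical, not the published GIMR-Net design.
import torch
import torch.nn as nn

class GlobalLocalBlock(nn.Module):
    def __init__(self, channels=256, n_heads=4):
        super().__init__()
        # Local branch: convolution over a short temporal window.
        self.local = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )
        # Global branch: self-attention over the whole sequence.
        self.attn = nn.MultiheadAttention(channels, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        # x: (batch, channels, time) frames from the front end.
        local = self.local(x)                 # local context
        seq = x.transpose(1, 2)               # (batch, time, channels)
        glob, _ = self.attn(seq, seq, seq)    # global context
        glob = self.norm(seq + glob).transpose(1, 2)
        return local + glob                   # fuse both views

frames = torch.randn(8, 256, 200)             # toy input: 200 frames
print(GlobalLocalBlock()(frames).shape)       # torch.Size([8, 256, 200])

Adding the two branches is just one simple way to merge local and global views; the paper describes combining the transformer and CNN structures within the network, and other fusion schemes (concatenation, gating) would fit the same outline.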








Data availability
The data related to this work will be made available on reasonable request.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61972324), the Natural Science Foundation of Sichuan of China under Grant 2022NSFSC0462, Sichuan Science and Technology Program (2023NSFSC1985, 2023YFG0046, 2022YFG0181) and Research Fund of Chengdu University of Information Technology (KYTZ202149, KYTD202212).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wu, Y., Dong, J., Fang, Z. et al. Speaker recognition with global information modelling of raw waveforms. J Membr Comput 6, 42–51 (2024). https://doi.org/10.1007/s41965-024-00135-2