DOI:10.1145/3548636.3548644
research-article

Wav2sv: End-to-end Speaker Embeddings Learning from Raw Waveforms based on Metric Learning for Speaker Verification

Published: 23 August 2022

Abstract

Deep learning has greatly improved the performance of speaker recognition systems. However, most current work still relies on handcrafted features, and existing raw waveform-based systems fail to use multi-scale features and multi-level information efficiently. In addition, speaker embeddings trained for speaker identification are then used for speaker verification through similarity scoring, which creates a domain mismatch. To address these problems, we propose an end-to-end system called Wav2sv. It uses a stack of strided convolution layers as the feature encoder and SE-Res2Blocks with dense connections between the frame layers as the frame aggregator, and it obtains the speaker embedding with a metric learning objective. The system can therefore learn speaker embeddings suited to speaker verification directly from raw waveforms. Our results on VoxCeleb1 indicate that the proposed approach achieves an EER of 4.75%, which is 18% better than the Wav2spk baseline. Our work demonstrates the great potential of extracting speaker embeddings from raw waveforms.
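
For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how an equal error rate (EER) can be estimated from verification trials. It is illustrative only and is not the authors' code. It assumes trials are scored by cosine similarity between two speaker embeddings, which the abstract does not confirm; the function names `cosine_similarity` and `equal_error_rate` and the toy scores are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (assumed scoring rule)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping the observed scores as thresholds.

    scores: similarity score of each trial (higher means "same speaker").
    labels: 1 for a same-speaker (target) trial, 0 for a different-speaker trial.
    Returns the mean of the false acceptance and false rejection rates at the
    threshold where the two rates are closest.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score reaches the threshold.
        false_accepts = sum(
            1 for s, l in zip(scores, labels) if s >= threshold and l == 0
        )
        false_rejects = sum(
            1 for s, l in zip(scores, labels) if s < threshold and l == 1
        )
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list (made-up numbers, unrelated to the paper's results).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

A lower EER means fewer errors at the operating point where false acceptances and false rejections are balanced. Production toolkits usually interpolate between thresholds rather than picking the nearest observed score, so this sketch gives only a coarse estimate on small trial lists.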

      Published In

      ITCC '22: Proceedings of the 4th International Conference on Information Technology and Computer Communications
      June 2022
      138 pages
      ISBN:9781450396820
      DOI:10.1145/3548636
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      ITCC 2022
