Wav2sv: End-to-end Speaker Embeddings Learning from Raw Waveforms based on Metric Learning for Speaker Verification

Authors:

Yuesheng Zhu

ITCC '22: Proceedings of the 4th International Conference on Information Technology and Computer Communications

Pages 53 - 58

https://doi.org/10.1145/3548636.3548644

Published: 23 August 2022

Abstract

Deep learning has greatly improved the performance of speaker recognition systems. However, most current work still relies on handcrafted features, and existing raw waveform-based systems fail to use multi-scale features and multi-level information efficiently. In addition, speaker embeddings generated by speaker identification training are then used for speaker verification through similarity scoring, which results in a domain mismatch. To address these problems, we propose an end-to-end system called Wav2sv, which uses a stack of strided convolution layers as the feature encoder and SE-Res2Blocks with dense connections between frame layers as the frame aggregator, and obtains the speaker embedding with a metric learning objective. The system thus learns speaker embeddings suited to speaker verification directly from raw waveforms. Our simulation results on VoxCeleb1 indicate that the proposed approach achieves an equal error rate (EER) of 4.75%, an improvement of 18% over the Wav2spk baseline. Our work demonstrates the great potential of extracting speaker embeddings from raw waveforms.
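
To make the evaluation terms in the abstract concrete, the following is a minimal sketch of how verification trials are commonly scored by similarity and summarized as an EER. It is not the authors' implementation: the function names, the use of cosine similarity as the scoring function, and the toy scores are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(scores, labels):
    """Approximate the EER of a set of verification trials.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1 for a target (same-speaker) trial, 0 for a non-target trial.
    Each observed score is tried as the accept threshold; the EER is taken
    where the false acceptance and false rejection rates are closest.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list (hypothetical values, not results from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
print(round(equal_error_rate(toy_scores, toy_labels), 3))  # prints 0.333
```

In practice the scores would come from comparing the embeddings of the two utterances in each trial, for example with `cosine_similarity`, and the threshold sweep is usually done with an ROC routine from a numerical library rather than the quadratic loop shown here.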

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks
2. Security and privacy
  1. Security services
    1. Authentication
      1. Biometrics

Published In

ITCC '22: Proceedings of the 4th International Conference on Information Technology and Computer Communications

June 2022

138 pages

ISBN:9781450396820

DOI:10.1145/3548636

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ITCC 2022

ITCC 2022: 2022 4th International Conference on Information Technology and Computer Communications

June 23 - 25, 2022

Guangzhou, China
