ABSTRACT
Data augmentation is an active topic in neural network training. In this paper, we investigate data augmentation for speaker recognition and propose two methods to enhance the performance of neural network systems. The first is spectral augmentation, a recently proposed data augmentation method that achieved state-of-the-art performance in speech recognition by masking blocks of frequency channels (F-mask) and/or blocks of time steps (T-mask). The second is speed perturbation, which adjusts the time scale of a given audio signal without altering its pitch content. Experimental results show that both methods boost performance. By combining TF-mask with speed perturbation, we obtain more than 5.2% and 8.7% relative improvements over the baseline systems on the Vox-H and WX tasks.
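The F-mask/T-mask idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `spec_mask` and the mask-width parameters are placeholders chosen here for clarity, and the masked regions are simply zeroed, assuming a log-mel spectrogram laid out as (frequency bins, time steps).

```python
import numpy as np

def spec_mask(spec, num_f_masks=1, max_f=8, num_t_masks=1, max_t=10, rng=None):
    """SpecAugment-style masking on a 2-D spectrogram (freq_bins, time_steps).

    Returns a masked copy; the input array is left untouched.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(num_f_masks):
        # F-mask: zero a randomly placed band of frequency channels
        f = int(rng.integers(0, max_f + 1))
        f0 = int(rng.integers(0, max(1, n_freq - f + 1)))
        out[f0:f0 + f, :] = 0.0
    for _ in range(num_t_masks):
        # T-mask: zero a randomly placed block of time steps
        t = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, max(1, n_time - t + 1)))
        out[:, t0:t0 + t] = 0.0
    return out
```

Applying both mask types to the same utterance gives the combined TF-mask used in the experiments; speed perturbation, by contrast, operates on the waveform before feature extraction (e.g. with a time-scale modification tool such as SoundTouch, cited above).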