DOI: 10.1145/3573428.3573649
research-article

Data augmentation for speaker verification

Published: 15 March 2023

ABSTRACT

Data augmentation is an active topic in neural network training. In this paper, we investigate data augmentation for speaker recognition and propose two data augmentation methods to improve the performance of neural network systems. The first is spectral augmentation, a recently proposed method that achieved state-of-the-art performance in speech recognition by masking blocks of frequency channels (F-mask) and/or blocks of time steps (T-mask). The second is speed perturbation, which adjusts the time scale of a given audio signal without altering its pitch content. Experimental results show that both methods improve performance. By combining TF-mask with speed perturbation, we obtain relative improvements of more than 5.2% and 8.7% over the baseline systems on the Vox-H and WX tasks, respectively.
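
The masking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `apply_tf_mask`, the mask-width limits `max_f` and `max_t`, the zero fill value, and the use of a single mask per axis are all assumptions made for illustration. The spectrogram is represented as a plain list of frames (time steps by frequency channels) so the example needs only the standard library.

```python
import random


def apply_tf_mask(spec, max_f=8, max_t=10, rng=None, fill=0.0):
    """Return a masked copy of `spec` plus the (start, width) of each mask.

    `spec` is a list of frames; each frame is a list of per-channel values
    (shape: time steps x frequency channels). Mask widths are drawn
    uniformly from [0, max_f] and [0, max_t]. These limits are illustrative
    assumptions, not values taken from the paper.
    """
    rng = rng or random.Random()
    n_t = len(spec)
    n_f = len(spec[0]) if n_t else 0
    out = [list(frame) for frame in spec]  # copy so the input is untouched

    # F-mask: overwrite f consecutive frequency channels in every frame.
    f = rng.randint(0, min(max_f, n_f))
    f0 = rng.randint(0, n_f - f)
    for frame in out:
        for k in range(f0, f0 + f):
            frame[k] = fill

    # T-mask: overwrite t consecutive frames across all channels.
    t = rng.randint(0, min(max_t, n_t))
    t0 = rng.randint(0, n_t - t)
    for i in range(t0, t0 + t):
        out[i] = [fill] * n_f

    return out, (f0, f), (t0, t)


# Usage: mask a dummy 100-frame, 40-channel spectrogram of ones.
spec = [[1.0] * 40 for _ in range(100)]
masked, f_mask, t_mask = apply_tf_mask(spec, rng=random.Random(0))
```

Applying only the first loop corresponds to an F-mask, only the second to a T-mask, and both together to the TF-mask setting. Speed perturbation is not sketched here, since changing the time scale without altering pitch requires a time-scale modification algorithm that is beyond a few lines of standard-library code.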


  • Published in

    EITCE '22: Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering
    October 2022
    1999 pages
    ISBN:9781450397148
    DOI:10.1145/3573428

    Copyright © 2022 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 508 of 972 submissions, 52%