ABSTRACT
Data augmentation is an active topic in neural network training. In this paper, we investigate data augmentation for speaker recognition and propose two methods to enhance the performance of neural network systems. The first is spectral augmentation, a recently proposed data augmentation method that achieved state-of-the-art performance in speech recognition by masking blocks of frequency channels (F-mask) and/or blocks of time steps (T-mask). The second is speed perturbation, which adjusts the time scale of a given audio signal without altering its pitch content. Experimental results show that both methods boost performance. By combining TF-mask with speed perturbation, we obtain more than 5.2% and 8.7% relative improvements over the baseline systems on the Vox-H and WX tasks.
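The F-mask/T-mask idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `spec_mask` and the mask-width parameters are placeholders chosen here for clarity, and the masked regions are simply zeroed, assuming a log-mel spectrogram laid out as (frequency bins, time steps).

```python
import numpy as np

def spec_mask(spec, num_f_masks=1, max_f=8, num_t_masks=1, max_t=10, rng=None):
    """SpecAugment-style masking on a 2-D spectrogram (freq_bins, time_steps).

    Returns a masked copy; the input array is left untouched.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(num_f_masks):
        # F-mask: zero a randomly placed band of frequency channels
        f = int(rng.integers(0, max_f + 1))
        f0 = int(rng.integers(0, max(1, n_freq - f + 1)))
        out[f0:f0 + f, :] = 0.0
    for _ in range(num_t_masks):
        # T-mask: zero a randomly placed block of time steps
        t = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, max(1, n_time - t + 1)))
        out[:, t0:t0 + t] = 0.0
    return out
```

Applying both mask types to the same utterance gives the combined TF-mask used in the experiments; speed perturbation, by contrast, operates on the waveform before feature extraction (e.g. with a time-scale modification tool such as SoundTouch, cited above).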