In Deep Neural Network (DNN) i-vector based speaker recognition systems, acoustic models trained for Automatic Speech Recognition (ASR) are employed to estimate sufficient statistics for i-vector modeling. The DNN-based acoustic model is typically trained on a well-resourced language like English. In evaluation conditions where the enrollment and test data are not in English, as in the NIST SRE 2016 dataset, a DNN acoustic model generalizes poorly. In such conditions, a conventional Universal Background Model/Gaussian Mixture Model (UBM/GMM) based i-vector extractor performs better than the DNN-based i-vector system. In this paper, we address the scenario in which one can develop an Automatic Speech Recognizer with limited resources for a language present in the evaluation condition, thus enabling the use of a DNN acoustic model instead of a UBM/GMM. Experiments are performed on the Tagalog subset of the NIST SRE 2016 dataset assuming an open training condition. With a DNN i-vector system trained for Tagalog, a relative improvement of 12.1% is obtained over a baseline system trained for English.
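The abstract's core mechanism, estimating sufficient statistics from per-frame posteriors, is the same whether those posteriors come from a UBM/GMM or a DNN acoustic model. The following is a minimal NumPy sketch of the zeroth- and first-order Baum-Welch statistics; the function name, array shapes, and toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sufficient_stats(features, posteriors):
    """Zeroth- and first-order Baum-Welch statistics for i-vector modeling.

    features:   (T, D) array of acoustic frames (e.g. MFCCs)
    posteriors: (T, C) array of per-frame posteriors over C classes,
                from either UBM/GMM components or DNN senones
    """
    N = posteriors.sum(axis=0)   # (C,) zeroth-order statistics
    F = posteriors.T @ features  # (C, D) first-order statistics
    return N, F

# Toy example: 100 frames, 20-dim features, 8 posterior classes
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 20))
post = rng.random((100, 8))
post /= post.sum(axis=1, keepdims=True)  # rows sum to 1, as posteriors must

N, F = sufficient_stats(feats, post)
```

Because only the posterior source changes, swapping the English DNN for a Tagalog one leaves the downstream i-vector extractor untouched, which is what makes the comparison in the paper possible.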
Cite as: Madikeri, S., Dey, S., Motlicek, P. (2018) Analysis of Language Dependent Front-End for Speaker Recognition. Proc. Interspeech 2018, 1101-1105, doi: 10.21437/Interspeech.2018-2071
@inproceedings{madikeri18_interspeech,
  author={Srikanth Madikeri and Subhadeep Dey and Petr Motlicek},
  title={{Analysis of Language Dependent Front-End for Speaker Recognition}},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1101--1105},
  doi={10.21437/Interspeech.2018-2071}
}