This paper introduces a new method to extract speaker embeddings from a deep neural network (DNN) for text-independent speaker verification. Usually, speaker embeddings are extracted from a speaker-classification DNN that averages the hidden vectors over the frames of a speaker; the hidden vectors produced from all the frames are assumed to be equally important. We relax this assumption and compute the speaker embedding as a weighted average of a speaker's frame-level hidden vectors, whose weights are automatically determined by a self-attention mechanism. The effect of multiple attention heads is also investigated to capture different aspects of a speaker's input speech. Finally, a PLDA classifier is used to compare pairs of embeddings. The proposed self-attentive speaker embedding system is compared with a strong DNN embedding baseline on NIST SRE 2016. We find that the self-attentive embeddings achieve superior performance. Moreover, the improvement produced by the self-attentive speaker embeddings is consistent across both short and long test utterances.
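The pooling step the abstract describes can be illustrated with a small sketch: frame-level hidden vectors are combined by a weighted average, with per-frame weights produced by a softmax over attention scores, and one set of weights per attention head. The function below is a minimal NumPy illustration, not the paper's implementation; the projection matrices `W1` and `w2` and their shapes are assumptions standing in for the attention network's learned parameters.

```python
import numpy as np

def self_attentive_pooling(H, W1, w2):
    """Illustrative multi-head self-attentive pooling.

    H  : (T, D)       frame-level hidden vectors for one utterance
    W1 : (D, d_a)     hypothetical first attention projection
    w2 : (d_a, heads) hypothetical per-head scoring vectors
    Returns a (heads * D,) embedding: one weighted average per head,
    concatenated.
    """
    # One attention score per frame and per head.
    scores = np.tanh(H @ W1) @ w2            # (T, heads)
    # Softmax over the time axis so each head's weights sum to 1.
    scores = np.exp(scores - scores.max(axis=0, keepdims=True))
    A = scores / scores.sum(axis=0, keepdims=True)
    # Each head computes a weighted average of the frames.
    E = A.T @ H                              # (heads, D)
    return E.reshape(-1)
```

With zero attention parameters, the softmax weights are uniform and the result reduces to plain frame averaging, which is exactly the baseline assumption the paper relaxes.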
Cite as: Zhu, Y., Ko, T., Snyder, D., Mak, B., Povey, D. (2018) Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification. Proc. Interspeech 2018, 3573-3577, doi: 10.21437/Interspeech.2018-1158
@inproceedings{zhu18_interspeech,
  author={Yingke Zhu and Tom Ko and David Snyder and Brian Mak and Daniel Povey},
  title={{Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification}},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={3573--3577},
  doi={10.21437/Interspeech.2018-1158}
}