Almost all speaker recognition systems involve a step that converts a sequence of frame-level features into a fixed-dimensional representation. In the context of deep neural networks, this step is referred to as statistics pooling. In state-of-the-art speaker recognition systems, statistics pooling is implemented by concatenating the mean and standard deviation of a sequence of frame-level features. However, a single mean and standard deviation are very limited descriptive statistics for an acoustic sequence, even with a powerful feature extractor such as a convolutional neural network. In this paper, we propose a novel statistics pooling method that produces more descriptive statistics through a mixture representation. Our method is inspired by the expectation-maximization (EM) algorithm for Gaussian mixture models (GMMs). Unlike in GMMs, however, the mixture assignments are given by an attention mechanism rather than by the Euclidean distances between frame-level features and explicit centers. Applying the proposed attention mechanism to a 121-layer DenseNet, we achieve an EER of 1.1\% on VoxCeleb1 and an EER of 4.77\% on the VOiCES 2019 evaluation set.
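To make the pooling idea concrete, the following is a minimal NumPy sketch of attention-based mixture statistics pooling as described above: soft frame-to-component assignments come from an attention softmax rather than distances to explicit centers, and per-component weighted means and standard deviations are concatenated into one fixed-dimensional vector. The function name `mixture_stats_pooling`, the single linear attention projection `W`, and all shapes are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(z, axis=0):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_stats_pooling(X, W):
    """Hypothetical sketch of attention-based mixture statistics pooling.

    X: (T, D) frame-level features (T frames, D dims).
    W: (D, C) attention projection, one column per mixture component
       (playing the role of the assignment step in a GMM's E-step).
    Returns a (2*C*D,) vector: per-component weighted mean and std,
    concatenated.
    """
    A = softmax(X @ W, axis=0)            # (T, C): each column sums to 1 over frames
    stats = []
    for c in range(A.shape[1]):
        w = A[:, c:c + 1]                  # (T, 1) soft assignment weights
        mu = (w * X).sum(axis=0)           # weighted mean, (D,)
        var = (w * (X - mu) ** 2).sum(axis=0)
        stats.append(mu)
        stats.append(np.sqrt(var + 1e-8))  # weighted std, epsilon for stability
    return np.concatenate(stats)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 16))          # 50 frames of 16-dim features
W = rng.standard_normal((16, 4))           # 4 mixture components
emb = mixture_stats_pooling(X, W)
print(emb.shape)                           # (128,) = 2 components' stats * 4 * 16
```

With C = 1 this reduces to ordinary mean-plus-standard-deviation statistics pooling; larger C yields the richer mixture representation the abstract argues for.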
Cite as: Lin, W., Mak, M.W., Yi, L. (2020) Learning Mixture Representation for Deep Speaker Embedding Using Attention. Proc. The Speaker and Language Recognition Workshop (Odyssey 2020), 210-214, doi: 10.21437/Odyssey.2020-30
@inproceedings{lin20c_odyssey,
  author    = {Weiwei Lin and Man Wai Mak and Lu Yi},
  title     = {{Learning Mixture Representation for Deep Speaker Embedding Using Attention}},
  year      = {2020},
  booktitle = {Proc. The Speaker and Language Recognition Workshop (Odyssey 2020)},
  pages     = {210--214},
  doi       = {10.21437/Odyssey.2020-30}
}