Almost all speaker recognition systems involve a step that converts a sequence of frame-level features into a fixed-dimensional representation. In the context of deep neural networks, this step is referred to as statistics pooling. In state-of-the-art speaker recognition systems, statistics pooling is implemented by concatenating the mean and standard deviation of a sequence of frame-level features. However, a single mean and standard deviation are very limited descriptive statistics for an acoustic sequence, even with a powerful feature extractor such as a convolutional neural network. In this paper, we propose a novel statistics pooling method that produces more descriptive statistics through a mixture representation. Our method is inspired by the expectation-maximization (EM) algorithm for Gaussian mixture models (GMMs). Unlike in GMMs, however, the mixture assignments are given by an attention mechanism rather than by the Euclidean distances between frame-level features and explicit centers. Applying the proposed attention mechanism to a 121-layer DenseNet, we achieve an EER of 1.1\% on VoxCeleb1 and an EER of 4.77\% on the VOiCES 2019 evaluation set.
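To make the pooling idea concrete, the following is a minimal NumPy sketch of attention-based mixture statistics pooling as described above: soft frame-to-component assignments come from an attention softmax rather than distances to explicit centers, and per-component weighted means and standard deviations are concatenated into one fixed-dimensional vector. The function name `mixture_stats_pooling`, the single linear attention projection `W`, and all shapes are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(z, axis=0):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_stats_pooling(X, W):
    """Hypothetical sketch of attention-based mixture statistics pooling.

    X: (T, D) frame-level features (T frames, D dims).
    W: (D, C) attention projection, one column per mixture component
       (playing the role of the assignment step in a GMM's E-step).
    Returns a (2*C*D,) vector: per-component weighted mean and std,
    concatenated.
    """
    A = softmax(X @ W, axis=0)            # (T, C): each column sums to 1 over frames
    stats = []
    for c in range(A.shape[1]):
        w = A[:, c:c + 1]                  # (T, 1) soft assignment weights
        mu = (w * X).sum(axis=0)           # weighted mean, (D,)
        var = (w * (X - mu) ** 2).sum(axis=0)
        stats.append(mu)
        stats.append(np.sqrt(var + 1e-8))  # weighted std, epsilon for stability
    return np.concatenate(stats)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 16))          # 50 frames of 16-dim features
W = rng.standard_normal((16, 4))           # 4 mixture components
emb = mixture_stats_pooling(X, W)
print(emb.shape)                           # (128,) = 2 components' stats * 4 * 16
```

With C = 1 this reduces to ordinary mean-plus-standard-deviation statistics pooling; larger C yields the richer mixture representation the abstract argues for.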
Cite as: Lin, W., Mak, M.W., Yi, L. (2020) Learning Mixture Representation for Deep Speaker Embedding Using Attention. Proc. The Speaker and Language Recognition Workshop (Odyssey 2020), 210-214, doi: 10.21437/Odyssey.2020-30
@inproceedings{lin20c_odyssey,
  author    = {Weiwei Lin and Man Wai Mak and Lu Yi},
  title     = {{Learning Mixture Representation for Deep Speaker Embedding Using Attention}},
  year      = {2020},
  booktitle = {Proc. The Speaker and Language Recognition Workshop (Odyssey 2020)},
  pages     = {210--214},
  doi       = {10.21437/Odyssey.2020-30}
}