ISCA Archive Interspeech 2016

Supervised Learning of Acoustic Models in a Zero Resource Setting to Improve DPGMM Clustering

Michael Heck, Sakriani Sakti, Satoshi Nakamura

In this work we apply a supervised acoustic model training pipeline without supervision to improve Dirichlet process Gaussian mixture model (DPGMM) based feature vector clustering. We exploit methods common in supervised acoustic modeling to learn, without supervision, feature transformations that are applied to the input data prior to clustering. The idea is to automatically find mappings of feature vectors into sub-spaces that are more robust to channel, context and speaker variability. The need for labels makes these techniques difficult to use in a zero resource setting. To overcome this issue we use a first iteration of DPGMM clustering to generate frame-based class labels for the target data. These labels serve as the basis for learning an acoustic model in the form of hidden Markov models (HMMs) using linear discriminant analysis (LDA), maximum likelihood linear transform (MLLT) and speaker adaptive training (SAT). We show that the learned transformations lead to features that consistently outperform untransformed features on the ABX sound class discriminability task. We also demonstrate that combining multiple clustering runs further enhances sound class discriminability.
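
The following is a minimal sketch of the two-pass idea described in the abstract, not the authors' actual pipeline: scikit-learn's BayesianGaussianMixture with a Dirichlet process prior stands in for the DPGMM sampler, and only the LDA step of the LDA+MLLT+SAT chain is shown (MLLT, SAT and the HMM stage are omitted). The synthetic features, component counts and dimensionalities are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for acoustic feature frames (e.g. 39-dim MFCCs).
feats, _ = make_blobs(n_samples=5000, n_features=39, centers=10, random_state=0)

def dpgmm_labels(x, truncation=50, seed=0):
    """First/second-pass clustering: frame-level labels from a truncated DPGMM."""
    dpgmm = BayesianGaussianMixture(
        n_components=truncation,
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="diag", max_iter=200, random_state=seed)
    return dpgmm.fit_predict(x)

# Pass 1: generate frame-based pseudo-labels without any transcription.
pseudo_labels = dpgmm_labels(feats)

# "Supervised" step without supervision: learn an LDA transform from the
# pseudo-labels, mapping frames into a more class-discriminative sub-space.
n_classes = len(np.unique(pseudo_labels))
lda = LinearDiscriminantAnalysis(n_components=min(13, n_classes - 1))
feats_lda = lda.fit(feats, pseudo_labels).transform(feats)

# Pass 2: re-cluster the transformed features.
relabels = dpgmm_labels(feats_lda)

In the paper's setting the transformed features would then be evaluated with the ABX sound class discriminability task; the sketch stops at the second clustering pass.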


doi: 10.21437/Interspeech.2016-988

Cite as: Heck, M., Sakti, S., Nakamura, S. (2016) Supervised Learning of Acoustic Models in a Zero Resource Setting to Improve DPGMM Clustering. Proc. Interspeech 2016, 1310-1314, doi: 10.21437/Interspeech.2016-988

@inproceedings{heck16_interspeech,
  author={Michael Heck and Sakriani Sakti and Satoshi Nakamura},
  title={{Supervised Learning of Acoustic Models in a Zero Resource Setting to Improve DPGMM Clustering}},
  year={2016},
  booktitle={Proc. Interspeech 2016},
  pages={1310--1314},
  doi={10.21437/Interspeech.2016-988}
}