Generating Complementary Acoustic Model Spaces in DNN-Based Sequence-to-Frame DTW Scheme for Out-of-Vocabulary Spoken Term Detection

Lee, Shi-wook; Tanaka, Kazuyo; Itoh, Yoshiaki

doi:10.21437/Interspeech.2016-838

This paper proposes a sequence-to-frame dynamic time warping (DTW) combination approach to improve out-of-vocabulary (OOV) spoken term detection (STD) performance. The goal of this paper is twofold. First, we propose a method that directly adopts the posterior probabilities of a deep neural network (DNN) and a Gaussian mixture model (GMM) as the similarity distance for sequence-to-frame DTW. Second, we investigate combinations of diverse GMM and DNN schemes with different subword units and acoustic models, estimate their complementarity in terms of the performance gap and the correlation between the combined systems, and discuss the resulting performance gain. Evaluations of the combined systems on an OOV STD task show that combining DNN-based systems yields a larger performance gain than combining GMM-based systems. However, the gain obtained by combining DNN- and GMM-based systems is insignificant, even though DNN and GMM are highly heterogeneous, because the performance gap between DNN-based and GMM-based systems is quite large. On the other hand, score fusion of two heterogeneous subword units, triphones and sub-phonetic segments, in DNN-based systems significantly improves performance.
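The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the negative-log-posterior local distance, the step pattern (stay in a state or advance to the next one), the length normalization, the linear fusion weight, and all function names (`neg_log_posterior`, `sequence_to_frame_dtw`, `fuse_scores`) are assumptions made for illustration only.

```python
import numpy as np

def neg_log_posterior(posteriors, query_states, eps=1e-10):
    """Local distance matrix of shape (n_query_states, n_frames).

    posteriors: (n_frames, n_units) frame-level posteriors from an
    acoustic model (DNN or GMM). query_states: subword-unit indices of
    the query term. Using -log p as the distance is an assumption here.
    """
    post = np.asarray(posteriors, dtype=float)
    return -np.log(np.maximum(post[:, query_states].T, eps))

def sequence_to_frame_dtw(dist):
    """Align the whole query sequence to any span of frames.

    Hypothetical step pattern: each new frame either stays in the
    current query state or advances to the next one. The query may
    start at any frame. Returns (length-normalized cost, end frame).
    """
    n_states, n_frames = dist.shape
    acc = np.full((n_states, n_frames), np.inf)
    steps = np.zeros((n_states, n_frames), dtype=int)
    acc[0, :] = dist[0, :]  # free start at any frame
    steps[0, :] = 1
    for q in range(1, n_states):
        for t in range(1, n_frames):
            stay, move = acc[q, t - 1], acc[q - 1, t - 1]
            if move <= stay:
                prev, n = move, steps[q - 1, t - 1]
            else:
                prev, n = stay, steps[q, t - 1]
            if np.isfinite(prev):
                acc[q, t] = prev + dist[q, t]
                steps[q, t] = n + 1
    final = acc[-1]
    norm = np.full(n_frames, np.inf)
    valid = np.isfinite(final)
    norm[valid] = final[valid] / steps[-1][valid]
    end = int(np.argmin(norm))
    return float(norm[end]), end

def fuse_scores(scores_a, scores_b, weight=0.5):
    """Linear score fusion of two systems (weight is an assumed parameter)."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    return weight * a + (1.0 - weight) * b

# Toy example: 6 frames, 3 units; the query [0, 1, 2] is spoken in frames 2..5.
hi, lo = 0.8, 0.1
best_unit = [2, 2, 0, 1, 1, 2]
toy_post = np.full((6, 3), lo)
toy_post[np.arange(6), best_unit] = hi
cost, end_frame = sequence_to_frame_dtw(neg_log_posterior(toy_post, [0, 1, 2]))
```

In this sketch a detection score would be computed per utterance by each system (for example, one using triphones and one using sub-phonetic segments), and `fuse_scores` would combine them before thresholding. Because accumulated costs are non-negative, the first query state always occupies a single frame under this simplified free-start rule.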

Cite as: Lee, S.-w., Tanaka, K., Itoh, Y. (2016) Generating Complementary Acoustic Model Spaces in DNN-Based Sequence-to-Frame DTW Scheme for Out-of-Vocabulary Spoken Term Detection. Proc. Interspeech 2016, 755-759, doi: 10.21437/Interspeech.2016-838

@inproceedings{lee16b_interspeech,
  author={Shi-wook Lee and Kazuyo Tanaka and Yoshiaki Itoh},
  title={{Generating Complementary Acoustic Model Spaces in DNN-Based Sequence-to-Frame DTW Scheme for Out-of-Vocabulary Spoken Term Detection}},
  year=2016,
  booktitle={Proc. Interspeech 2016},
  pages={755--759},
  doi={10.21437/Interspeech.2016-838}
}