Language Identification Based on Generative Modeling of Posteriorgram Sequences Extracted from Frame-by-Frame DNNs and LSTM-RNNs

Masumura, Ryo; Asami, Taichi; Masataki, Hirokazu; Aono, Yushi; Sakauchi, Sumitaka

doi:10.21437/Interspeech.2016-719

This paper aims to enhance spoken language identification methods based on direct discriminative modeling of language labels using deep neural networks (DNNs) and long short-term memory recurrent neural networks (LSTM-RNNs). In conventional methods, frame-by-frame DNNs or LSTM-RNNs are used for utterance-level classification. Although they offer strong frame-level classification performance and real-time efficiency, they are not optimized for variable-length utterance-level classification, since the utterance-level decision is made by simply averaging the frame-level prediction results. In addition, this simple classification scheme cannot fully exploit the combination of DNNs and LSTM-RNNs. To address these issues, we combine the frame-by-frame DNNs and LSTM-RNNs with a classifier based on a sequential generative model. In the proposed method, we regard the posteriorgram sequences generated by a frame-by-frame classifier as feature sequences and model them separately for each language using language modeling technologies. Because the generative-model-based classifier does not model an identification boundary, it can flexibly handle variable-length utterances without losing the advantages of the conventional methods. Furthermore, the proposed method supports the combination of DNNs and LSTM-RNNs through joint posteriorgram sequences, whose generative modeling can capture the differences between the two posteriorgram sequences. Experiments conducted using the GlobalPhone database demonstrate the proposed method’s effectiveness.
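
The contrast drawn in the abstract, averaging frame-level posteriors versus modeling the posteriorgram sequence generatively for each language, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: each frame is quantized to its top-scoring class, the per-language generative model is an add-one-smoothed bigram model, and the names `average_posteriors`, `to_symbols`, `BigramModel`, and `classify`, as well as the toy data, are hypothetical.

```python
import math
from collections import defaultdict

def average_posteriors(posteriorgram):
    """Conventional baseline: average frame-level posteriors, return the argmax index."""
    n_classes = len(posteriorgram[0])
    means = [sum(frame[k] for frame in posteriorgram) / len(posteriorgram)
             for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: means[k])

def to_symbols(posteriorgram):
    """Assumption: quantize each frame to the index of its highest posterior."""
    return [max(range(len(frame)), key=lambda k: frame[k]) for frame in posteriorgram]

class BigramModel:
    """Add-alpha smoothed bigram model over symbol sequences (illustrative choice)."""
    START = -1  # marker for the beginning of a sequence

    def __init__(self, n_symbols, alpha=1.0):
        self.n_symbols = n_symbols
        self.alpha = alpha
        self.counts = defaultdict(lambda: defaultdict(float))

    def fit(self, sequences):
        for seq in sequences:
            prev = self.START
            for sym in seq:
                self.counts[prev][sym] += 1.0
                prev = sym
        return self

    def log_likelihood(self, seq):
        total, prev = 0.0, self.START
        for sym in seq:
            row = self.counts.get(prev, {})
            num = row.get(sym, 0.0) + self.alpha
            den = sum(row.values()) + self.alpha * self.n_symbols
            total += math.log(num / den)
            prev = sym
        return total

def classify(posteriorgram, models):
    """Pick the language whose generative model gives the highest log-likelihood."""
    symbols = to_symbols(posteriorgram)
    return max(models, key=lambda lang: models[lang].log_likelihood(symbols))

# Toy training data (invented): symbol sequences observed for each language.
# Language 0 tends to produce stable runs, language 1 alternating symbols.
models = {
    0: BigramModel(n_symbols=2).fit([[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]]),
    1: BigramModel(n_symbols=2).fit([[0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1, 0]]),
}

# Toy test utterance: frame posteriors alternate between the two classes.
test_utterance = [[0.9, 0.1], [0.4, 0.6]] * 3

baseline_label = average_posteriors(test_utterance)
generative_label = classify(test_utterance, models)
short_label = classify(test_utterance[:4], models)  # a shorter utterance needs no special handling
```

In this toy setup the averaged posteriors favor class 0, while the sequence model picks language 1 because it scores the alternation pattern rather than the frame-wise mean. A joint DNN/LSTM-RNN sequence could presumably be handled in the same spirit by treating the pair of per-frame symbols as a single symbol, though that pairing is also an assumption of this sketch.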

Cite as: Masumura, R., Asami, T., Masataki, H., Aono, Y., Sakauchi, S. (2016) Language Identification Based on Generative Modeling of Posteriorgram Sequences Extracted from Frame-by-Frame DNNs and LSTM-RNNs. Proc. Interspeech 2016, 3275-3279, doi: 10.21437/Interspeech.2016-719

@inproceedings{masumura16_interspeech,
  author={Ryo Masumura and Taichi Asami and Hirokazu Masataki and Yushi Aono and Sumitaka Sakauchi},
  title={{Language Identification Based on Generative Modeling of Posteriorgram Sequences Extracted from Frame-by-Frame DNNs and LSTM-RNNs}},
  year=2016,
  booktitle={Proc. Interspeech 2016},
  pages={3275--3279},
  doi={10.21437/Interspeech.2016-719}
}