ISCA Archive Interspeech 2016

Articulatory-to-Acoustic Conversion with Cascaded Prediction of Spectral and Excitation Features Using Neural Networks

Zheng-Chen Liu, Zhen-Hua Ling, Li-Rong Dai

This paper presents an articulatory-to-acoustic conversion method using electromagnetic midsagittal articulography (EMA) measurements as input features. Neural networks, including feed-forward deep neural networks (DNNs) and recurrent neural networks (RNNs) with long short-term memory (LSTM) cells, are adopted to map EMA features not only to spectral features (i.e. mel-cepstra) but also to excitation features (i.e. power, U/V flag and F0). Speech waveforms are then reconstructed using the predicted spectral and excitation features. A cascaded prediction strategy is proposed that uses the predicted spectral features as auxiliary input to boost the prediction of excitation features. Experimental results show that LSTM-RNN models achieve better objective and subjective performance in articulatory-to-spectral conversion than DNNs and Gaussian mixture models (GMMs). The cascaded prediction strategy increases the accuracy of excitation feature prediction, and the neural network-based methods also outperform the GMM-based approach when predicting power features.
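The cascaded strategy can be illustrated with a minimal sketch: a first LSTM-RNN maps EMA features to mel-cepstra, and a second LSTM-RNN takes the EMA features concatenated with the first stage's predicted mel-cepstra and predicts the excitation features. The sketch below assumes PyTorch; the feature dimensions, hidden sizes, and class names are hypothetical illustrations, not the authors' configuration.

import torch
import torch.nn as nn

# Illustrative feature dimensions (hypothetical; not specified in the abstract).
EMA_DIM = 18        # e.g., x/y coordinates of EMA sensor coils
MCEP_DIM = 41       # mel-cepstral coefficients (spectral features)
EXC_DIM = 3         # excitation features: power, U/V flag, F0

class SpectralLSTM(nn.Module):
    """Stage 1: EMA features -> spectral features (mel-cepstra)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(EMA_DIM, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, MCEP_DIM)

    def forward(self, ema):                # ema: (batch, time, EMA_DIM)
        h, _ = self.lstm(ema)
        return self.out(h)                 # (batch, time, MCEP_DIM)

class ExcitationLSTM(nn.Module):
    """Stage 2: EMA features + predicted mel-cepstra -> excitation features."""
    def __init__(self, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(EMA_DIM + MCEP_DIM, hidden, num_layers=2,
                            batch_first=True)
        self.out = nn.Linear(hidden, EXC_DIM)

    def forward(self, ema, mcep_pred):
        x = torch.cat([ema, mcep_pred], dim=-1)  # cascaded auxiliary input
        h, _ = self.lstm(x)
        return self.out(h)                       # power, U/V logit, F0

# Cascaded prediction at conversion time:
spectral_net, excitation_net = SpectralLSTM(), ExcitationLSTM()
ema = torch.randn(1, 100, EMA_DIM)         # 100 frames of EMA input
mcep = spectral_net(ema)                   # stage 1: spectral prediction
exc = excitation_net(ema, mcep.detach())   # stage 2: conditioned on stage 1

The predicted mel-cepstra and excitation features would then be passed to a vocoder for waveform reconstruction; the detach() call reflects the two-stage setup, where the excitation predictor consumes the spectral predictions as fixed auxiliary input rather than backpropagating through the first network.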


doi: 10.21437/Interspeech.2016-715

Cite as: Liu, Z.-C., Ling, Z.-H., Dai, L.-R. (2016) Articulatory-to-Acoustic Conversion with Cascaded Prediction of Spectral and Excitation Features Using Neural Networks. Proc. Interspeech 2016, 1502-1506, doi: 10.21437/Interspeech.2016-715

@inproceedings{liu16g_interspeech,
  author={Zheng-Chen Liu and Zhen-Hua Ling and Li-Rong Dai},
  title={{Articulatory-to-Acoustic Conversion with Cascaded Prediction of Spectral and Excitation Features Using Neural Networks}},
  year=2016,
  booktitle={Proc. Interspeech 2016},
  pages={1502--1506},
  doi={10.21437/Interspeech.2016-715}
}