We introduce a new model for emotion conversion in speech based on highway neural networks. Our model uses the contextual pitch, energy, and spectral information of a source emotional utterance to predict the framewise fundamental frequency and signal intensity under a target emotion. We also incorporate a latent gender representation to promote cross-speaker generalizability. Our neural network is trained to maximize the log-likelihood of the prediction error under an assumed Laplacian distribution. We validate our model on the VESUS repository collected at Johns Hopkins University, which contains parallel emotional utterances from 10 actors across 5 emotional classes. The proposed algorithm outperforms three state-of-the-art baselines in terms of the mean absolute error and the correlation between the predicted and target values. We also evaluate the quality of our emotion manipulations via crowd-sourcing. Finally, we apply our emotion morphing model to utterances generated by WaveNet to demonstrate our unique ability to inject emotion into synthetic speech.
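As a rough illustration of the two building blocks named in the abstract, the sketch below shows a single highway layer and a Laplacian maximum-likelihood training objective in PyTorch. The layer width, activation choice, gate initialization, and the treatment of the Laplacian scale are illustrative assumptions, not the paper's exact configuration.

import math
import torch
import torch.nn as nn


class HighwayLayer(nn.Module):
    """One highway layer: y = H(x) * T(x) + x * (1 - T(x))."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # H(x): nonlinear transform
        self.gate = nn.Linear(dim, dim)       # T(x): transform gate
        # Bias the gate toward the carry (identity) path at initialization.
        nn.init.constant_(self.gate.bias, -1.0)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        return h * t + x * (1.0 - t)


def laplace_nll(pred, target, log_scale):
    """Negative log-likelihood of the prediction error under a Laplacian
    centered at `pred` with scale exp(log_scale). Minimizing this is the
    maximum-likelihood objective; with a fixed scale it reduces to an
    L1 (mean absolute error) loss."""
    scale = torch.exp(log_scale)
    return (torch.abs(target - pred) / scale + log_scale + math.log(2.0)).mean()

In practice the predicted framewise F0 and intensity would play the role of `pred`, the parallel target-emotion values the role of `target`, and the scale can be fixed or learned jointly with the network.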
Cite as: Shankar, R., Sager, J., Venkataraman, A. (2019) A Multi-Speaker Emotion Morphing Model Using Highway Networks and Maximum Likelihood Objective. Proc. Interspeech 2019, 2848-2852, doi: 10.21437/Interspeech.2019-2512
@inproceedings{shankar19b_interspeech,
  author={Ravi Shankar and Jacob Sager and Archana Venkataraman},
  title={{A Multi-Speaker Emotion Morphing Model Using Highway Networks and Maximum Likelihood Objective}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={2848--2852},
  doi={10.21437/Interspeech.2019-2512}
}