ISCA Archive Interspeech 2022

A Transfer and Multi-Task Learning based Approach for MOS Prediction

Xiaohai Tian, Kaiqi Fu, Shaojun Gao, Yiwei Gu, Kai Wang, Wei Li, Zejun Ma

Automatic speech quality assessment aims to train a model that automatically measures the performance of synthesis systems. This is a challenging task, especially when the domain of the evaluation data differs from that of the training data. In this paper, we present a multi-task and transfer learning framework for predicting the mean opinion score (MOS) of synthetic speech from different domains. Specifically, the proposed framework consists of a common encoder shared by data from different domains and two domain-specific decoders, one for in-domain data and one for out-of-domain data. A wav2vec2 model fine-tuned for the phone recognition task is used to initialize the shared encoder, so that it can make full use of the knowledge learned from a large amount of unlabeled data and from task-related labeled data. The experiments are conducted on the VoiceMOS Challenge dataset. The results show that the proposed system outperforms the baseline solutions in both in-domain and out-of-domain MOS prediction scenarios. Further, we show that the wav2vec2 encoder fine-tuned for phone recognition can be transferred to boost the performance of MOS prediction.
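
As a rough illustration of the shared-encoder, two-decoder layout described in the abstract, the following minimal Python sketch routes an utterance through one common encoder and then through a decoder chosen by domain. It is not the authors' implementation: the names `MultiDomainMOSPredictor`, `LinearHead`, and `mean_pool` are hypothetical, the encoder is a placeholder callable standing in for a fine-tuned wav2vec2, the decoders are plain linear heads with random weights, and the clamp to a 1-to-5 range is an assumption about the MOS scale.

```python
import random


def mean_pool(frames):
    """Average frame-level feature vectors into one utterance-level vector."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


class LinearHead:
    """Hypothetical domain-specific decoder: a single linear layer."""

    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias

    def __call__(self, vec):
        return sum(w * x for w, x in zip(self.weights, vec)) + self.bias


class MultiDomainMOSPredictor:
    """One encoder shared across domains, one decoder per domain."""

    def __init__(self, encoder, dim, seed=0):
        rng = random.Random(seed)
        # Placeholder for the shared encoder (a fine-tuned wav2vec2 in the paper).
        self.encoder = encoder
        # Two domain-specific decoders with untrained, randomly drawn weights.
        self.decoders = {
            domain: LinearHead([rng.uniform(-0.1, 0.1) for _ in range(dim)], bias=3.0)
            for domain in ("in_domain", "out_of_domain")
        }

    def predict(self, frames, domain):
        shared = mean_pool(self.encoder(frames))  # same encoder for every domain
        score = self.decoders[domain](shared)     # decoder selected by domain
        return min(5.0, max(1.0, score))          # assumed 1-to-5 MOS range


# Identity encoder used only so the sketch runs without any model weights.
predictor = MultiDomainMOSPredictor(encoder=lambda frames: frames, dim=2)
frames = [[0.1, 0.2], [0.3, 0.4]]
in_domain_score = predictor.predict(frames, "in_domain")
out_of_domain_score = predictor.predict(frames, "out_of_domain")
```

In a multi-task training setup of this kind, losses from both decoders would update the shared encoder, while each decoder would only see data from its own domain.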


doi: 10.21437/Interspeech.2022-10022

Cite as: Tian, X., Fu, K., Gao, S., Gu, Y., Wang, K., Li, W., Ma, Z. (2022) A Transfer and Multi-Task Learning based Approach for MOS Prediction. Proc. Interspeech 2022, 5438-5442, doi: 10.21437/Interspeech.2022-10022

@inproceedings{tian22d_interspeech,
  author={Xiaohai Tian and Kaiqi Fu and Shaojun Gao and Yiwei Gu and Kai Wang and Wei Li and Zejun Ma},
  title={{A Transfer and Multi-Task Learning based Approach for MOS Prediction}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={5438--5442},
  doi={10.21437/Interspeech.2022-10022}
}