Rapid unsupervised speaker adaptation in an E2E system poses new challenges due to its unified end-to-end structure, in addition to the intrinsic difficulties of data sparsity and imperfect labels [1]. Previously, we proposed utilizing content-relevant personalized speech synthesis for rapid speaker adaptation and achieved a significant performance breakthrough in a hybrid system [2]. In this paper, we answer the following two questions: first, how to effectively perform rapid speaker adaptation in an RNN-T; second, whether our previously proposed approach remains beneficial for the RNN-T, and what modifications and distinct observations it entails. We apply the proposed methodology to a speaker adaptation task in a state-of-the-art presentation transcription RNN-T system. In the 1 min setup, it yields 11.58% or 7.95% relative word error rate (WER) reduction for supervised or unsupervised adaptation, respectively, compared to the negligible gain obtained when adapting with 1 min of source speech. In the 10 min setup, it yields 15.71% or 8.00% relative WER reduction, doubling the gain of source-speech adaptation. We further apply various data filtering techniques and significantly narrow the gap between supervised and unsupervised adaptation.
Cite as: Huang, Y., Li, J., He, L., Wei, W., Gale, W., Gong, Y. (2020) Rapid RNN-T Adaptation Using Personalized Speech Synthesis and Neural Language Generator. Proc. Interspeech 2020, 1256-1260, doi: 10.21437/Interspeech.2020-1290
@inproceedings{huang20c_interspeech,
  author={Yan Huang and Jinyu Li and Lei He and Wenning Wei and William Gale and Yifan Gong},
  title={{Rapid RNN-T Adaptation Using Personalized Speech Synthesis and Neural Language Generator}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1256--1260},
  doi={10.21437/Interspeech.2020-1290}
}