Abstract:
Machine Speech Chain integrates end-to-end (E2E) automatic speech recognition (ASR) and neural text-to-speech (TTS) into one closed loop for joint training. It has been shown to effectively leverage large amounts of unpaired data in the spirit of data augmentation. In this paper, we explore the TTS→ASR pipeline in the machine speech chain to perform domain adaptation for both the E2E ASR and neural TTS models using only text data from the target domain. We conduct experiments by adapting from the audiobook domain (LibriSpeech) to the presentation domain (TED-LIUM). We obtain a relative word error rate (WER) reduction of 19.7% for the E2E ASR model on the TED-LIUM test set, and a relative WER reduction of 29.4% for synthetic speech generated by the neural TTS model in the presentation domain. Moreover, we observe that the gains from the proposed method and from conventional language model adaptation are additive.
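The TTS→ASR adaptation loop summarized above can be sketched as follows. This is a minimal illustrative sketch only, not the paper's implementation: `toy_tts` and `ToyASR` are hypothetical stand-ins for the neural TTS and E2E ASR models, and the "training" step merely records each synthetic pair rather than performing a gradient update.

```python
# Hedged sketch of the TTS->ASR pipeline: text-only data from the target
# domain is synthesized into speech by TTS, and the resulting
# (synthetic speech, text) pairs are fed to the ASR model for adaptation.

def toy_tts(text):
    """Stand-in TTS: map each character to a fake acoustic 'frame'."""
    return [ord(c) / 128.0 for c in text]

class ToyASR:
    """Stand-in ASR that only counts the synthetic pairs it consumes."""
    def __init__(self):
        self.adaptation_pairs = 0

    def train_step(self, speech, transcript):
        # A real system would run a gradient update on this pair;
        # here we only record that the pair was used.
        assert len(speech) == len(transcript)
        self.adaptation_pairs += 1

# Text-only data from the target (presentation) domain.
target_domain_text = [
    "welcome to my talk",
    "the next slide shows our results",
]

asr = ToyASR()
for sentence in target_domain_text:
    synthetic_speech = toy_tts(sentence)        # TTS step
    asr.train_step(synthetic_speech, sentence)  # ASR step on synthetic pair

print(asr.adaptation_pairs)
```

The key property illustrated is that no paired audio from the target domain is required: every training pair the ASR model sees is synthesized from text alone.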
Published in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 23-27 May 2022
Date Added to IEEE Xplore: 27 April 2022