Sequence-to-sequence (S2S) text-to-speech (TTS) models can synthesise
high-quality speech when large amounts of annotated training data are
available. Transcription errors exist in all data and are especially
prevalent in found data such as audiobooks. In previous generations
of TTS technology, alignment using Hidden Markov Models (HMMs) was
widely used to identify and eliminate bad data. In S2S models, the
use of attention replaces HMM-based alignment, and there is no explicit
mechanism for removing bad data. It is not yet understood how such
models deal with transcription errors in the training data.
We evaluate the quality of speech from S2S-TTS models when trained on
data with imperfect transcripts, either simulated by corruption or
produced by an Automatic Speech Recogniser (ASR). We find that
attention can skip over extraneous words in the
input sequence, providing robustness to insertion errors. However,
substitutions and deletions pose a problem because no ground-truth
input is available to align to the ground-truth acoustics during
teacher-forced training. We conclude that S2S-TTS systems are only partially robust
to training on imperfectly-transcribed data and further work is needed.
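The corruption-based simulation of transcription errors could be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function name `corrupt_transcript`, the substitute-word vocabulary, and the equal split of the error budget across insertion, substitution, and deletion are all assumptions.

```python
import random

def corrupt_transcript(words, vocab, error_rate=0.1, seed=0):
    """Illustrative transcript corruption (assumed, not the paper's exact method).

    Each word is, with total probability `error_rate`, either substituted
    by a random vocabulary word, deleted, or followed by a spurious
    inserted word; otherwise it is kept unchanged.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    out = []
    for w in words:
        r = rng.random()
        if r < error_rate / 3:
            out.append(rng.choice(vocab))   # substitution
        elif r < 2 * error_rate / 3:
            continue                        # deletion
        elif r < error_rate:
            out.append(w)
            out.append(rng.choice(vocab))   # insertion after the word
        else:
            out.append(w)                   # kept intact
    return out

clean = "the cat sat on the mat".split()
noisy = corrupt_transcript(clean, vocab=["dog", "ran", "hat"], error_rate=0.3)
```

With `error_rate=0` the transcript passes through unchanged, which makes the clean-data baseline a special case of the same pipeline.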
Cite as: Fong, J., Gallegos, P.O., Hodari, Z., King, S. (2019) Investigating the Robustness of Sequence-to-Sequence Text-to-Speech Models to Imperfectly-Transcribed Training Data. Proc. Interspeech 2019, 1546-1550, doi: 10.21437/Interspeech.2019-1824
@inproceedings{fong19_interspeech,
  author    = {Jason Fong and Pilar Oplustil Gallegos and Zack Hodari and Simon King},
  title     = {{Investigating the Robustness of Sequence-to-Sequence Text-to-Speech Models to Imperfectly-Transcribed Training Data}},
  year      = {2019},
  booktitle = {Proc. Interspeech 2019},
  pages     = {1546--1550},
  doi       = {10.21437/Interspeech.2019-1824}
}