Sequence-to-sequence (S2S) text-to-speech (TTS) models can synthesise
high-quality speech when large amounts of annotated training data are
available. Transcription errors exist in all data and are especially
prevalent in found data such as audiobooks. In previous generations
of TTS technology, alignment using Hidden Markov Models (HMMs) was
widely used to identify and eliminate bad data. In S2S models, the
use of attention replaces HMM-based alignment, and there is no explicit
mechanism for removing bad data. It is not yet understood how such
models deal with transcription errors in the training data.
We evaluate the quality of speech from S2S-TTS models when trained on
data with imperfect transcripts, either simulated by corruption or
produced by an Automatic Speech Recogniser (ASR). We find that
attention can skip over extraneous words in the
input sequence, providing robustness to insertion errors. However,
substitutions and deletions pose a problem because no ground-truth
input is available to align to the ground-truth acoustics during
teacher-forced training. We conclude that S2S-TTS systems are only partially robust
to training on imperfectly-transcribed data and further work is needed.
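The corruption-based simulation of transcription errors could be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function name `corrupt_transcript`, the substitute-word vocabulary, and the equal split of the error budget across insertion, substitution, and deletion are all assumptions.

```python
import random

def corrupt_transcript(words, vocab, error_rate=0.1, seed=0):
    """Illustrative transcript corruption (assumed, not the paper's exact method).

    Each word is, with total probability `error_rate`, either substituted
    by a random vocabulary word, deleted, or followed by a spurious
    inserted word; otherwise it is kept unchanged.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    out = []
    for w in words:
        r = rng.random()
        if r < error_rate / 3:
            out.append(rng.choice(vocab))   # substitution
        elif r < 2 * error_rate / 3:
            continue                        # deletion
        elif r < error_rate:
            out.append(w)
            out.append(rng.choice(vocab))   # insertion after the word
        else:
            out.append(w)                   # kept intact
    return out

clean = "the cat sat on the mat".split()
noisy = corrupt_transcript(clean, vocab=["dog", "ran", "hat"], error_rate=0.3)
```

With `error_rate=0` the transcript passes through unchanged, which makes the clean-data baseline a special case of the same pipeline.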
Cite as: Fong, J., Gallegos, P.O., Hodari, Z., King, S. (2019) Investigating the Robustness of Sequence-to-Sequence Text-to-Speech Models to Imperfectly-Transcribed Training Data. Proc. Interspeech 2019, 1546-1550, doi: 10.21437/Interspeech.2019-1824
@inproceedings{fong19_interspeech,
  author    = {Jason Fong and Pilar Oplustil Gallegos and Zack Hodari and Simon King},
  title     = {{Investigating the Robustness of Sequence-to-Sequence Text-to-Speech Models to Imperfectly-Transcribed Training Data}},
  year      = {2019},
  booktitle = {Proc. Interspeech 2019},
  pages     = {1546--1550},
  doi       = {10.21437/Interspeech.2019-1824}
}