Abstract:
Recent studies have shown that synthetic speech can effectively serve as training data for automatic speech recognition models. The text used to synthesize this speech is mostly taken from in-domain text or generated through text augmentation. However, obtaining large amounts of in-domain text with diverse lexical contexts is difficult, especially in low-resource scenarios. This paper proposes drawing text from a large generic-domain source and applying a domain filtering method to choose the relevant text data. The method involves two filtering steps: 1) selecting text based on its semantic similarity to the available in-domain text and 2) diversifying the vocabulary of the selected text using a greedy-search algorithm. Experimental results show that the proposed method outperforms the conventional text augmentation approach, with relative word error rate reductions ranging from 6% to 25% on the LibriSpeech dataset and 15% on a low-resource Vietnamese dataset.
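The two filtering steps can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the similarity model, threshold, or greedy objective, so everything here is an assumption. Bag-of-words cosine similarity stands in for semantic similarity (a sentence-embedding model would be the more likely choice in practice), the greedy step simply picks the sentence that adds the most unseen words, and the names `select_similar`, `diversify`, `filter_text`, `threshold`, and `budget` are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts (assumed stand-in for a sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_similar(candidates, in_domain, threshold):
    """Step 1 (assumed form): keep generic-domain sentences whose similarity
    to the pooled in-domain text is at least `threshold`."""
    domain_vec = bow(" ".join(in_domain))
    return [s for s in candidates if cosine(bow(s), domain_vec) >= threshold]


def diversify(sentences, budget):
    """Step 2 (assumed form): greedily pick the sentence that contributes
    the most words not yet covered, until `budget` sentences are chosen."""
    remaining = list(sentences)
    covered, chosen = set(), []
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda s: len(set(s.lower().split()) - covered))
        gain = set(best.lower().split()) - covered
        if not gain:
            break  # no remaining sentence adds new vocabulary
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen


def filter_text(candidates, in_domain, threshold=0.2, budget=2):
    """Run both filtering steps in sequence."""
    return diversify(select_similar(candidates, in_domain, threshold), budget)


if __name__ == "__main__":
    in_domain = ["book a flight to hanoi", "cancel my flight booking"]
    generic = [
        "book a flight to paris",
        "book a flight to hanoi",
        "the stock market fell sharply",
        "cancel my hotel booking today",
    ]
    for sentence in filter_text(generic, in_domain):
        print(sentence)
```

In this toy run the off-domain sentence is dropped by the similarity step, and the greedy step prefers a sentence with mostly new words over a near-duplicate of one already chosen. The selected text would then be passed to a speech synthesizer to produce training audio.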
Published in: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Date of Conference: 31 October 2023 - 03 November 2023
Date Added to IEEE Xplore: 20 November 2023