Abstract:
Text normalization (TN) is a crucial preprocessing step in text-to-speech synthesis, as it determines the accurate pronunciation of numbers and symbols within the text. Existing neural network-based TN methods have shown significant success in rich-resource languages. However, these methods are data-driven and rely heavily on large amounts of labeled data, which are not available in zero-resource settings. Rule-based weighted finite-state transducers (WFST) are a common approach to zero-shot TN, but WFST-based methods struggle with ambiguous input, particularly in cases where the normalized form is context-dependent. Conventional neural TN methods, on the other hand, suffer from unrecoverable errors. In this paper, we propose ZSTN, a novel zero-shot TN framework based on cross-lingual knowledge distillation, which uses annotated data to train the teacher model on a rich-resource language and unlabeled data to train the student model on a zero-resource language. Furthermore, it incorporates expert knowledge from WFST into a knowledge distillation neural network. Concretely, a TN model augmented with WFST pseudo-labels is trained as the teacher model in the source language. Subsequently, the student model is supervised by soft labels from the teacher model and WFST pseudo-labels from the target language. By leveraging cross-lingual knowledge distillation, we address contextual ambiguity in the text, while WFST mitigates the unrecoverable errors of the neural model. Additionally, ZSTN adapts to different zero-resource languages through a joint loss function combining the teacher model's supervision with WFST constraints. We also release a zero-shot text normalization dataset in five languages. We compare ZSTN with seven zero-shot TN benchmarks on public datasets in four languages for the teacher model and on zero-shot datasets in five languages for the student model. The results demonstrate that the proposed ZSTN excels in performance without the need for labeled data in the target language.
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 32)
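
To make the student-side training signal described in the abstract concrete, the following is a minimal sketch of a joint objective that combines soft labels from a teacher model with hard WFST pseudo-labels. All names and hyperparameters (alpha, temperature) are illustrative assumptions, not the paper's actual formulation.

```python
# Sketch: student supervised by (1) teacher soft labels via distillation and
# (2) WFST pseudo-labels on the zero-resource language. Weighting is assumed.
import torch
import torch.nn.functional as F


def student_joint_loss(student_logits, teacher_logits, wfst_pseudo_labels,
                       alpha=0.5, temperature=2.0):
    """Joint loss = alpha * distillation term + (1 - alpha) * pseudo-label term."""
    # Soft-label distillation: KL divergence between temperature-scaled distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Hard supervision from WFST pseudo-labels in the target language.
    ce_loss = F.cross_entropy(student_logits, wfst_pseudo_labels)

    return alpha * kd_loss + (1.0 - alpha) * ce_loss


# Toy usage: a batch of 8 tokens classified into 50 hypothetical normalization classes.
student_logits = torch.randn(8, 50)
teacher_logits = torch.randn(8, 50)
pseudo_labels = torch.randint(0, 50, (8,))
print(student_joint_loss(student_logits, teacher_logits, pseudo_labels))
```

In this sketch the distillation term carries the cross-lingual knowledge from the teacher, while the pseudo-label term injects the WFST expert knowledge; how ZSTN actually balances or structures these terms is specified in the paper itself.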