Abstract:
Vocoders in Text-To-Speech (TTS) systems convert acoustic feature representations, such as the Mel spectrogram, into a sound waveform. Recent vocoders such as WaveRNN [1], Parallel WaveGAN [2], HiFi-GAN [3], and diffusion models [4], [5] have mostly introduced neural architectures that outperform traditional approaches such as those based on the Griffin-Lim algorithm (GLA) [6]. In this paper, a multi-band Parallel WaveGAN (PWG) architecture, the Harmonic-plus-Noise (H+N) vocoder, is trained, implemented, and combined with two types of filters, a) a Linear Prediction (LP) filter and b) a Perceptual Weighting (PW) filter, to improve TTS quality for the Filipino language. HN-PWG obtained the highest total Mean Opinion Score (MOS) at 4.59 ± 0.10, closely followed by HN-PWG-PW at 4.58 ± 0.07, with no statistically significant difference between the two. All of the implemented H+N systems outperformed the Tacotron2-based Filipino TTS that uses a WaveGlow vocoder in terms of MOS.
Date of Conference: 25-27 October 2023
Date Added to IEEE Xplore: 15 November 2023