Abstract:
Adaptability and controllability are the major advantages of statistical parametric speech synthesis (SPSS) over unit-selection synthesis. Recently, deep neural networks ...Show MoreMetadata
Abstract:
Adaptability and controllability are the major advantages of statistical parametric speech synthesis (SPSS) over unit-selection synthesis. Recently, deep neural networks (DNNs) have significantly improved the performance of SPSS. However, current studies are mainly focusing on the training of speaker-dependent DNNs, which generally requires a significant amount of data from a single speaker. In this work, we perform a systematic analysis of the training of multi-speaker average voice model (AVM), which is the foundation of adaptability and controllability of a DNN-based speech synthesis system. Specifically, we employ the i-vector framework to factorise the speaker specific information, which allows a variety of speakers to share all the hidden layers. And the speaker identity vector is augmented with linguistic features in the DNN input. We systematically analyse the impact of the implementations of i-vectors and speaker normalisation.
Published in: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)
Date of Conference: 13-16 December 2016
Date Added to IEEE Xplore: 19 January 2017
ISBN Information: