
On the training of DNN-based average voice model for speech synthesis


Abstract:

Adaptability and controllability are the major advantages of statistical parametric speech synthesis (SPSS) over unit-selection synthesis. Recently, deep neural networks (DNNs) have significantly improved the performance of SPSS. However, current studies mainly focus on training speaker-dependent DNNs, which generally requires a significant amount of data from a single speaker. In this work, we perform a systematic analysis of the training of a multi-speaker average voice model (AVM), which is the foundation of adaptability and controllability in a DNN-based speech synthesis system. Specifically, we employ the i-vector framework to factorise speaker-specific information, which allows a variety of speakers to share all the hidden layers. The speaker identity vector is concatenated with the linguistic features at the DNN input. We systematically analyse the impact of the implementations of i-vectors and speaker normalisation.
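
The input augmentation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the layer sizes, the tanh activation, the random weights, and the names `augment_with_ivector`, `init_layer`, `dense`, and `forward` are all assumptions made for illustration. It only shows the general idea of appending one speaker i-vector to every frame of linguistic features so that a single set of hidden layers serves all speakers.

```python
import math
import random


def augment_with_ivector(linguistic_frames, ivector):
    """Append the same speaker i-vector to every frame's linguistic features."""
    return [list(frame) + list(ivector) for frame in linguistic_frames]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activation=None):
    """Compute W x + b, optionally followed by an element-wise activation."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(o) for o in out] if activation else out


def forward(frame, layers):
    """Hidden layers shared by all speakers (tanh), then a linear output layer."""
    h = frame
    for layer in layers[:-1]:
        h = dense(h, layer, math.tanh)
    return dense(h, layers[-1])


# Hypothetical dimensions, chosen only to keep the example small.
LINGUISTIC_DIM, IVECTOR_DIM, HIDDEN_DIM, ACOUSTIC_DIM = 6, 3, 8, 4

rng = random.Random(0)
sizes = [LINGUISTIC_DIM + IVECTOR_DIM, HIDDEN_DIM, HIDDEN_DIM, ACOUSTIC_DIM]
layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

# Five frames of made-up linguistic features and two made-up speaker i-vectors.
frames = [[rng.random() for _ in range(LINGUISTIC_DIM)] for _ in range(5)]
speaker_a = [0.2, -0.1, 0.4]
speaker_b = [-0.3, 0.5, 0.1]

# Same text and same network; only the i-vector part of the input changes.
out_a = [forward(f, layers) for f in augment_with_ivector(frames, speaker_a)]
out_b = [forward(f, layers) for f in augment_with_ivector(frames, speaker_b)]
```

In this sketch the two speakers pass through identical weights, so any difference between `out_a` and `out_b` comes solely from the i-vector portion of the input. A real system would train the weights on multi-speaker data and would use i-vectors extracted from speech rather than hand-written values.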

Published in: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)

Date of Conference: 13-16 December 2016

Date Added to IEEE Xplore: 19 January 2017

DOI: 10.1109/APSIPA.2016.7820818

Conference Location: Jeju, Korea (South)