Concatenative Speech Synthesis Based on the Plural Unit Selection and Fusion Method

Tatsuya MIZUTANI
Takehiko KAGOSHIMA

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E88-D    No.11    pp.2565-2572
Publication Date: 2005/11/01
Online ISSN: 
DOI: 10.1093/ietisy/e88-d.11.2565
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Speech and Hearing
Keyword: 
speech synthesis,  plural unit selection,  unit fusion,  unit training,  sense of stability and sense of voice,  

Full Text: PDF(919.1KB)>>
Buy this Article



Summary: 
This paper proposes a novel speech synthesis method to generate human-like natural speech. The conventional unit-selection-based synthesis method selects speech units from a large database, and concatenates them with or without modifying the prosody to generate synthetic speech. This method features highly human-like voice quality. The method, however, has a problem that a suitable speech unit is not necessarily selected. Since the unsuitable speech unit selection causes discontinuity between the consecutive speech units, the synthesized speech quality deteriorates. It might be considered that the conventional method can attain higher speech quality if the database size increases. However, preparation of a larger database requires a longer recording time. The narrator's voice quality does not remain constant throughout the recording period. This fact deteriorates the database quality, and still leaves the problem of unsuitable selection. We propose the plural unit selection and fusion method which avoids this problem. This method integrates the unit fusion used in the unit-training-based method with the conventional unit-selection-based method. The proposed method selects plural speech units for each segment, fuses the selected speech units for each segment, modifies the prosody of the fused speech units, and concatenates them to generate synthetic speech. This unit fusion creates speech units which are connected to one another with much less voice discontinuity, and realizes high quality speech. A subjective evaluation test showed that the proposed method greatly improves the speech quality compared with the conventional method. Also, it showed that the speech quality of the proposed method is kept high regardless of the database size, from small (10 minutes) to large (40 minutes). The proposed method is a new framework in the sense that it is a hybrid method between the unit-selection-based method and the unit-training-based method. In the framework, the algorithms of the unit selection and the unit fusion are exchangeable for more efficient techniques. Thus, the framework is expected to lead to new synthesis methods.


open access publishing via