
Speech Communication

Volume 57, February 2014, Pages 144-154

Prosodic variation enhancement using unsupervised context labeling for HMM-based expressive speech synthesis

https://doi.org/10.1016/j.specom.2013.09.014

Highlights

  • We propose a technique for adding prosodic variations in HMM-based speech synthesis.

  • Phrase-level contexts are defined that compensate for the lack of prosodic variations.

  • Average prosodic difference between original and synthetic speech is used.

  • Expressive sales talk and fairy tale speech is used for evaluation.

  • Users can manually change phrase-level pitch without degrading naturalness.

Abstract

This paper proposes an unsupervised labeling technique using phrase-level prosodic contexts for HMM-based expressive speech synthesis, which enables users to manually enhance the prosodic variations of synthetic speech without degrading its naturalness. In the proposed technique, HMMs are first trained using conventional labels containing only linguistic information, and prosodic features are generated from the HMMs. The average difference between the original and generated prosodic features for each accent phrase is then calculated and classified into three classes, e.g., low, neutral, and high in the case of fundamental frequency. The created prosodic context labels have a practical meaning, such as the relative pitch (high or low) at the phrase level, and hence users are expected to be able to modify the prosodic characteristics of synthetic speech in an intuitive way by manually changing the proposed labels. In the experiments, we evaluate the proposed technique in both ideal and practical conditions using sales talk and fairy tale speech recorded in a realistic domain. In the evaluation under the practical condition, we assess whether users can achieve their intended prosodic modification by changing the proposed context label of a certain accent phrase in a given sentence.

Introduction

For synthesizing natural-sounding expressive speech, a variety of techniques have been proposed in the last few decades (Erickson, 2005, Schröder, 2009). Although the expressions appearing in our daily speech communication are highly diverse, most studies have focused on a simplified case in which the expressions, including emotions, speaking styles, and emphasis expressions, which we will simply call styles, are restricted to a small number of categories and only a single style appears in each utterance. Two major approaches have been proposed for synthesizing expressive speech in such a simplified case. One is concatenative speech synthesis using unit selection (Iida et al., 2003, Pitrelli et al., 2006), and the other is parametric speech synthesis based on hidden Markov models (HMMs), e.g., style modeling (Yamagishi et al., 2003). When a target style consistently appears in the speech data, we can generate synthetic speech with that style using these approaches. In addition, we can reproduce locally appearing styles, such as emphasis expressions, in the synthetic speech by taking the locations of the styles into account when training the synthesis system. The locations are specified using recording scripts (Strom et al., 2007, Morizane et al., 2009) or subjective listening tests (Yu et al., 2010). However, these types of simplification sometimes narrow the variation of acoustic features in the recorded speech and make the synthetic speech less expressive.

Recently, studies on synthesizing expressive speech using speech corpora recorded in more realistic situations, such as storytelling and sales talk, have been conducted to relax these restrictions (Prahallad et al., 2007, Nakajima and Sagisaka, 2009, Braunschweiler et al., 2010, Doukhan et al., 2011). Unlike the previous studies in the simplified case, only a certain domain is specified, and no directions are given during the recording as to which portions should be emphasized or which types of emotional expression should be used. Although we can reflect the expressions included in the recorded speech by manually annotating it in a manner similar to that used for spontaneous speech (Campbell, 2005), annotation is time-consuming and tends to be expensive. In addition, even if the cost is acceptable, we face the difficulty that consistent annotation of styles, such as emotional expressions, is not always possible. For example, in the case of emphasis expressions, there is not always consistent agreement on which words/phrases are emphasized by the speaker. Similarly, the verbalization of emotions or speaking styles appearing in expressive speech is not unique and varies from annotator to annotator. As a result, the inter-rater agreement is not always sufficient and the annotation results are sometimes unreliable. Although there have been HMM-based approaches that use listening test results as contextual factors (Tsuzuki et al., 2004) or as explanatory variables in a multiple-regression model (Nose and Kobayashi, 2013), the style expressivity of the synthetic speech still depends on the listeners.

To avoid the manual annotation and categorization of styles, expressive speech synthesis techniques using unsupervised clustering of styles have been examined (Székely et al., 2011, Eyben et al., 2012). Székely et al. (2011) applied self-organizing feature maps to voice quality parameters for clustering audiobook speech data in unit-selection synthesis. Eyben et al. (2012) used hierarchical k-means clustering in HMM-based synthesis and proposed two methods to add the style characteristics of the clustered speech to the synthetic speech. One is to take the cluster questions into account in the decision tree construction, and the other is cluster adaptation using linear transforms. By using these unsupervised clustering techniques, clustered units or HMMs for synthesis are obtained without any manual or subjective annotation, and prosodic variations are enhanced in the synthetic speech. However, one problem with these unsupervised clustering techniques is that the resulting clusters do not always have explicit physical or para-linguistic meanings; hence, users have difficulty choosing an appropriate cluster to produce the desired expressive speech in the synthesis phase.

In this paper, we propose an alternative technique for enhancing the prosodic variations of synthetic expressive speech without manual annotation of style information for the model training. In the proposed technique, the para-linguistic context labels are not predicted from the input text but are directly specified by the user. This type of approach can be used to reduce the cost of recording a large amount of human speech for applications such as audiobooks. To avoid manual annotation for the model training, we introduce novel phrase-level prosodic contexts defined by the average difference in prosodic features between the original and synthetic speech of the training sentences. Although we proposed a similar technique for unsupervised emphasis/non-emphasis labeling in our previous study (Maeno et al., 2011), its effectiveness was rather limited. This is because the technique requires a pre-determined classification threshold, which is difficult to optimize for different target speakers and styles. Another problem is that emphasis/non-emphasis labels are not always sufficient to represent the wide variety of prosodic variations in expressive speech. For example, when a user synthesizes speech and feels that a certain phrase should have lower pitch, the technique cannot meet this request.

To overcome these problems, we define phrase-level prosodic contexts with a three-class categorization, e.g., low, neutral, and high in the case of fundamental frequency (F0), instead of an emphasis/non-emphasis classification. These contexts enable users to change the prosodic characteristics of each phrase more precisely. For example, users can modify the pitch of synthetic speech with a more intuitive direction such as “with higher/lower pitch at this accent phrase.” Another advantage is that the optimal threshold for classification is automatically determined for each training data set with different speakers and styles. Although prosodic features can also be modified by simply applying a rule-based conversion, modifying the phrase-level prosody while maintaining the naturalness of the synthetic speech is not always easy within a heuristic rule-based framework. By contrast, in the proposed technique, the prosodic variations appearing in the training data are statistically and automatically modeled using context-dependent HMMs in which the prosodic contexts are taken into account during model training.
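As a rough illustration of this idea, the following Python sketch computes the average log-F0 difference between the original and generated speech of an accent phrase and maps it to one of the three classes. The function name, the symmetric use of a single threshold, and the toy data are assumptions made for illustration only; the actual classification procedure and threshold selection are defined in Section 3.2.

```python
import numpy as np

def f0_context_label(orig_logf0, gen_logf0, threshold):
    """Assign a phrase-level F0 context ("low", "neutral", or "high") from the
    average log-F0 difference between original and generated speech.

    orig_logf0, gen_logf0: log-F0 values of the voiced frames of one accent
    phrase (unvoiced frames already removed); threshold: classification
    threshold, optimized per corpus as described in Section 3.2.
    """
    # Average difference of original minus generated log F0 over the phrase.
    d = float(np.mean(orig_logf0) - np.mean(gen_logf0))
    if d > threshold:
        return "high"     # original pitch clearly higher than synthetic
    if d < -threshold:
        return "low"      # original pitch clearly lower than synthetic
    return "neutral"      # difference within the threshold band

# Illustrative usage with hypothetical per-phrase voiced-frame F0 values (Hz):
phrases = [
    (np.log([220.0, 230.0, 225.0]), np.log([200.0, 205.0, 202.0])),
    (np.log([180.0, 178.0, 182.0]), np.log([181.0, 179.0, 183.0])),
]
labels = [f0_context_label(o, g, threshold=0.05) for o, g in phrases]
print(labels)  # ['high', 'neutral']
```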

There have been similar studies on labeling speech segments with linguistic and para-linguistic information using differences in prosodic features between generated and original speech. Vainio et al. (2005) proposed a prosodic tagging technique in which accent/prominence intensity in Finnish speech was classified into four levels using a multiple linear regression model. However, this labeling system requires manually labeled training data, which is not required in our unsupervised labeling approach. To reduce the manual labeling cost, a “less supervised” approach (Suni et al., 2012) was also proposed for the annotation of phrase-level style and word-level prominence information. In that technique, the differences in prosodic features were summed using empirical weights and then classified into several levels. In contrast, our approach has the advantage that the classification threshold for F0 differences is automatically optimized for the target expressive corpus without any heuristics. In addition, the labels created by the proposed technique have an explicit meaning, such as the relative pitch (high or low) at the phrase level, and hence users are expected to be able to modify the prosodic characteristics of synthetic speech in an intuitive manner by manually changing the label values.

Section snippets

Database information

In this study, we use the Japanese expressive speech data described in the literature (Nakajima et al., 2010). We use two types of speech data: appealing speech in sales talk and fairy tale speech in storytelling. These speech data were recorded in a realistic setting where no specific speech styles were prescribed and only the target domain (situation) was given to the speakers. For this purpose, domain-specific sentences were used in the recording. There are three female professional speakers

Unsupervised context labeling for local prosodic variations

Fig. 3 shows an example of F0 contours of original and synthetic samples for appealing speech of speaker #1. The F0 contour of the synthetic speech was generated from HMMs trained using context-dependent labels in which only the linguistic information was taken into account. The first accent phrase in the original speech has a clearly higher average F0 than that in the second phrase. By contrast, such variation is not reproduced in the synthetic speech. This means that the contextual factors of

Prosodic variation enhancement in speech synthesis phase

A problem with the proposed technique is that we cannot automatically obtain the proposed prosodic context labels from the input text in the speech synthesis phase, since the labels are determined from the original expressive speech utterances. This type of problem with data-driven expressive speech classification often remains unsolved in speech synthesis studies and is left for future work (Eyben et al., 2012, Székely et al., 2011). In this paper, we provide a practical
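As a minimal sketch of the intended workflow, the snippet below shows how a user might attach phrase-level F0 contexts to the linguistic labels at synthesis time, overriding only the phrases they want to change; the label structure, field names, and the "neutral" default are hypothetical and not the paper's actual label format.

```python
def build_labels(accent_phrases, user_f0_contexts=None):
    """Attach a phrase-level F0 context to each accent phrase.

    accent_phrases: linguistic label sequences, one per accent phrase.
    user_f0_contexts: optional dict {phrase_index: "low" | "neutral" | "high"}
    supplied by the user at synthesis time; unspecified phrases default to
    "neutral".
    """
    user_f0_contexts = user_f0_contexts or {}
    labeled = []
    for i, phrase in enumerate(accent_phrases):
        labeled.append({"linguistic": phrase,
                        "f0_context": user_f0_contexts.get(i, "neutral")})
    return labeled

# Example: raise the pitch of the second accent phrase of a sentence.
labels = build_labels(["phrase0-ling-ctx", "phrase1-ling-ctx", "phrase2-ling-ctx"],
                      user_f0_contexts={1: "high"})
```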

Experimental conditions

In the following experiments, we used the appealing and fairy tale speech data described in Section 2.1. We conducted four-fold cross-validation for each speech data set and compared the following two techniques.

  • CONVENTIONAL:

    the conventional technique using only the linguistic contexts

  • PROPOSED:

    the proposed technique using both linguistic and F0 contexts

Speech signals were sampled at a rate of 16 kHz and the frame shift was 5 ms. We used STRAIGHT (Kawahara et al., 1999) for speech feature extraction and extracted spectral
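For readers who want to reproduce a comparable front end, the sketch below extracts F0, spectral envelope, and aperiodicity at a 5 ms frame shift with the freely available WORLD vocoder (pyworld); the paper itself uses STRAIGHT, so this is only a rough stand-in, and the file name is a placeholder.

```python
import numpy as np
import pyworld as pw   # WORLD vocoder, used here in place of STRAIGHT
import soundfile as sf

x, fs = sf.read("utterance.wav")        # 16 kHz speech, as in the paper
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pw.harvest(x, fs, frame_period=5.0)   # F0 contour, 5 ms frame shift
sp = pw.cheaptrick(x, f0, t, fs)              # spectral envelope
ap = pw.d4c(x, f0, t, fs)                     # aperiodicity
```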

Choice of optimal thresholds in model training

First, we examined how the optimal thresholds were determined for the respective prosodic contexts in the model training. We determined optimal thresholds for the F0, duration, and power contexts, respectively, using the algorithm mentioned in Section 3.2. In the first iteration, we set α = αs, and in the nth iteration, we set α as follows: α = αs + (n − 1)·Δα (α ≤ αe), where Δα is the increment at each iteration. For each prosodic feature, αs was fixed to zero, and αe was set on the basis of the maximum value of d
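The candidate-threshold search can be sketched as a simple grid over α = αs + (n − 1)·Δα up to αe, as below; the criterion used to score each candidate is the one defined in Section 3.2, which is not shown in this snippet, so it appears here only as a placeholder function passed in by the caller.

```python
def candidate_thresholds(alpha_s, alpha_e, delta):
    """Enumerate alpha = alpha_s + (n - 1) * delta for n = 1, 2, ..., alpha <= alpha_e."""
    alphas, n, alpha = [], 1, alpha_s
    while alpha <= alpha_e:
        alphas.append(alpha)
        n += 1
        alpha = alpha_s + (n - 1) * delta
    return alphas

def select_threshold(phrase_diffs, alpha_s, alpha_e, delta, score_fn):
    """Return the candidate threshold that maximizes score_fn.

    phrase_diffs: per-phrase average prosodic differences from the training data.
    score_fn: placeholder for the selection criterion of Section 3.2.
    """
    best_alpha, best_score = None, float("-inf")
    for alpha in candidate_thresholds(alpha_s, alpha_e, delta):
        score = score_fn(phrase_diffs, alpha)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```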

Evaluation in ideal condition

Before evaluating the proposed technique in a practical situation, we investigated its potential under an ideal condition where the F0 context labels were created in the same manner as the labeling of the training data. Specifically, we temporarily generated the speech parameters for the test sentences and calculated the differences in the average values of log F0 for each accent phrase between the synthetic and original speech samples. Then, we created F0 context labels by

Evaluation in practical application

Through the experiments in Section 7, we showed that the proposed technique can generate F0 contours closer to those of the original speech than the conventional technique when “appropriate” F0 context labels are used. In Section 7, we obtained such ideal labels by applying the proposed labeling technique to the original speech of the test utterances as well as the training utterances. However, this assumption is obviously not feasible in practical applications. Hence, in this section, we evaluated the

Conclusions and future work

We proposed an unsupervised labeling technique for phrase-level prosodic variations. The technique can be used to enhance HMM-based expressive speech synthesis. The additional prosodic contexts are automatically determined at the phrase level by classifying the average difference in prosodic features between the original and synthetic speech of the training data. From the experiments on prosodic context labeling, we found that the variations in the F0 feature of our expressive

References (26)

  • Morizane, K., Nakamura, K., Toda, T., Saruwatari, H., Shikano, K., 2009. Emphasized speech synthesis based on hidden...
  • Nakajima, H., Miyazaki, N., Yoshida, A., Nakamura, T., Mizuno, H., 2010. Creation and analysis of a Japanese speaking...
  • Nakajima, H., Sagisaka, Y., 2009. F0 analysis for Japanese conversational speech synthesis. In: Proc. SNLP’09, pp....