Speech Communication

Volume 53, Issue 7, September 2011, Pages 973-985

Speaker-independent HMM-based voice conversion using adaptive quantization of the fundamental frequency

https://doi.org/10.1016/j.specom.2011.05.001

Abstract

This paper describes a speaker-independent HMM-based voice conversion technique that incorporates context-dependent prosodic symbols obtained using adaptive quantization of the fundamental frequency (F0). In the HMM-based conversion of our previous study, the input utterance of a source speaker is decoded into phonetic and prosodic symbol sequences, and the converted speech is generated from the decoded information using the pre-trained target speaker’s phonetically and prosodically context-dependent HMM. In that work, we generated the F0 symbols by quantizing the average log F0 value of each phone using the global mean and variance calculated from the training data. In the current study, these statistical parameters are obtained from each utterance itself, and this adaptive method improves the F0 conversion performance over the conventional one. We also introduce a speaker-independent model for decoding the input speech and model adaptation for training the target speaker’s model, in order to reduce the required amount of training data under the condition that a phonetic transcription is available for the input speech. Objective and subjective experimental results for Japanese speech demonstrate that the adaptive quantization method gives better F0 conversion performance than the conventional one. Moreover, our technique with only ten sentences of the target speaker’s adaptation data outperforms the conventional GMM-based technique trained on parallel data of 200 sentences.

Highlights

• We propose an SI-HMM-based voice conversion using adaptive F0 quantization.
• Adaptive F0 quantization improved F0 conversion performance.
• SI-HMM-based conversion needs no training data of the source speaker.
• We can also significantly reduce the amount of the target speaker’s training data.

Introduction

Speaker individuality plays an important role as a non-linguistic cue in speech communication. For instance, we often feel comforted by the voice of a family member or lover and pay attention to the speech of someone who is well known to us. These phenomena stem from our identification of the speaker’s voice individuality, and such individuality gives additional value or special meaning to the speech. From this perspective, voice conversion is an attractive tool which enables us to convert the speaker individuality of given speech to that of a different speaker without changing the linguistic information. Voice conversion has potential for various applications such as voice dubbing (Turk and Arslan, 2002), rap singing voice (Turk et al., 2009), and foreign language education (Mashimo et al., 2002).

The basic problem of how to convert the speech signals of a source speaker so as to be closer to those of a target speaker can be decomposed into the conversion of spectral and prosodic features, including the pitch and rhythm of speech. Of these features, the spectral envelope is represented by a high-dimensional vector and is not easy to imitate for arbitrary speakers. Hence, many researchers have focused in particular on spectral feature conversion and have proposed a variety of approaches, e.g., artificial neural networks (ANN), Gaussian mixture models (GMM), and hidden Markov models (HMM) (Stylianou, 2009).

Of these approaches, GMM-based statistical spectral conversion (Stylianou et al., 1998, Kain and Macon, 1998) is one of the typical and efficient techniques. The advantage of the GMM-based technique is that it enables continuous mapping of acoustic features between speakers on the basis of soft-clustered conversion functions. Moreover, the problems of local discontinuity and the over-smoothing effect in the spectral sequence have recently been alleviated by incorporating dynamic features and global variance (GV) (Toda et al., 2007), and this has led to a great improvement in the quality of the converted speech. In addition, the simultaneous conversion of spectral and F0 features using a multi-space probability distribution (MSD) model was proposed to improve the prosody conversion performance (Yutani et al., 2009). However, a limitation remains: only the relationship over a limited number of neighboring frames of the pre-aligned source and target speakers’ joint feature vectors is modeled, and it is difficult to convert the dynamic characteristics of speaker individuality appearing at the segmental or supra-segmental level, such as within or between phones. Consequently, the conversion performance is not always satisfactory and depends strongly on the combination of the source and target speakers.
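For concreteness, the GMM mapping function of Stylianou et al. (1998) and Kain and Macon (1998) converts each source feature vector x frame by frame as

\hat{y} = F(x) = \sum_{i=1}^{M} P(c_i \mid x)\left[\mu_i^{(y)} + \Sigma_i^{(yx)}\left(\Sigma_i^{(xx)}\right)^{-1}\left(x - \mu_i^{(x)}\right)\right],

where P(c_i | x) is the posterior probability of the i-th mixture component given x, and the means and covariances are those of the jointly trained source-target Gaussian components. Since each frame is mapped independently through this function (or with only a few neighboring frames when dynamic features are appended), supra-segmental dynamics are not captured, which is exactly the limitation noted above.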

To address this issue, segment-based voice conversion using unit selection is a promising approach in which the dynamic characteristics of speaker individuality are converted as well as the static ones (Abe, 1991). In that technique, phone units are used as the segments, and a mapping table is generated between the triphones of the source and target speakers. Although this unit-selection-based approach gives satisfactory quality to the converted speech when a large amount of parallel training data is available, the conversion performance degrades substantially if the available training data is limited. Recently, alternative segment-based techniques using probabilistic modeling with GMMs or HMMs have been proposed (Verma and Kumar, 2005, Ye and Young, 2006). However, these techniques focus on conversion without parallel training data of the source and target speakers and have not outperformed GMM-based frame-by-frame mapping with parallel corpora.

To overcome the above problem, we have proposed a segment-based voice conversion technique using phonetically and prosodically context-dependent HMMs (Nose et al., 2010b). The basic idea comes from the HMM-based phonetic vocoder (Tokuda et al., 1998), which was proposed for very low bit-rate speech coding. In the segment vocoder, a phoneme decoder extracts a phoneme sequence with durations from the spectral features of the input speech, and then the HMM-based speech synthesis framework (Yoshimura et al., 1999) is used to generate a spectral feature trajectory. This technique is easily applicable to voice conversion by replacing the acoustic model for synthesis with that of another speaker, and it can convert the spectral feature at the segmental level. By means of HMM-based statistical parametric speech synthesis, a smooth speech feature trajectory and stable synthetic speech can be generated with a relatively small amount of training data compared to unit-selection-based synthesis. In the HMM-based voice conversion, to convert not only the spectral feature but also the prosodic features, the spectral and prosodic features are simultaneously modeled and converted using MSD-HMMs (Tokuda et al., 2002), where quantized fundamental frequency (F0) symbols are used as the prosodic context (Nose et al., 2010a). Although an MSD-HMM is also used in (Yutani et al., 2009), their technique uses frame-based joint features and has not outperformed the GMM-based technique.
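To make the decode-then-synthesize flow concrete, the toy Python sketch below regenerates an output trajectory purely from the decoded (phone, F0 symbol, duration) sequence and per-context statistics of the target speaker. All data structures here are illustrative stand-ins for the context-dependent MSD-HMMs of the actual system, not the authors' implementation.

# Toy sketch of the decode-then-synthesize idea: the converted speech
# depends only on the decoded symbol sequence and the TARGET speaker's
# model, not on any frame-by-frame source-target mapping. The dictionary
# stands in for the target speaker's HMM state output distributions.
import numpy as np

target_model = {
    # (phone, F0 symbol) -> (mean spectral value, mean log F0); toy values
    ("a", 2): (1.0, 5.2),
    ("a", 3): (1.1, 5.4),
    ("i", 2): (0.6, 5.3),
}

def synthesize(segments):
    """Generate a frame-level (spectrum, log F0) trajectory from decoded
    (phone, F0 symbol, duration) segments using target-speaker statistics."""
    frames = []
    for phone, f0_sym, dur in segments:
        spec, lf0 = target_model[(phone, f0_sym)]
        frames.extend([(spec, lf0)] * dur)  # piecewise-constant toy trajectory
    return np.array(frames)

# A decoded symbol sequence for one utterance, as the phoneme decoder would emit
decoded = [("a", 3, 10), ("i", 2, 8), ("a", 2, 12)]
print(synthesize(decoded).shape)  # (30, 2)

In the actual system, the piecewise-constant state means would be smoothed by parameter generation with dynamic features, yielding the smooth trajectory mentioned above.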

In this paper, we describe a novel HMM-based voice conversion technique which enables voice conversion between arbitrary speakers at low cost. A sufficient amount of training data is necessary in conventional speaker-dependent conversion. To reduce the amount of training data, we propose a speaker-independent approach (Nose and Kobayashi, 2010b) by introducing a speaker-independent model and a speaker adaptation technique, which are used for model training in the decoding and synthesis processes. We also propose utterance-adaptive F0 quantization (Nose and Kobayashi, 2010a) to alleviate the mismatch occurring in the quantization of F0 values for an input utterance.

There are two major contributions in the proposed voice conversion technique. The first is the improvement in F0 conversion accuracy brought about by the adaptive F0 quantization. The adaptive F0 quantization is shown to be effective in conventional speaker-dependent voice conversion and also plays a primary role in the speaker-independent one. F0 conversion is particularly important in cross-gender voice conversion, where the source and target speakers’ prosodic characteristics are quite different. The second contribution is the speaker-independent voice conversion, which requires only a small amount of a target speaker’s data. By using a correct phonetic transcription for the input speech in the voice conversion process, the speaker-independent HMM-based technique is shown to significantly outperform the GMM-based conversion even though no training data from the source speaker is required. This would reduce the user’s burden in practical applications. Typical examples are voice-over and voice dubbing, where phonetic transcriptions are generally available in advance from the script.

Section snippets

Baseline system

As mentioned above, it is difficult to convert supra-frame speaker characteristics in frame-based mapping with joint feature vectors of source and target speakers. To appropriately convert the speaker individuality, a segment-based conversion with phonetic and prosodic context, e.g., phoneme and accent, stress, or tone, is indispensable, and context-dependent HMM-based voice conversion is an effective approach. Before describing the technique proposed in this paper, we explain our …
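As an illustration of combining phonetic and prosodic contexts, a label for each phone can pair its identity with the quantized F0 symbols of itself and its neighbors. The label format below is hypothetical, meant only to convey the idea; the actual context set of the baseline system is richer.

# Hypothetical context label combining phone identity with quantized F0
# symbols of the previous, current, and next phones. Illustrative only;
# not the paper's exact context set.
def context_label(prev, cur, nxt):
    (p_ph, p_f0), (c_ph, c_f0), (n_ph, n_f0) = prev, cur, nxt
    return f"{p_ph}^{p_f0}-{c_ph}^{c_f0}+{n_ph}^{n_f0}"

print(context_label(("sil", 0), ("a", 3), ("i", 2)))  # sil^0-a^3+i^2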

Proposed system

The conventional speaker-dependent HMM-based technique cannot consistently achieve acceptable speech quality and speaker similarity from only a few minutes of training data. On the other hand, in practical applications, the source speaker is often a user of the voice conversion system, and it is preferable that the amount of the user’s pre-recorded speech be as small as possible. Besides this limitation, the conventional technique has another problem regarding an assumption in F0 quantization. In …
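Although this section is truncated in this preview, the abstract already specifies the key operation: the average log F0 of each phone is quantized using the mean and variance of the current utterance rather than global training statistics. A minimal sketch follows; the number of levels and the level spacing are illustrative assumptions, not the paper’s settings.

# Minimal sketch of adaptive F0 quantization: each phone's average log F0
# is mapped to a discrete symbol using statistics of the CURRENT utterance
# (the conventional method would use global training-data statistics).
import numpy as np

def quantize_f0(phone_lf0_means, n_levels=4, step=1.0):
    """Map each phone's average log F0 to one of n_levels symbols.

    phone_lf0_means: average log F0 of each (voiced) phone in the utterance.
    Returns one integer symbol in [0, n_levels - 1] per phone.
    """
    lf0 = np.asarray(phone_lf0_means, dtype=float)
    mean, std = lf0.mean(), lf0.std()          # utterance-level statistics
    z = (lf0 - mean) / (step * std + 1e-12)    # normalize within the utterance
    # Uniform quantization of the normalized values around the mean.
    symbols = np.round(z + (n_levels - 1) / 2.0)
    return np.clip(symbols, 0, n_levels - 1).astype(int)

# Example: four phones whose average log F0 rises and then falls.
print(quantize_f0([4.9, 5.1, 5.3, 5.0]))  # [0 2 3 1]

Because the statistics are recomputed per utterance, the symbols describe the relative F0 movement within the utterance, which avoids the mismatch that arises when an input utterance’s F0 range differs from the training data’s global range.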

Evaluation of adaptive F0 quantization

In this section, we evaluated the performance of the adaptive F0 quantization using speaker-dependent models to reveal the intrinsic nature of the F0 quantization approach.
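A typical objective measure for such an evaluation is the root-mean-square error of log F0 over frames where both tracks are voiced. The sketch below, expressed in cents, is one common formulation, not necessarily the paper’s exact measure.

# RMSE between converted and target log F0 tracks in cents, computed over
# frames that are voiced in both tracks (NaN marks unvoiced frames).
import numpy as np

def lf0_rmse_cent(lf0_converted, lf0_target):
    a, b = np.asarray(lf0_converted), np.asarray(lf0_target)
    voiced = ~np.isnan(a) & ~np.isnan(b)   # evaluate co-voiced frames only
    diff = a[voiced] - b[voiced]           # difference in natural-log F0
    return 1200.0 * np.sqrt(np.mean(diff ** 2)) / np.log(2.0)

# Example with two short tracks (NaN = unvoiced frame)
conv = np.array([5.0, 5.1, np.nan, 5.2])
targ = np.array([5.05, 5.0, 5.1, np.nan])
print(f"{lf0_rmse_cent(conv, targ):.1f} cents")  # ~136.9 cents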

Evaluation of speaker-independent HMM-based voice conversion

In this section, we evaluated the performance of our speaker-independent HMM-based technique under the following conditions: (1) The source speaker’s training data is not available but the amount of training data for the target speaker is sufficient. (2) The source speaker’s training data is not available and the amount of training data for the target speaker is less than a few minutes.

Labeling cost

In our voice conversion technique, we must label the training and input speech data with phonetic and prosodic contexts. Throughout the model training and voice conversion processes, we do not need manual prosodic labeling of F0 symbols because these symbols are automatically determined by the F0 quantization described in Section 3.2. For the training of the speaker-independent model, phonetic transcriptions with phone boundaries are used. An automatic labeling by phoneme alignment using a …

Conclusion

We have proposed a context-dependent HMM-based voice conversion technique in which adaptive F0 quantization is used to create prosodic contexts. The prosodic contexts are automatically generated by quantizing the average log F0 value of each phone using the mean and variance calculated utterance by utterance. Objective and subjective evaluation results showed the advantage of the proposed technique over our conventional technique based on global F0 quantization in the F0 conversion …

References (30)

  • Abe, M., 1991. A segment-based approach to voice conversion. In: Proc. ICASSP 91, pp. ...
  • Gauvain, J.L., et al., 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process.
  • Imai, S., et al., 1983. Mel log spectrum approximation (MLSA) filter for speech synthesis. IECE Trans. A (Japanese Edition).
  • Kain, A., Macon, M., 1998. Spectral voice conversion for text-to-speech synthesis. In: Proc. ICASSP, pp. ...
  • Kawahara, H., et al., 1999. Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Comm.
  • Mashimo, M., Toda, T., Kawanami, H., Kashioka, H., Shikano, K., Campbell, N., 2002. Evaluation of cross-language voice ...
  • Nakano, Y., Tachibana, M., Yamagishi, J., Kobayashi, T., 2006. Constrained structural maximum a posteriori linear ...
  • Nose, T., Kobayashi, T., 2010a. HMM-based robust voice conversion using adaptive F0 quantization. In: Proc. 7th ISCA ...
  • Nose, T., Kobayashi, T., 2010b. Speaker-independent HMM-based voice conversion using quantized fundamental frequency. ...
  • Nose, T., Ooki, K., Kobayashi, T., 2010a. HMM-based speech synthesis with unsupervised labeling of accentual context ...
  • Nose, T., et al., 2010. HMM-based voice conversion using quantized F0 context. IEICE Trans. Inf. Systems.
  • Rabiner, L., et al., 1993. Fundamentals of Speech Recognition.
  • Shinoda, K., et al., 2000. MDL-based context-dependent subword modeling for speech recognition. J. Acoust. Soc. Jpn. (E).
  • Stylianou, Y., 2009. Voice transformation: a survey. In: Proc. ICASSP 2009, pp. ...
  • Stylianou, Y., et al., 1998. Continuous probabilistic transform for voice conversion. IEEE Trans. Speech Audio Process.