
Speech Communication

Volume 106, January 2019, Pages 57-67

Voice conversion with SI-DNN and KL divergence based mapping without parallel training data

https://doi.org/10.1016/j.specom.2018.11.007

Abstract

We propose a Speaker Independent Deep Neural Net (SI-DNN) and Kullback-Leibler Divergence (KLD) based mapping approach to voice conversion without using parallel training data. The acoustic difference between source and target speakers is equalized with the SI-DNN via its estimated output posteriors, which serve as a probabilistic mapping from acoustic input frames to the corresponding symbols in the phonetic space. KLD is chosen as an ideal distortion measure to find an appropriate mapping from each input frame of the source speaker to that of the target speaker. The mapped acoustic segments of the target speaker form the construction bases for voice conversion. With or without word transcriptions of the target speaker’s training data, the approach can be either supervised or unsupervised. In the supervised mode, where adequate training data are available to train a conventional statistical parametric TTS for the target speaker, each input frame of the source speaker is converted to its nearest sub-phonemic “senone”. In the unsupervised mode, the frame is converted to the nearest clustered phonetic centroid or a raw speech frame, in the minimum KLD sense. The acoustic trajectory of the converted voice is rendered with the maximum probability trajectory generation algorithm. Both objective and subjective measures used for evaluating voice conversion performance show that the new algorithm outperforms the sequence error minimization based DNN baseline trained with parallel data.

Introduction

Voice conversion is a speech processing technique which modifies the non-linguistic information of speech while keeping the linguistic information intact (Mohammadi and Kain, 2017). There are many applications of voice conversion, such as personalized text-to-speech (TTS) systems trained with only limited data from a target speaker (Kain and Macon, 1998), speech conversion from narrow-band to wide-band (Park and Kim, 2000; Seltzer et al., 2005), acoustic-to-articulatory inversion mapping (Richmond et al., 2003; Toda et al., 2008), speech enhancement (Mouchtaris et al., 2007) and body-transmitted speech enhancement (Toda et al., 2009).

A typical voice conversion application is speaker conversion, in which speech from a source speaker is converted to sound like that of a target speaker without changing its word content. In this paper we concentrate on the spectral, or voice timbre, conversion. Conventional speaker conversion approaches (Toda et al., 2009; Stylianou et al., 1998; Toda et al., 2007; Narendranath et al., 1995; Desai et al., 2010; Xie et al., 2014; Chen et al., 2013; Wu et al., 2013; Valbret et al., 1992; Abe et al., 1988; Wu et al., 2014; Ming et al., 2016; Sun et al., 2016) usually need parallel data from both source and target speakers. The parallel data are first aligned by dynamic programming, and a mapping function is then trained to convert speech from the source to the target speaker. With parallel speech data available, the joint density Gaussian mixture model (JD-GMM) (Toda et al., 2007) and neural network (NN) based mappings (Desai et al., 2010) are the most widely used. Although the JD-GMM can effectively convert source speech to target speech with decent quality, over-smoothing, caused by the statistical averaging in estimating the mean and covariance of the Gaussian components, typically makes the converted speech muffled and degrades speaker similarity. To address the over-smoothing issue, Kobayashi et al. (2016) proposed a direct waveform modification technique based on spectral differential filtering, in which the spectral difference is modeled with a differential Gaussian mixture model (DIFFGMM). In the NN based approach (Desai et al., 2010; Xie et al., 2014; Chen et al., 2013; Sun et al., 2016), the conditional probability which converts source speech to target speech is trained. Because the NN conversion function can model non-linear mappings, it has the potential to outperform the GMM based approach (Desai et al., 2010). To further improve NN based voice conversion, sequence error minimization (SEM) training (Xie et al., 2014) was proposed to resolve the intrinsic mismatch in speech generation between training and test. A recurrent neural network with bidirectional long short-term memory (BLSTM) has also been shown to outperform a conventional feed-forward neural network (Sun et al., 2016a). In Chen et al. (2014), a DNN is generatively trained by cascading two restricted Boltzmann machines (RBMs), which model the distribution of spectral envelopes, with a Bernoulli bidirectional associative memory (BAM), exploiting the powerful modeling capabilities of the RBM and BAM. Exemplar-based sparse representation for voice conversion is investigated in Wu et al. (2014) and Ming et al. (2016): the magnitude spectrum is modeled as a linear combination of a set of basis spectra, or exemplars, via non-negative matrix factorization, and the exemplars are then used for conversion.
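
As a rough sketch of this conventional parallel-data pipeline (an illustration, not any specific published system), the Python fragment below assumes hypothetical mel-cepstral feature matrices src_mcep and tgt_mcep extracted from one parallel utterance pair, aligns them with dynamic time warping, and fits a small feed-forward network as the frame-wise mapping function.

```python
# Sketch of the conventional parallel-data pipeline: align source/target
# frames with dynamic programming (DTW), then train a frame-wise NN mapping.
# src_mcep and tgt_mcep are hypothetical (T_src x D) / (T_tgt x D) MCEP matrices.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neural_network import MLPRegressor

def dtw_path(x, y):
    """Return frame index pairs aligning x to y with standard DTW."""
    dist = cdist(x, y)                       # local Euclidean distances
    T, U = dist.shape
    acc = np.full((T + 1, U + 1), np.inf)
    acc[0, 0] = 0.0
    for t in range(1, T + 1):
        for u in range(1, U + 1):
            acc[t, u] = dist[t - 1, u - 1] + min(acc[t - 1, u],      # insertion
                                                 acc[t, u - 1],      # deletion
                                                 acc[t - 1, u - 1])  # match
    # Backtrack from (T, U) to (1, 1) along minimum-cost predecessors.
    path, t, u = [], T, U
    while t > 0 and u > 0:
        path.append((t - 1, u - 1))
        step = np.argmin([acc[t - 1, u - 1], acc[t - 1, u], acc[t, u - 1]])
        if step == 0:
            t, u = t - 1, u - 1
        elif step == 1:
            t -= 1
        else:
            u -= 1
    return path[::-1]

def train_parallel_mapping(src_mcep, tgt_mcep):
    path = dtw_path(src_mcep, tgt_mcep)
    X = np.stack([src_mcep[i] for i, _ in path])
    Y = np.stack([tgt_mcep[j] for _, j in path])
    # A small feed-forward regressor stands in for the NN mapping function.
    nn = MLPRegressor(hidden_layer_sizes=(512, 512), max_iter=200)
    nn.fit(X, Y)
    return nn        # nn.predict(new_src_frames) yields converted frames
```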

However, the precondition of having parallel data is inconvenient. At least 30–50 parallel training utterances are needed to train a decent GMM or NN based voice conversion system when both speech naturalness and speaker similarity are considered. There are some approaches which do not require parallel training data. In Sundermann et al. (2004), acoustic clusters are first constructed for the source and target speakers’ acoustic features, respectively, and mappings between them are established. In Sundermann et al. (2006), a unit selection based method is used to select the acoustically “nearest” target frame while taking acoustic continuity into account. Some approaches try to find alignments between the source and target speakers’ non-parallel data. For example, in Ye and Young (2004) a speech recognizer is used to index each frame of the source and target speakers’ speech with the corresponding state labels, and labeled subsequences are extracted from the set of target sequences to match the given source state-indexed sequences. Parallel data between source and target speakers are thus constructed, and conventional linear parameter transformation training can be applied. Erro et al. (2010) proposed an iterative alignment method that allows pairing phonetically equivalent acoustic vectors from non-parallel utterances of different speakers, in the same or different languages. In Tao et al. (2010), a supervised self-organizing learning algorithm with a phonetic restriction is proposed to improve the alignment iteratively. In Benisty et al. (2014), the training stage is formulated as a joint-cost minimization problem that considers both context-based alignment and the conversion function. There are also adaptation techniques for voice conversion with non-parallel training data (Mouchtaris et al., 2006; Lee and Wu, 2006). In Mouchtaris et al. (2006), a JD-GMM is trained on a pre-defined set of source and target speakers that have parallel recordings; to build the mapping function from non-parallel recordings, the means and covariances of the GMMs are adapted to the new source and target speakers. However, none of the above approaches without parallel training data achieve as good performance as GMM or NN based voice conversion trained on parallel data.

Deep neural networks (DNNs) have been successfully applied to speech recognition (Hinton et al., 2012) in recent years and have significantly improved recognition performance. DNN architectures are compositional models in which each additional layer composes features from the layers below, giving them a powerful capacity to model the complex patterns of speech data. The long window of input frames also allows a DNN to incorporate more temporal and contextual information into the modeling process.

Recently we proposed a voice conversion method for any unknown source speaker, with or without pre-recorded speech training data (Xie et al., 2016b). It is motivated by our cross-lingual TTS work (Xie et al., 2016a), which uses a speaker independent deep neural network (SI-DNN) to equalize speaker differences across languages and the Kullback-Leibler divergence to measure the phonetic distortion between two acoustic segments. An SI-DNN automatic speech recognition (ASR) system is trained, and the corresponding ASR senones are used to represent the whole, speaker-independent phonetic space. Speaker differences can then be equalized with the SI-DNN in the ASR acoustic-phonetic space. Sun proposed a similar approach (Sun et al., 2016b). The main differences between this paper and Sun et al. (2016b) are: 1) we use a senone based SI-DNN while the authors in Sun et al. (2016b) use a phoneme based SI-DNN; 2) we use acoustic unit selection with a minimum KL divergence criterion to generate the final acoustic trajectory while they use an RNN-LSTM (recurrent neural network with long short-term memory) to generate it.
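
To make the frame-to-unit matching concrete, the Python sketch below (an illustration, not the authors' implementation) assumes hypothetical arrays src_post and tgt_post holding SI-DNN senone posteriors for the source frames and for the target speaker's candidate units, and tgt_feat holding the candidates' acoustic features; each source frame is mapped to the candidate with the minimum symmetric KL divergence.

```python
# Minimal sketch of min-KLD frame mapping between SI-DNN posterior vectors.
# src_post: (T_src x S) senone posteriors of the source frames,
# tgt_post: (N x S) posteriors of the target speaker's candidate units,
# tgt_feat: (N x D) acoustic features (e.g., MCEPs) of those candidates.
import numpy as np

def symmetric_kld(p, q, eps=1e-10):
    """Symmetrised KL divergence between two batches of posterior vectors."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    # D(p||q) + D(q||p) = sum_i (p_i - q_i) * (ln p_i - ln q_i),
    # computed for every (source frame, candidate unit) pair via broadcasting.
    return ((p[:, None, :] - q[None, :, :]) *
            (np.log(p)[:, None, :] - np.log(q)[None, :, :])).sum(axis=-1)

def map_frames(src_post, tgt_post, tgt_feat):
    d = symmetric_kld(src_post, tgt_post)   # (T_src x N) distortion matrix
    best = d.argmin(axis=1)                 # nearest target unit per source frame
    return tgt_feat[best]                   # mapped target acoustic segments
```

The mapped target frames would then be smoothed by a trajectory generation step rather than concatenated directly, as described in the following sections.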

In this paper, we extend our previous work (Xie et al., 2016b) on voice conversion without parallel data. The main differences between this paper and Xie et al. (2016b) are that we propose to use raw acoustic frames to facilitate the mapping and a window based trajectory generation method to produce a smooth, non-warbling acoustic trajectory. We compare the proposed approach with previous methods on 12 voice conversion pairs from two speech databases: the CMU ARCTIC database (Kominek and Black, 2003) and the VCC 2016 database (Toda et al., 2016). The rest of the paper is organized as follows. In Section 2 the framework of the KLD-DNN approach to voice conversion is proposed. In Section 3 we present experiments used to evaluate the performance of the proposed methods. Further enhancements of the proposed method are studied in Section 4. Finally, we give our conclusions and discussion in Section 5.

Section snippets

Symmetrised KL divergence

The Kullback-Leibler divergence (Kullback and Leibler, 1951) is a non-symmetric, information-theoretic measure of the difference between two probability distributions, P and Q, which can be discrete or continuous densities. For discrete probability distributions P and Q, the KL divergence of P from Q is defined as

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \ln \frac{P(i)}{Q(i)}$$

For continuous random variables with distributions P and Q, the KL divergence is defined as

$$D_{KL}(P \,\|\, Q) = \int p(x) \ln \frac{p(x)}{q(x)} \, dx$$

where p and q denote the probability density functions of P and Q, respectively.
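
For matching two posterior vectors P and Q, a symmetrised form of the divergence is used; the standard symmetrisation, assumed here since the snippet does not show the exact form, simply sums the two directed divergences:

$$D_{\mathrm{sym}}(P, Q) = D_{KL}(P \,\|\, Q) + D_{KL}(Q \,\|\, P) = \sum_{i} \bigl(P(i) - Q(i)\bigr) \ln \frac{P(i)}{Q(i)}$$

This quantity is symmetric in its arguments and non-negative, vanishing only when P = Q.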

Database

The Wall Street Journal (WSJ) CSR database (Paul and Baker, 1992) is used to train the SI-DNN acoustic model. The training set (SI-284) contains 78 hours of utterances recorded by 284 native American English speakers. Two additional databases are used to construct the voice conversion speaker pairs: 1) the CMU ARCTIC American English database (Kominek and Black, 2003); and 2) the Voice Conversion Challenge (VCC) 2016 database (Toda et al., 2016). The CMU ARCTIC corpus consists of four primary sets of recordings

Enhancements to KLD-DNN based voice conversion

In order to enhance the performance of SI-DNN and KL divergence based voice conversion without parallel training data, we investigate two methods.

Conclusions

In our proposed KLD-DNN based approach to voice conversion, an SI-DNN acoustic model is used to compute the senone posteriors of a given acoustic speech segment so that the acoustic variability between source and target speakers can be equalized; KLD is used to measure the phonetic distortion between two posterior vectors in order to match the corresponding acoustic units of the two speakers. Depending upon whether the target speaker’s speech data come with transcriptions, the approach can operate in either a supervised or an unsupervised mode.

References (50)

  • J. Du et al.

    A new minimum divergence approach to discriminative training

    Proc. ICASSP

    (2007)
  • D. Erro et al.

    INCA algorithm for training voice conversion systems from nonparallel corpora

    IEEE Trans. Audio Speech Lang. Process.

    (2010)
  • G. Hinton et al.

    Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups

    IEEE Signal Process. Mag.

    (2012)
  • W. Hu et al.

    A new DNN-based high quality pronunciation evaluation for computer-aided language learning (CALL)

    Proc. INTERSPEECH

    (2013)
  • A. Kain et al.

    Spectral voice conversion for text-to-speech synthesis

    Proc. ICASSP

    (1998)
  • K. Kobayashi et al.

    The NU-NAIST voice conversion system for the Voice Conversion Challenge 2016

    Proc. INTERSPEECH

    (2016)
  • J. Kominek et al.

    The CMU ARCTIC databases for speech synthesis

    Tech. Rep. CMU-LTI-03-177, Language Technologies Institute

    (2003)
  • S. Kullback et al.

    On information and sufficiency

    Ann. Math. Stat.

    (1951)
  • C.H. Lee et al.

    Map-based adaptation for speech conversion using adaptation data selection and non-parallel training

    Proc. INTERSPEECH

    (2006)
  • Z.H. Ling et al.

    Minimum Kullback-Leibler divergence parameter generation for HMM-based speech synthesis

    IEEE Trans. Audio Speech Lang. Process.

    (2012)
  • H. Ming et al.

    Exemplar-based sparse representation of timbre and prosody for voice conversion

    Proc. ICASSP

    (2016)
  • S. Mohammadi et al.

    An overview of voice conversion systems

    Speech Commun.

    (2017)
  • A. Mouchtaris et al.

    A spectral conversion approach to single-channel speech enhancement

    IEEE Trans. Audio Speech Lang. Process.

    (2007)
  • T.A. Myrvoll et al.

    Optimal clustering of multivariate normal distributions using divergence and its application to HMM adaptation

    Proc. ICASSP

    (2003)
  • M. Narendranath et al.

    Transformation of formants for voice conversion using artificial neural networks

    Speech Commun.

    (1995)