
Neurocomputing

Volume 71, Issues 1–3, December 2007, Pages 174–180

A neural-wavelet architecture for voice conversion

https://doi.org/10.1016/j.neucom.2007.08.010

Abstract

In this letter we propose a new architecture for voice conversion based on a joint neural-wavelet approach. We also examine the characteristics of several wavelet families and determine the one that best matches the requirements of the proposed system. The theoretical conclusions are confirmed in practice with utterances extracted from the TIMIT speech corpus.

Introduction

Voice conversion, also known as voice morphing, enables a source speaker to transform his or her speech so that it sounds as if it had been spoken by another person, the target speaker, while preserving the original content of the spoken message [18].

A considerable body of literature has recently appeared addressing voice conversion [12], [5], [17]. Most methods are single-scale techniques based on the interpolation of speech parameters and on modeling the speech signals using formant frequencies (FFs) [1], linear prediction coding (LPC) and cepstrum coefficients (CCs) [6], line spectral frequencies (LSFs) [11], and segmental codebooks (SCs) [16], among many others. There are also techniques based on hidden Markov models (HMMs) and Gaussian mixture models (GMMs) [4], which are well-known methods in the speech community. Most of the techniques mentioned above suffer from a lack of detailed information during the extraction of the formant coefficients and the excitation signal. This limits how accurately the parameters can be estimated and introduces distortion during the synthesis of the target speech. Turk and Arslan [16] introduced the discrete wavelet transform (DWT) [2], [14] for voice conversion and obtained encouraging results. Following their ideas, other interesting contributions have appeared, such as that of Orphanidou et al. [13].

In particular, this paper proposes a new algorithm for voice conversion based on wavelet transforms and radial basis function (RBF) neural networks [8]. This is the main contribution of this work, which extends the considerations of Turk and Arslan, and of Orphanidou et al., on the use of wavelets for voice conversion.

This paper is organized as follows. Section 2 presents a brief overview of how wavelet-based algorithms for voice conversion work. The proposed approach is presented in Section 3. Section 4 studies the characteristics of wavelets that are important for voice conversion, in order to determine the best wavelet family to use. Section 5 describes the tests and results, and, lastly, Section 6 presents the conclusions.

Section snippets

A brief review on wavelet-based voice conversion

The basic idea behind the use of the DWT for voice conversion is sub-band separation. With it, the pitch period of voiced sounds, the FFs [4], and other information can be treated and converted separately. To convert the source speaker's pattern into the target speaker's pattern, artificial neural networks can be used, as in [13]. Usually, these networks, which are important components of the system, are multilayer perceptrons or RBF networks [8].
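To make the sub-band separation step concrete, here is a minimal sketch. The use of PyWavelets, the wavelet name ("sym12"), the decomposition depth, and the helper names are illustrative assumptions, not the paper's exact configuration:

```python
# A minimal sketch of DWT sub-band separation for voice conversion.
# Assumptions (not from the paper): PyWavelets as the DWT library,
# 'sym12' as the wavelet, 3 decomposition levels, 512-sample frames.
import numpy as np
import pywt

def subband_decompose(frame, wavelet="sym12", level=3):
    """Split a speech frame into DWT sub-bands: [cA_level, cD_level, ..., cD_1].

    Pitch- and formant-related low-frequency content lands in the
    approximation band, finer details in the detail bands, so each
    band can be mapped to the target speaker separately.
    """
    return pywt.wavedec(frame, wavelet, level=level)

def subband_reconstruct(coeffs, wavelet="sym12"):
    """Re-synthesize a frame from (possibly converted) sub-band coefficients."""
    return pywt.waverec(coeffs, wavelet)

frame = np.random.randn(512)          # stand-in for a windowed speech frame
coeffs = subband_decompose(frame)
# Conceptually, each band would pass through its own trained mapping:
# converted = [net.predict(c) for net, c in zip(rbf_nets, coeffs)]
restored = subband_reconstruct(coeffs)
assert np.allclose(frame, restored[:len(frame)])  # perfect reconstruction
```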

For training

The proposed approach

The proposed approach is divided into two parts: training (TR) and testing (TE). They are fully described in Tables 1 and 2, respectively, and further explanations follow.

All the RBF networks use the Gaussian function [13] as the activation function in their hidden layers. The output neurons, on the other hand, use a simple linear weighted sum. To train the RBF networks, a two-step procedure was adopted. In the first step, unsupervised training is used to determine the centers and variances of the
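A minimal sketch of such a two-step RBF training procedure follows, with Gaussian hidden units and a linear output layer. The use of k-means for the unsupervised step and the per-cluster variance heuristic are our assumptions, since the excerpt does not specify them:

```python
# Sketch of the two-step RBF training described above: (1) unsupervised
# placement of Gaussian centers/variances, (2) supervised least-squares
# fit of the linear output weights. Using k-means for step 1 and the
# per-cluster mean squared distance as the variance are our assumptions.
import numpy as np
from sklearn.cluster import KMeans

class RBFNet:
    def __init__(self, n_centers=32):
        self.n_centers = n_centers

    def _design_matrix(self, X):
        # Gaussian hidden units: phi_i(x) = exp(-||x - c_i||^2 / (2*sigma_i^2))
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * self.sigmas ** 2))

    def fit(self, X, Y):
        # Step 1 (unsupervised): centers via k-means, one variance per center.
        km = KMeans(n_clusters=self.n_centers, n_init=10).fit(X)
        self.centers = km.cluster_centers_
        self.sigmas = np.array([
            np.sqrt(((X[km.labels_ == i] - c) ** 2).sum(axis=-1).mean()) + 1e-8
            for i, c in enumerate(self.centers)
        ])
        # Step 2 (supervised): linear output weights by least squares.
        Phi = self._design_matrix(X)
        self.W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
        return self

    def predict(self, X):
        # Linear weighted combination of the Gaussian activations.
        return self._design_matrix(X) @ self.W
```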

Exploring the characteristics of wavelets

According to DWT theory [2], the jth-level decomposition of a given discrete (speech) signal, f[n], can be written as [7]

$$f[n]=\sum_{k=0}^{(n/2^{j})-1}R_{j,k}\,\varphi_{j,k}[n]+\sum_{t=1}^{j}\sum_{k=0}^{(n/2^{t})-1}S_{t,k}\,\psi_{t,k}[n],$$

where $\varphi[n]=\sum_{k}h[k]\,\varphi[2n-k]$ and $\psi[n]=\sum_{k}g[k]\,\varphi[2n-k]$ are, respectively, the scaling and wavelet functions, which form a Riesz basis [2] in which the signal f is written; $R_{j,k}=\langle f,\varphi_{j,k}\rangle$ and $S_{t,k}=\langle f,\psi_{t,k}\rangle$ are the approximation and detail coefficients, respectively; and $h[k]$ and $g[k]=(-1)^{k}h[N-k-1]$ are, respectively, the quadrature mirror filter (QMF) low-pass and high-pass analysis filters [2].
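As a quick numerical sanity check of the QMF relation above, the following sketch verifies $g[k]=(-1)^{k}h[N-k-1]$ for stored analysis filters. The use of PyWavelets and of the db4 filter is our illustrative assumption, not the paper's setup; sign conventions for g differ across libraries, hence the two-sided comparison:

```python
# Numerical check of the QMF relation g[k] = (-1)^k h[N-k-1] stated above.
# Assumption: PyWavelets' stored analysis filters (here db4) stand in for
# the paper's filters; libraries differ in the sign convention for g.
import numpy as np
import pywt

w = pywt.Wavelet("db4")
h = np.array(w.dec_lo)   # low-pass analysis filter h[k]
g = np.array(w.dec_hi)   # high-pass analysis filter g[k]
N = len(h)

g_from_h = np.array([(-1) ** k * h[N - k - 1] for k in range(N)])
print(np.allclose(g, g_from_h) or np.allclose(g, -g_from_h))  # True
```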

Tests and results

We extracted speech data from the TIMIT corpus [15] and used them to convert voice patterns. In particular, we report the results obtained when converting two sentences for each of the following patterns: male speaker to female speaker, female speaker to male speaker, male speaker to male speaker, and female speaker to female speaker. More sentences were used during the tests, but since the results are quite similar and listing them would require considerable space, they are not reported.

The

Conclusions and future work

We presented a new architecture for voice conversion based on RBF networks and wavelets, including a study on the best wavelet for the proposed algorithm. The study considered the perceptual quality of the morphed speech. Based on our theoretical assumptions, confirmed in practice, we concluded that Symmlets with N around 24 are the best candidates for the proposed architecture. Our future work will include the use of matched wavelets to increase quality, reduce the computational complexity, and
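For readers who want to experiment with this choice, the sketch below inspects candidate Symmlets with PyWavelets. Note that mapping the paper's "N around 24" to a library name is our assumption (pywt's symQ wavelet has a 2Q-tap filter, so sym12 has 24 taps):

```python
# Illustrative only: listing candidate Symmlets with PyWavelets.
# Mapping the paper's "N around 24" to a pywt name is our assumption:
# pywt's symQ has a 2Q-tap filter, so sym12 has 24 taps.
import pywt

for name in ("sym8", "sym10", "sym12", "sym14"):
    w = pywt.Wavelet(name)
    print(name, "filter length:", w.dec_len,
          "vanishing moments:", w.vanishing_moments_psi)
```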

Acknowledgements

We wish to thank the State of São Paulo Research Foundation for the grant given to this work under process no. 2005/00015-1.

References (18)

  • S. Furui, Research on individuality features in speech waves and automatic speaker recognition techniques, Speech Commun. (1986)
  • H. Valbret, Voice transformation using PSOLA technique, Speech Commun. (1992)
  • M. Abe, S. Nakamura, K. Shikano, H. Kuwabara, Voice conversion through vector quantization, in: Proceedings of the IEEE...
  • P.S. Addison, The Illustrated Wavelet Transform Handbook: Introductory Theory and Applications in Science, Engineering, Medicine and Finance (2002)
  • M. Bosi et al., Introduction to Digital Audio Coding and Standards (2003)
  • L. Deng et al., Speech Processing: A Dynamic and Optimization-Oriented Approach (2003)
  • C. Drioli, Radial basis function networks for conversion of sound speech spectra, EURASIP J. Appl. Signal Process. (1)...
  • R.C. Guido, et al., A study on the best wavelet for audio compression, in: 40th IEEE ASILOMAR International Conference...
  • S. Haykin, Neural Networks: A Comprehensive Foundation (1998)

Cited by (24)

  • Novel approach of MFCC based alignment and WD-residual modification for voice conversion using RBF

    2017, Neurocomputing
Citation Excerpt:

    The feature vector containing LSF parameters and coefficients representing the residual signal of the corresponding source and target speakers is time-aligned using an MFCC-based DTW technique. Mapping functions based on the RBF transformation model are developed to modify the feature vectors [9,27,26,30]. The online phase employs the mapping rules obtained in the training phase to modify the test speaker's feature vectors.

  • ZCR-aided neurocomputing: A study with applications

    2016, Knowledge-Based Systems
Citation Excerpt:

    There are many subclassifications for speech data, however, voiced, unvoiced and silent, respectively originated from quasi-periodic, non-periodic and inactive sources, are the root ones [38]-pp.77, 78. Usual applications in which the differentiation between voiced, unvoiced and silent segments (VUSS) is relevant include large-vocabulary speech recognition [39], speaker identification [40], voice conversion [41] and speech coding [42]. Thus, I dedicate this section to initially present a ZCR-based algorithm for the distinction among VUSS and, upon taking advantage of that formulation, to introduce my proposal for isolated-sentence word segmentation.

  • Kernel machines for epilepsy diagnosis via EEG signal classification: A comparative study

    2011, Artificial Intelligence in Medicine
Citation Excerpt:

    Before concluding this section, it is worth discussing briefly about the impact of the choice of the wavelet basis on the performance of the kernel machines considered. Instead of following the guidelines proposed by Guido et al. [29], which seem very useful, we have decided to choose the type of the wavelet filter by conducting preliminary experiments (as suggested by Subasi [28]), whose results are condensed in Table 2. In this table, we report the accuracy values achieved by three of the kernel machines, taking into account different wavelet families (see [29]) and some of the statistical features derived over the wavelet coefficients.

  • A wavelet-based speaker verification algorithm

    2010, International Journal of Wavelets, Multiresolution and Information Processing