STRAIGHT-Based Emotion Conversion Using Quadratic Multivariate Polynomial

  • Short Paper
  • Published in Circuits, Systems, and Signal Processing

Abstract

Speech is the natural mode of human communication and the easiest way of expressing emotion. Emotional speech is characterized by features such as the f0 contour, intensity, speaking rate, and voice quality; collectively, these features are called prosody. Prosody is generally modified by pitch and time scaling. Unlike voice conversion, where spectral conversion is the main concern, emotional speech conversion is more sensitive to prosody. Several techniques, both linear and nonlinear, have been used to transform speech. Our hypothesis is that the quality of emotional speech conversion can be improved by estimating a nonlinear relationship between the neutral and emotional speech feature vectors. In this work, a quadratic multivariate polynomial (QMP) is explored for transforming neutral speech into emotional target speech. Both subjective and objective analyses were carried out to evaluate the transformed emotional speech, using comparison mean opinion score (CMOS), mean opinion score (MOS), identification rate, root-mean-square error, and Mahalanobis distance. For the Toronto emotional database, the CMOS analysis indicates that the transformed speech can partly be perceived as the target emotion, except for neutral-to-sad conversion. Moreover, the MOS scores and spectrograms indicate good quality of the transformed speech. For the German database, except for neutral-to-boredom conversion, the proposed technique scores better on CMOS than the gross and initial–middle–final methods, though lower than the syllable method. Nevertheless, the QMP technique is simple, is easy to implement, yields good-quality transformed speech, and estimates the transformation function from a limited number of training utterances.
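
To make the QMP mapping concrete, the following is a minimal sketch of how such a transformation function could be estimated by least squares and scored with the paper's objective measures (root-mean-square error and Mahalanobis distance). It assumes time-aligned pairs of neutral and emotional feature frames are already available (in the paper these would come from STRAIGHT analysis); the random stand-in data and the helper names (quadratic_features, fit_qmp, apply_qmp) are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def quadratic_features(X):
        """Expand each row x into [1, x_i, x_i * x_j for i <= j]."""
        n, d = X.shape
        cross = [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
        return np.hstack([np.ones((n, 1)), X, np.stack(cross, axis=1)])

    def fit_qmp(X_neutral, Y_emotional):
        """Least-squares estimate of W so that
        quadratic_features(X_neutral) @ W approximates Y_emotional."""
        Phi = quadratic_features(X_neutral)
        W, *_ = np.linalg.lstsq(Phi, Y_emotional, rcond=None)
        return W

    def apply_qmp(W, X_neutral):
        """Map neutral feature frames toward the target emotion."""
        return quadratic_features(X_neutral) @ W

    # Toy usage with random stand-ins for aligned feature frames;
    # real inputs would be STRAIGHT-derived feature vectors.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))                             # "neutral" frames
    Y = X**2 + 0.5 * X + rng.normal(scale=0.1, size=X.shape)  # "emotional" frames

    W = fit_qmp(X, Y)
    Y_hat = apply_qmp(W, X)

    # Objective measures named in the abstract: RMSE, and (one common
    # variant of) Mahalanobis distance between the mean converted and
    # mean target feature vectors.
    rmse = np.sqrt(np.mean((Y - Y_hat) ** 2))
    diff = Y.mean(axis=0) - Y_hat.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(Y, rowvar=False))
    mahal = np.sqrt(diff @ cov_inv @ diff)
    print(f"RMSE: {rmse:.4f}  Mahalanobis distance: {mahal:.4f}")

Because the quadratic expansion is fixed, fitting reduces to a single linear least-squares solve per target emotion, consistent with the abstract's claim that the transformation function can be estimated from a limited number of training utterances.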



Acknowledgements

The authors would like to thank Prof. Hideki Kawahara, Wakayama University, for his assistance with STRAIGHT.

Author information


Corresponding author

Correspondence to Jang Bahadur Singh.


About this article

Cite this article

Singh, J.B., Lehana, P. STRAIGHT-Based Emotion Conversion Using Quadratic Multivariate Polynomial. Circuits Syst Signal Process 37, 2179–2193 (2018). https://doi.org/10.1007/s00034-017-0660-0

