Prosodic transformation in vocal emotion conversion for multi-lingual scenarios: a pilot study

International Journal of Speech Technology

Abstract

The primary objective of this work is to compare patterns of vocal emotional expression across distinct linguistic contexts. Five language datasets are used for experimentation: German (EmoDB), English (SAVEE), and three Indian languages, Telugu (IITKGP), Malayalam and Tamil, which vary systematically in typology and linguistic proximity. The hypothesis is that although the selected languages exploit prosodic parameters to different degrees when expressing a set of basic emotions (anger, fear and happiness), there are underlying similarities in prosodic perception. A methodology for estimating and incorporating the supra-segmental parameters that contribute to emotional expression, namely pitch, duration and intensity, is developed and tested on all five datasets. The main contribution of this work is the use of the same prosodic transformation scales for emotion conversion across multi-lingual test cases to generate vocal affect in multiple languages. Objective evaluation showed maximum correlation for anger synthesised by adapting transformation scales from Tamil (0.95) and for fear from Telugu (0.89), while for happiness, scales from the English dataset yielded the best conversion results (0.94). These results are corroborated by a perception test using comparative mean opinion scores (CMOS): a maximum CMOS of 3.8 is obtained for anger and fear, while conversion to happiness yields a score of 3.3. The experimental findings indicate that although much of the information embedded in prosodic parameters depends on language structure, common trends in emotion perception can be observed across certain languages, offering insights for the development of emotion conversion systems in a multilingual context.
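
To make the methodology concrete, the sketch below applies multiplicative pitch, duration and intensity scales to a neutral utterance and scores the conversion by correlating F0 contours, loosely mirroring the objective evaluation reported above. This is a minimal illustrative sketch, not the authors' implementation (the paper builds on WSOLA-style time-scale modification); librosa is assumed as the signal-processing backend, and the file name and scale values are hypothetical placeholders.

```python
# Minimal sketch of prosody-scale-based emotion conversion, not the
# authors' pipeline. librosa is an assumed backend; "neutral.wav" and
# all scale values below are hypothetical.
import librosa
import numpy as np

def convert_emotion(neutral_wav, pitch_scale, duration_scale,
                    intensity_scale, sr=16000):
    """Apply multiplicative pitch, duration and intensity scales
    to a neutral utterance to approximate a target emotion."""
    y, sr = librosa.load(neutral_wav, sr=sr)
    # Pitch: express a multiplicative F0 scale in semitones.
    y = librosa.effects.pitch_shift(y, sr=sr,
                                    n_steps=12 * np.log2(pitch_scale))
    # Duration: time_stretch rate > 1 shortens, so invert the scale.
    y = librosa.effects.time_stretch(y, rate=1.0 / duration_scale)
    # Intensity: simple amplitude scaling, clipped to the valid range.
    return np.clip(y * intensity_scale, -1.0, 1.0), sr

def f0_correlation(y_ref, y_conv, sr=16000):
    """Pearson correlation between the F0 contours of reference
    emotional speech and converted speech (common-length resampling)."""
    f0_ref = librosa.yin(y_ref, fmin=60, fmax=400, sr=sr)
    f0_conv = librosa.yin(y_conv, fmin=60, fmax=400, sr=sr)
    n = min(len(f0_ref), len(f0_conv))
    grid = np.linspace(0.0, 1.0, n)
    f0_ref = np.interp(grid, np.linspace(0.0, 1.0, len(f0_ref)), f0_ref)
    f0_conv = np.interp(grid, np.linspace(0.0, 1.0, len(f0_conv)), f0_conv)
    return np.corrcoef(f0_ref, f0_conv)[0, 1]

# Example: hypothetical anger scales adapted from one source language.
y_conv, sr = convert_emotion("neutral.wav", pitch_scale=1.4,
                             duration_scale=0.85, intensity_scale=1.3)
```

In the paper's setting, such scales would be estimated per emotion and per source language; the Pearson correlation above stands in for the objective measure behind the 0.89–0.95 scores quoted in the abstract.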


Notes

  1. Figure and equation reproduced with written permission from Verhelst and Roelands (1993).

References

  • Aihara, R., Takashima, R., Takiguchi, T., & Ariki, Y. (2012). GMM-based emotional voice conversion using spectrum and prosody features. American Journal of Signal Processing, 2(5), 134–138.

  • Aihara, R., Ueda, R., Takiguchi, T., & Ariki, Y. (2014). Exemplar-based emotional voice conversion using non-negative matrix factorization. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific (pp. 1–7).

  • Akagi, M., Han, X., Elbarougy, R., Hamada, Y., & Li, J. (2014). Toward affective speech-to-speech translation: Strategy for emotional speech recognition and synthesis in multiple languages. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific (pp. 1–10).

  • Bakshi, P. M., & Kashyap, S. C. (1982). The constitution of India. Prayagraj: Universal Law Publishing.

  • Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., & Weiss, B. (2005). A database of German emotional speech. In Ninth European Conference on Speech Communication and Technology (pp. 1517–1520).

  • Burkhardt, F., & Sendlmeier, W. F. (2000). Verification of acoustical correlates of emotional speech using formant-synthesis. In ISCA Tutorial and Research Workshop (ITRW) on speech and emotion (pp. 151–156).

  • Cabral, J. P., & Oliveira, L. C. (2006). Emovoice: a system to generate emotions in speech. In Ninth International Conference on Spoken Language Processing (pp. 1798–1801).

  • Cahn, J. E. (1990). The generation of affect in synthesized speech. Journal of the American Voice I/O Society, 8(1), 1–19.

  • Cen, L., Chan, P., Dong, M., & Li, H. (2010). Generating emotional speech from neutral speech. In 7th International Symposium on Chinese Spoken Language Processing (pp. 383–386).

  • Desai, S., Black, A. W., Yegnanarayana, B., & Prahallad, K. (2010). Spectral mapping using artificial neural networks for voice conversion. IEEE Transactions on Audio, Speech, and Language Processing, 18(5), 954–964.

  • Govind, D., & Joy, T. T. (2016). Improving the flexibility of dynamic prosody modification using instants of significant excitation. Circuits, Systems, and Signal Processing, 35(7), 2518–2543.

  • Govind, D., & Prasanna, S. R. M. (2012). Epoch extraction from emotional speech. In International Conference on Signal Processing and Communications (SPCOM) (pp. 1–5).

  • Govind, D., & Prasanna, S. M. (2013). Dynamic prosody modification using zero frequency filtered signal. International Journal of Speech Technology, 16(1), 41–54.

  • Govind, D., Prasanna, S. M., & Yegnanarayana, B. (2011). Neutral to target emotion conversion using source and suprasegmental information. In Twelfth Annual Conference of the International Speech Communication Association (pp. 2969–2972).

  • Haq, S., Jackson, P. J., & Edge, J. (2009). Speaker-dependent audio-visual emotion recognition. In AVSP (pp. 53–58).

  • Helander, E., Silén, H., Virtanen, T., & Gabbouj, M. (2011). Voice conversion using dynamic kernel partial least squares regression. IEEE Transactions on Audio, Speech, and Language Processing, 20(3), 806–817.

  • Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (Vol. 1, pp. 373–376).

  • Kadiri, S. R., & Yegnanarayana, B. (2015). Analysis of singing voice for epoch extraction using zero frequency filtering method. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4260–4264).

  • Kadiri, S. R., & Yegnanarayana, B. (2017). Epoch extraction from emotional speech using single frequency filtering approach. Speech Communication, 86, 52–63.

  • Koolagudi, S. G., Maity, S., Kumar, V. A., Chakrabarti, S., & Rao, K. S. (2009). IITKGP-SESC: Speech database for emotion analysis. In International Conference on Contemporary Computing (pp. 485–492). Springer, Berlin.

  • Luo, Z., Takiguchi, T., & Ariki, Y. (2016). Emotional voice conversion using deep neural networks with MCC and F0 features. In IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS) (pp. 1–5).

  • Ming, H., Huang, D., Xie, L., Wu, J., Dong, M., & Li, H. (2016a). Deep bidirectional LSTM modeling of timbre and prosody for emotional voice conversion. In Proceedings of INTERSPEECH.

  • Ming, H., Huang, D., Xie, L., Zhang, S., Dong, M., & Li, H. (2016b). Exemplar-based sparse representation of timbre and prosody for voice conversion. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5175–5179).

  • Murray, I. R., & Arnott, J. L. (1995). Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication, 16(4), 369–390.

  • Murty, K. S. R., & Yegnanarayana, B. (2008). Epoch extraction from speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 16(8), 1602–1613.

  • Nguyen, H. Q., Lee, S. W., Tian, X., Dong, M., & Chng, E. S. (2016). High quality voice conversion using prosodic and high-resolution spectral features. Multimedia Tools and Applications, 75(9), 5265–5285.

  • Pravena, D., & Govind, D. (2016). Expressive speech analysis for epoch extraction using zero frequency filtering approach. In IEEE Students’ Technology Symposium (TechSym) (pp. 240–244).

  • Pravena, D., & Govind, D. (2017). Development of simulated emotion speech database for excitation source analysis. International Journal of Speech Technology, 20(2), 327–338.

  • Rachman, L., Liuni, M., Arias, P., Lind, A., Johansson, P., Hall, L., et al. (2018). DAVID: An open-source platform for real-time transformation of infra-segmental emotional cues in running speech. Behavior Research Methods, 50(1), 323–343.

  • Rao, K. S., & Vuppala, A. K. (2013). Non-uniform time scale modification using instants of significant excitation and vowel onset points. Speech Communication, 55(6), 745–756.

  • Sarkar, P., Haque, A., Dutta, A. K., Reddy, G., Harikrishna, D. M., Dhara, P., & Rao, K. S. (2014). Designing prosody rule-set for converting neutral TTS speech to storytelling style speech for Indian languages: Bengali, Hindi and Telugu. In 2014 Seventh International Conference on Contemporary Computing (IC3) (pp. 473–477).

  • Schröder, M. (2009). Expressive speech synthesis: past, present, and possible futures. In Affective information processing (pp. 111–126). London: Springer.

  • Tao, J., Kang, Y., & Li, A. (2006). Prosody conversion from neutral speech to emotional speech. IEEE Transactions on Audio, Speech, and Language Processing, 14(4), 1145–1154.

  • Theune, M., Meijs, K., Heylen, D., & Ordelman, R. (2006). Generating expressive speech for storytelling applications. IEEE Transactions on Audio, Speech, and Language Processing, 14(4), 1137–1144.

  • Toda, T., Black, A. W., & Tokuda, K. (2007). Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech, and Language Processing, 15(8), 2222–2235.

  • Vekkot, S., Gupta, D., Zakariah, M., & Alotaibi, Y. A. (2019). Hybrid framework for speaker-independent emotion conversion using i-vector PLDA and neural network. IEEE Access, 7, 81883–81902.

  • Vekkot, S., & Tripathi, S. (2016a). Significance of glottal closure instants detection algorithms in vocal emotion conversion. In International Workshop Soft Computing Applications (pp. 462–473). Springer, Cham.

  • Vekkot, S., & Tripathi, S. (2016b). Inter-emotion conversion using dynamic time warping and prosody imposition. In International Symposium on Intelligent Systems Technologies and Applications (pp. 913–924). Springer, Cham.

  • Vekkot, S., & Tripathi, S. (2017). Vocal emotion conversion using WSOLA and linear prediction. In International Conference on Speech and Computer (pp. 777–787). Springer, Cham.

  • Verhelst, W., & Roelands, M. (1993). An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech. In IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 2, pp. 554–557).

  • Verma, R., Sarkar, P., & Rao, K. S. (2015). Conversion of neutral speech to storytelling style speech. In Eighth International Conference on Advances in Pattern Recognition (ICAPR) (pp. 1–6).

  • Vuppala, A. K., & Kadiri, S. R. (2014). Neutral to anger speech conversion using non-uniform duration modification. In 9th International Conference on Industrial and Information Systems (ICIIS) (pp. 1–4).

  • Vydana, H. K., Kadiri, S. R., & Vuppala, A. K. (2016). Vowel-based non-uniform prosody modification for emotion conversion. Circuits, Systems, and Signal Processing, 35(5), 1643–1663.

  • Vydana, H. K., Raju, V. V., Gangashetty, S. V., & Vuppala, A. K. (2015). Significance of emotionally significant areas of speech for emotive to neutral conversion. In International Conference on Mining Intelligence and Knowledge Exploration (pp. 287–296). Springer, Cham.

  • Wu, C. H., Hsia, C. C., Lee, C. H., & Lin, M. C. (2009). Hierarchical prosody conversion using regression-based clustering for emotional speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 18(6), 1394–1405.

  • Wu, Z., Virtanen, T., Chng, E. S., & Li, H. (2014). Exemplar-based sparse representation with residual compensation for voice conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10), 1506–1521.

  • Yadav, J., & Rao, K. S. (2016). Prosodic mapping using neural networks for emotion conversion in Hindi language. Circuits, Systems, and Signal Processing, 35(1), 139–162.

Acknowledgements

This research is supported by the Government of India's Visveswaraya Ph.D. scheme through a scholarship awarded to the first author towards the completion of her Ph.D.

Author information

Corresponding author

Correspondence to Deepa Gupta.

About this article

Cite this article

Vekkot, S., Gupta, D. Prosodic transformation in vocal emotion conversion for multi-lingual scenarios: a pilot study. Int J Speech Technol 22, 533–549 (2019). https://doi.org/10.1007/s10772-019-09626-5
