Neural network based feature transformation for emotion independent speaker identification

Published in: International Journal of Speech Technology

Abstract

In this paper, we propose a neural network based feature transformation framework for developing an emotion independent speaker identification system. Most present speaker recognition systems may not perform well in emotional environments, yet in real life humans extensively express emotions during conversations to convey their messages effectively. Therefore, in this work we propose a speaker recognition system that is robust to variations in the emotional moods of speakers. Neural network models are explored to transform the speaker specific spectral features from any given emotion to neutral. Eight emotions are considered: anger, sadness, disgust, fear, happiness, neutral, sarcasm and surprise. Emotional databases developed in Hindi, Telugu and German are used to analyze the effect of the proposed feature transformation on the performance of the speaker identification system. Spectral features are represented by mel-frequency cepstral coefficients (MFCCs), and speaker models are developed using Gaussian mixture models (GMMs). The performance of the speaker identification system is analyzed with various feature mapping techniques. Results demonstrate that the proposed neural network based feature transformation improves speaker identification performance by 20 %. Feature transformation at the syllable level performs better than transformation at the sentence level.
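The pipeline described above can be sketched in miniature: per-speaker GMMs are trained on neutral MFCC features, a neural network learns a frame-level mapping from emotional features back toward neutral, and identification scores the mapped test frames against each speaker's GMM. This is an illustrative sketch only, using synthetic data in place of real MFCCs; the feature dimensions, network size, the toy `emotional` distortion, and the `identify` helper are all assumptions, not the paper's actual configuration.

```python
# Illustrative sketch of the abstract's pipeline: NN-based emotion-to-neutral
# feature mapping followed by GMM speaker identification. Synthetic data
# stands in for real MFCC features.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
N_SPEAKERS, N_FRAMES, N_MFCC = 3, 400, 13

# Synthetic "neutral" MFCC frames per speaker (speaker s centred at mean s).
neutral = {s: rng.normal(loc=s, scale=1.0, size=(N_FRAMES, N_MFCC))
           for s in range(N_SPEAKERS)}

def emotional(x):
    # Toy affine distortion standing in for the spectral effect of emotion.
    return 1.3 * x + 0.5

# 1. Train one GMM per speaker on neutral features.
gmms = {s: GaussianMixture(n_components=4, random_state=0).fit(feats)
        for s, feats in neutral.items()}

# 2. Train the emotion-to-neutral mapper on frame-aligned pairs.
X = np.vstack([emotional(f) for f in neutral.values()])
Y = np.vstack(list(neutral.values()))
mapper = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                      random_state=0).fit(X, Y)

# 3. Identify: map emotional test frames to neutral-style features, then
#    pick the speaker whose GMM gives the highest mean log-likelihood.
def identify(frames):
    mapped = mapper.predict(frames)
    scores = {s: g.score(mapped) for s, g in gmms.items()}
    return max(scores, key=scores.get)
```

With this setup, emotional utterances from a known speaker should score highest against that speaker's neutral GMM once the mapper has undone the distortion, which is the effect the paper measures at both syllable and sentence level.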



Author information

Corresponding author

Correspondence to Sreenivasa Rao Krothapalli.

Cite this article

Krothapalli, S.R., Yadav, J., Sarkar, S. et al. Neural network based feature transformation for emotion independent speaker identification. Int J Speech Technol 15, 335–349 (2012). https://doi.org/10.1007/s10772-012-9148-2
