Abstract
In this paper, we propose a neural network based feature transformation framework for developing an emotion independent speaker identification system. Most existing speaker recognition systems do not perform well in emotional environments, yet in real life humans express emotions extensively during conversation to convey their message effectively. We therefore propose a speaker recognition system that is robust to variations in the emotional state of the speaker. Neural network models are explored to transform speaker-specific spectral features from any given emotion to neutral. Eight emotions are considered: anger, sad, disgust, fear, happy, neutral, sarcastic and surprise. Emotional speech databases developed in Hindi, Telugu and German are used to analyze the effect of the proposed feature transformation on the performance of the speaker identification system. Spectral features are represented by mel-frequency cepstral coefficients (MFCCs), and speaker models are built using Gaussian mixture models (GMMs). The performance of the speaker identification system is analyzed with various feature mapping techniques. Results demonstrate that the proposed neural network based feature transformation improves speaker identification performance by 20%, and that transformation at the syllable level performs better than transformation at the sentence level.
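The pipeline the abstract describes (MFCC features, a neural network that maps emotional frames toward neutral ones, and per-speaker GMMs) can be sketched end to end. The following is a minimal illustrative sketch, not the authors' implementation: it assumes scikit-learn stand-ins (MLPRegressor for the mapping network, GaussianMixture for the speaker models) and synthetic random vectors in place of real, time-aligned MFCCs.

```python
# Illustrative sketch only (not the paper's implementation): an MLP maps
# emotional MFCC frames to neutral-style frames; per-speaker GMMs trained
# on neutral speech then score the transformed features.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_dim = 13  # typical MFCC dimensionality (an assumption, not from the paper)

# Stand-in data: synthetic "MFCC" frames per speaker.
def fake_mfcc(n_frames, shift):
    return rng.normal(loc=shift, scale=1.0, size=(n_frames, n_dim))

neutral = {s: fake_mfcc(500, shift=s) for s in range(3)}
# Emotional versions: neutral frames perturbed by a systematic offset.
emotional = {s: neutral[s] + rng.normal(0.5, 0.2, (500, n_dim))
             for s in neutral}

# Train one feature-mapping network on pooled parallel frames
# (emotional input -> aligned neutral target).
X = np.vstack([emotional[s] for s in neutral])
Y = np.vstack([neutral[s] for s in neutral])
mapper = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mapper.fit(X, Y)

# Train one GMM per speaker on neutral speech.
gmms = {s: GaussianMixture(n_components=8, covariance_type='diag',
                           random_state=0).fit(feats)
        for s, feats in neutral.items()}

# Identification: transform an emotional test utterance, then pick the
# speaker whose GMM gives the highest average log-likelihood.
test = emotional[1][:100]
mapped = mapper.predict(test)
scores = {s: g.score(mapped) for s, g in gmms.items()}
print('identified speaker:', max(scores, key=scores.get))
```

In this setup the transformation network is shared across speakers and trained on parallel emotional/neutral frames; the paper additionally compares mappings learned at the sentence and syllable levels, which this sketch does not model.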







Cite this article
Krothapalli, S.R., Yadav, J., Sarkar, S. et al. Neural network based feature transformation for emotion independent speaker identification. Int J Speech Technol 15, 335–349 (2012). https://doi.org/10.1007/s10772-012-9148-2