
Neural network and GMM based feature mappings for consonant–vowel recognition in emotional environment

Published in: International Journal of Speech Technology

Abstract

In this work, we propose a feature-transformation framework based on mapping functions for developing a consonant–vowel (CV) recognition system in emotional environments. Expressing emotions is an effective way of conveying messages in human conversation, and the characteristics of CV units differ from one emotion to another. As a result, the performance of existing CV recognition systems degrades in emotional environments. We therefore propose mapping functions based on artificial neural network (ANN) and Gaussian mixture model (GMM) approaches to increase the accuracy of CV recognition in emotional environments. The proposed mapping functions are applied at the CV and phone levels to transform emotional features into neutral features, minimizing the mismatch between training and testing environments. Vowel onset and offset points are used to identify the vowel, consonant, and transition segments; transition segments are taken as the initial 15% of the speech samples between the vowel onset and offset points. Feature mapping at the phone level significantly increases the average performance of the CV recognition system in three emotional environments (anger, happiness, and sadness).
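The segmentation rule stated above (transition segment = initial 15% of the speech samples between the vowel onset point and the vowel offset point) can be sketched as follows. This is a minimal illustration only; the function and variable names are our own and are not taken from the paper:

```python
import numpy as np

def transition_segment(samples, onset, offset, fraction=0.15):
    """Return the transition region: the initial `fraction` of the
    samples lying between the vowel onset point (`onset`) and the
    vowel offset point (`offset`), both given as sample indices."""
    region = samples[onset:offset]          # samples between VOP and offset
    n_trans = int(len(region) * fraction)   # first 15% by default
    return region[:n_trans]

# Toy example: a 1000-sample signal with the vowel onset at index 200
# and the vowel offset at index 800 (a 600-sample vowel region).
signal = np.arange(1000)
trans = transition_segment(signal, 200, 800)
# trans covers the first 90 samples (15% of 600) of the vowel region
```

The remaining samples of the region would then be treated as the steady vowel segment, with the portion before the onset point taken as the consonant segment.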



Author information

Correspondence to Jainath Yadav.

About this article


Cite this article

Yadav, J., Rao, K.S. Neural network and GMM based feature mappings for consonant–vowel recognition in emotional environment. Int J Speech Technol 21, 421–433 (2018). https://doi.org/10.1007/s10772-017-9478-1
