
Neural network and GMM based feature mappings for consonant–vowel recognition in emotional environment

Published in: International Journal of Speech Technology

Abstract

In this work, we propose a feature-transformation framework based on mapping functions for developing a consonant–vowel (CV) recognition system in emotional environments. Expressing emotions is an effective way of conveying messages in human conversation, and the characteristics of CV units differ from one emotion to another. As a result, the performance of existing CV recognition systems degrades in emotional environments. We therefore propose mapping functions based on artificial neural network (ANN) and Gaussian mixture model (GMM) approaches to increase the accuracy of CV recognition in emotional environments. The proposed mapping functions are applied at the CV and phone levels to transform emotional features into neutral features, minimizing the mismatch between training and testing environments. Vowel onset and offset points are used to identify the vowel, consonant, and transition segments; transition segments are taken as the initial 15% of the speech samples between the vowel onset and offset points. Feature mapping at the phone level significantly increases the average performance of the CV recognition system in three emotional environments (anger, happiness, and sadness).
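The segmentation rule stated above (transition segment = initial 15% of the speech samples between the vowel onset point and the vowel offset point) can be sketched as follows. This is a minimal illustration only; the function and variable names are our own and are not taken from the paper:

```python
import numpy as np

def transition_segment(samples, onset, offset, fraction=0.15):
    """Return the transition region: the initial `fraction` of the
    samples lying between the vowel onset point (`onset`) and the
    vowel offset point (`offset`), both given as sample indices."""
    region = samples[onset:offset]          # samples between VOP and offset
    n_trans = int(len(region) * fraction)   # first 15% by default
    return region[:n_trans]

# Toy example: a 1000-sample signal with the vowel onset at index 200
# and the vowel offset at index 800 (a 600-sample vowel region).
signal = np.arange(1000)
trans = transition_segment(signal, 200, 800)
# trans covers the first 90 samples (15% of 600) of the vowel region
```

The remaining samples of the region would then be treated as the steady vowel segment, with the portion before the onset point taken as the consonant segment.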



Author information

Correspondence to Jainath Yadav.

About this article


Cite this article

Yadav, J., Rao, K.S. Neural network and GMM based feature mappings for consonant–vowel recognition in emotional environment. Int J Speech Technol 21, 421–433 (2018). https://doi.org/10.1007/s10772-017-9478-1
