
Improvement of phone recognition accuracy using speech mode classification

International Journal of Speech Technology

Abstract

In this work, we develop a speech mode classification (SMC) model for improving the performance of a phone recognition system (PRS). We explore vocal tract system, excitation source, and prosodic features for building the SMC model. These features are extracted from the voiced regions of the speech signal. In this study, conversation, extempore, and read speech are considered as three different modes of speech. The vocal tract component of speech is captured using Mel-frequency cepstral coefficients (MFCCs). The excitation source features are captured through Mel power differences of spectrum in sub-bands (MPDSS) and residual Mel-frequency cepstral coefficients (RMFCCs) of the speech signal. The prosodic information is extracted from pitch and intensity. SMC models are developed using the above features both independently and in fusion. Experiments are carried out on a Bengali speech corpus to analyze the accuracy of the SMC model using artificial neural network (ANN), naive Bayes, support vector machine (SVM), and k-nearest neighbor (KNN) classifiers. The four classification models are combined using a maximum-voting approach for optimal performance. The results show that the SMC model developed by fusing the vocal tract system, excitation source, and prosodic features yields the best accuracy of 98%. Finally, the proposed speech mode classifier is integrated into the PRS, and the phone recognition accuracy is observed to improve by 11.08%.
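To make the pipeline concrete, the sketch below illustrates the utterance-level flow described above: MFCC features pooled over voiced regions, fed to a hard (maximum) voting ensemble of the four classifiers named in the abstract. This is a minimal illustration, not the authors' implementation: it assumes librosa and scikit-learn, uses librosa's energy-based splitting as a crude stand-in for the paper's voiced-region detection, omits the MPDSS/RMFCC excitation-source and prosodic features, and leaves all hyperparameters at defaults.

```python
# Minimal sketch (not the authors' code): MFCCs from voiced/active regions,
# plus majority-vote fusion of ANN, naive Bayes, SVM, and KNN classifiers.
import numpy as np
import librosa
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def mfcc_from_voiced(path, n_mfcc=13, top_db=30):
    """Mean MFCC vector over the non-silent (roughly voiced) regions."""
    y, sr = librosa.load(path, sr=None)
    # Energy-based splitting as a crude proxy for voiced-region detection.
    intervals = librosa.effects.split(y, top_db=top_db)
    voiced = np.concatenate([y[s:e] for s, e in intervals])
    mfcc = librosa.feature.mfcc(y=voiced, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # one utterance-level feature vector

# Hard (maximum) voting over the four classifiers named in the abstract.
ensemble = VotingClassifier(
    estimators=[
        ("ann", make_pipeline(StandardScaler(), MLPClassifier(max_iter=500))),
        ("nb", GaussianNB()),
        ("svm", make_pipeline(StandardScaler(), SVC())),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    voting="hard",  # each model casts one vote; the majority label wins
)

# X: one feature vector per utterance; y: mode labels
# ("read", "extempore", "conversation").
# ensemble.fit(X_train, y_train); y_pred = ensemble.predict(X_test)
```

Hard voting mirrors the paper's maximum-voting fusion: each classifier casts one vote per utterance, and the majority label among read, extempore, and conversation is taken as the predicted speech mode.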




Author information


Corresponding author

Correspondence to Kumud Tripathi.


About this article


Cite this article

Tripathi, K., Rao, K.S. Improvement of phone recognition accuracy using speech mode classification. Int J Speech Technol 21, 489–500 (2018). https://doi.org/10.1007/s10772-017-9483-4

