Improvement of phone recognition accuracy using speech mode classification

Tripathi, Kumud; Rao, K. Sreenivasa

doi:10.1007/s10772-017-9483-4

Published: 07 December 2017

Volume 21, pages 489–500 (2018)

International Journal of Speech Technology

Abstract

In this work, we develop a speech mode classification (SMC) model to improve the performance of a phone recognition system (PRS). We explore vocal tract system, excitation source, and prosodic features for building the SMC model. These features are extracted from the voiced regions of a speech signal. Conversation, extempore, and read speech are considered as three different modes of speech. The vocal tract component of speech is represented by Mel-frequency cepstral coefficients (MFCCs). The excitation source features are captured through Mel power differences of spectrum in sub-bands (MPDSS) and residual Mel-frequency cepstral coefficients (RMFCCs) of the speech signal. Prosodic information is extracted from pitch and intensity. Speech mode classification models are developed using the above features independently and in fusion. Experiments are carried out on a Bengali speech corpus to analyze the accuracy of the speech mode classification model using an artificial neural network (ANN), naive Bayes, support vector machines (SVMs), and k-nearest neighbor (KNN). We propose four classification models, which are combined using a maximum voting approach for optimal performance. The results show that the speech mode classification model developed using the fusion of vocal tract system, excitation source, and prosodic features yields the best performance of 98%. Finally, the proposed speech mode classifier is integrated into the PRS, and the accuracy of the phone recognition system is observed to improve by 11.08%.
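The maximum voting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each of the four classifiers (ANN, naive Bayes, SVM, KNN) has already produced one mode label per utterance; the function names `max_vote` and `fuse`, the tie-breaking rule, and the example labels are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# The three speech modes named in the abstract.
MODES = ("conversation", "extempore", "read")


def max_vote(predictions):
    """Return the label chosen by the most classifiers.

    Assumption: ties go to the label of the earliest classifier in the
    list. The abstract does not say how ties are resolved.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def fuse(outputs):
    """Combine per-classifier label lists into one label per utterance.

    `outputs` maps a classifier name to its list of predicted labels,
    with one label per utterance and the same order in every list.
    """
    columns = list(outputs.values())
    if len({len(column) for column in columns}) != 1:
        raise ValueError("all classifiers must label the same utterances")
    return [max_vote(list(votes)) for votes in zip(*columns)]


# Made-up predictions for three utterances, for illustration only.
outputs = {
    "ann": ["read", "conversation", "extempore"],
    "naive_bayes": ["read", "extempore", "extempore"],
    "svm": ["extempore", "conversation", "read"],
    "knn": ["read", "conversation", "conversation"],
}

fused = fuse(outputs)
print(fused)  # ['read', 'conversation', 'extempore']
```

With four voters and three classes, ties are possible (for example two votes each for two modes), so any real implementation would need an explicit tie rule such as the one assumed here or a fallback to the most accurate single classifier.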

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, 721302, India
Kumud Tripathi & K. Sreenivasa Rao

Corresponding author

Correspondence to Kumud Tripathi.

About this article

Cite this article

Tripathi, K., Rao, K.S. Improvement of phone recognition accuracy using speech mode classification. Int J Speech Technol 21, 489–500 (2018). https://doi.org/10.1007/s10772-017-9483-4

Received: 24 July 2017
Accepted: 10 November 2017
Published: 07 December 2017
Issue Date: September 2018
DOI: https://doi.org/10.1007/s10772-017-9483-4
