Constructing accurate and robust HMM/GMM models for an Arabic speech recognition system

Khelifa, Mohamed O. M.; Elhadj, Yahya Mohamed; Abdellah, Yousfi; Belkasmi, Mostafa

doi:10.1007/s10772-017-9456-7

Published: 20 September 2017

International Journal of Speech Technology, Volume 20, pages 937–949 (2017)

Mohamed O. M. Khelifa ORCID: orcid.org/0000-0003-3564-9291¹,
Yahya Mohamed Elhadj²,³,
Yousfi Abdellah⁴ &
…
Mostafa Belkasmi¹

Abstract

Conventional Hidden Markov Model (HMM) based Automatic Speech Recognition (ASR) systems generally use cepstral features as acoustic observations and phonemes as basic linguistic units. Mel-Frequency Cepstral Coefficients (MFCCs) are among the most powerful features currently used in ASR systems. Speech recognition is inherently difficult because the speech signal varies both within and across speakers. This variability causes several kinds of mismatch between acoustic features and acoustic models and hence degrades system performance. The sensitivity of MFCCs to this variability has motivated many researchers to investigate new sets of speech feature parameters that make the acoustic models more robust and thus improve system performance. Combining diverse acoustic feature sets has great potential to enhance the performance of ASR systems. This paper is part of an ongoing research effort to build an accurate Arabic ASR system for teaching and learning purposes. It addresses the integration of complementary features into standard HMMs in order to make them more robust and thus improve their recognition accuracy. The complementary features investigated in this work are voiced formants and pitch, used in combination with conventional MFCC features. A series of experiments under various combination strategies was performed to determine which of these integrated features significantly improve system performance. The Cambridge HTK tools were used as the development environment. Experimental results showed that the error rate decreased, and the results appear very promising even without language models.
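The idea of integrating pitch and formants with MFCCs can be illustrated as frame-level concatenation of feature vectors. The following is only a minimal sketch under stated assumptions, not the authors' method: the abstract does not describe the combination strategies, so the autocorrelation pitch estimator, the voicing threshold, the zero-filling of unvoiced frames, and the names `estimate_pitch` and `fuse_frame` are all hypothetical. The MFCC and formant values are placeholders.

```python
import math
from typing import List, Optional, Sequence


def estimate_pitch(frame: Sequence[float], sample_rate: int,
                   f_min: float = 60.0, f_max: float = 400.0,
                   voicing_threshold: float = 0.3) -> Optional[float]:
    """Estimate F0 (Hz) of one frame by autocorrelation peak picking.

    Returns None when the frame is judged unvoiced. The search range and
    the threshold are illustrative assumptions, not values from the paper.
    """
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return None  # silent frame: no pitch
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    if best_lag == 0 or best_value / energy < voicing_threshold:
        return None  # weak periodicity: treat as unvoiced
    return sample_rate / best_lag


def fuse_frame(mfcc: Sequence[float], pitch_hz: Optional[float],
               formants_hz: Sequence[float], n_formants: int = 3) -> List[float]:
    """Append pitch and formants to one frame's MFCC vector.

    Unvoiced frames get zeros in the extra slots (an assumption), so every
    fused vector has the same length, as an HMM/GMM observation requires.
    """
    if pitch_hz is None:
        extra = [0.0] * (1 + n_formants)
    else:
        if len(formants_hz) != n_formants:
            raise ValueError("expected %d formants" % n_formants)
        extra = [float(pitch_hz)] + [float(f) for f in formants_hz]
    return list(mfcc) + extra


# Synthetic example: a 200 Hz sine as a "voiced" frame and a silent frame.
sample_rate = 8000
voiced = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
silent = [0.0] * 400
mfcc = [0.0] * 12                          # placeholder MFCC vector
formants = [700.0, 1200.0, 2500.0]         # placeholder formant values

fused_voiced = fuse_frame(mfcc, estimate_pitch(voiced, sample_rate), formants)
fused_silent = fuse_frame(mfcc, estimate_pitch(silent, sample_rate), [])
```

Both fused vectors have 16 dimensions here (12 MFCC slots plus pitch and three formants), which keeps the observation size constant across voiced and unvoiced frames.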

Acknowledgements

This work uses the results (the speech database) of a project previously funded by King Abdulaziz City for Science and Technology (KACST) in Saudi Arabia under grant number “AT – 25–113”.

Author information

Authors and Affiliations

TES Research Team, ENSIAS College of Engineering, Mohammed V University of Rabat, Rabat, Morocco
Mohamed O. M. Khelifa & Mostafa Belkasmi
Doha Institute for Graduate Studies, Doha, Qatar
Yahya Mohamed Elhadj
Sabbatical Leave at IRIT Institute of Toulouse, Paul Sabatier University, Toulouse, France
Yahya Mohamed Elhadj
ERADIASS Research Team, FSJES of Souissi, Mohammed V University of Rabat, Rabat, Morocco
Yousfi Abdellah

Corresponding author

Correspondence to Mohamed O. M. Khelifa.

About this article

Cite this article

Khelifa, M.O.M., Elhadj, Y.M., Abdellah, Y. et al. Constructing accurate and robust HMM/GMM models for an Arabic speech recognition system. Int J Speech Technol 20, 937–949 (2017). https://doi.org/10.1007/s10772-017-9456-7

Received: 25 May 2017
Accepted: 11 September 2017
Published: 20 September 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s10772-017-9456-7
