Arabic speech recognition using SPHINX engine

Hyassat, Hussein; Abu Zitar, Raed

doi:10.1007/s10772-008-9009-1

Published: 28 October 2008

Volume 9, pages 133–150, (2006)

International Journal of Speech Technology

Hussein Hyassat¹ &
Raed Abu Zitar²


Abstract

Although there are an estimated 250 million Arabic speakers in the Arab world, Arabic speech recognition has received little research attention compared with other languages of similar importance (e.g. Mandarin). Because diacritized Arabic text and a Pronunciation Dictionary (PD) have been lacking, most previous work on Arabic Automatic Speech Recognition has concentrated on recognizers that use Romanized characters: the system recognizes the Arabic word as if it were an English one, then maps it to the Arabic word through a lookup table that pairs each Arabic word with its Romanized pronunciation.

In this work, we introduce the first SPHINX-IV-based Arabic recognizer and propose an automatic toolkit capable of producing a PD for both the Holy Qur'an and standard Arabic. Three corpora were developed entirely in this work: the Holy Qur'an Corpus HQC-1 (about 18.5 hours), the command and control corpus CAC-1 (about 1.5 hours), and the Arabic digits corpus ADC (less than one hour of speech). The building process is described in full. Fully diacritized Arabic transcriptions were also developed for all three corpora.
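To illustrate why fully diacritized text makes an automatically generated PD feasible, the following is a minimal sketch of a rule-based letter-to-phone mapping. It is not the authors' toolkit: the tables `CONSONANTS` and `SHORT_VOWELS`, the phone symbols, and the function names are all assumptions chosen for illustration, and the sketch ignores long vowels, shadda, tanween, and the phonological rules a real Arabic dictionary builder would need.

```python
# Hypothetical grapheme-to-phone tables; the phone symbols are illustrative only.
CONSONANTS = {
    "\u0643": "K",   # kaf
    "\u062a": "T",   # ta
    "\u0628": "B",   # ba
    "\u062f": "D",   # dal
    "\u0631": "R",   # ra
    "\u0633": "S",   # seen
}
SHORT_VOWELS = {
    "\u064e": "AE",  # fatha
    "\u064f": "UH",  # damma
    "\u0650": "IH",  # kasra
}
SUKUN = "\u0652"     # marks a consonant that is not followed by a vowel


def pronounce(word):
    """Map a diacritized word to a list of phone symbols, one character at a time."""
    phones = []
    for ch in word:
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
        elif ch in SHORT_VOWELS:
            phones.append(SHORT_VOWELS[ch])
        elif ch == SUKUN:
            continue  # sukun contributes no phone of its own
        else:
            raise ValueError(f"no rule for character U+{ord(ch):04X}")
    return phones


def build_dictionary(words):
    """Return {word: 'PH1 PH2 ...'}, i.e. one pronunciation entry per word."""
    return {word: " ".join(pronounce(word)) for word in words}


if __name__ == "__main__":
    kataba = "\u0643\u064e\u062a\u064e\u0628\u064e"  # kaf+fatha, ta+fatha, ba+fatha
    print(build_dictionary([kataba])[kataba])        # prints: K AE T AE B AE
```

Because every short vowel is written explicitly, each word yields exactly one phone sequence; without diacritics the same consonant skeleton would be ambiguous, which is the gap the abstract describes.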

The SPHINX-IV engine was customized and trained for both the language model and the lexicon modules shown in the framework architecture block diagram on the next page.

Using the three corpora, the PD produced by our automatic tool, and the transcripts, the SPHINX-IV engine was trained and tuned to develop three acoustic models, one for each corpus. Training is based on an HMM built from statistical information and random-variable distributions extracted from the training data itself. A new algorithm is proposed for adding unlabeled data to the training corpus in order to increase its size. The algorithm relies on a neural network confidence scorer, which annotates the decoded speech to decide whether the proposed transcript is accepted and can be added to the seed corpus.
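The corpus-growing step described above can be sketched as a confidence-gated loop. This is only an outline under stated assumptions: `decode` stands in for the trained recognizer, `confidence` stands in for the neural network scorer, and the threshold of 0.9 is a placeholder; none of these names or values come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Utterance:
    audio_id: str
    transcript: str


def augment_corpus(
    seed: List[Utterance],
    unlabeled_ids: List[str],
    decode: Callable[[str], str],             # hypothetical: audio id -> hypothesis
    confidence: Callable[[str, str], float],  # hypothetical: (audio id, hypothesis) -> score in [0, 1]
    threshold: float = 0.9,                   # placeholder value, not from the paper
) -> Tuple[List[Utterance], List[str]]:
    """Decode each unlabeled utterance and keep it only if the scorer accepts it."""
    accepted = list(seed)  # copy so the caller's seed corpus is not modified
    rejected = []
    for audio_id in unlabeled_ids:
        hypothesis = decode(audio_id)
        if confidence(audio_id, hypothesis) >= threshold:
            accepted.append(Utterance(audio_id, hypothesis))
        else:
            rejected.append(audio_id)
    return accepted, rejected


if __name__ == "__main__":
    # Stub recognizer and scorer, for demonstration only.
    fake_hyps = {"u1": "open file", "u2": "close window"}
    fake_scores = {"u1": 0.95, "u2": 0.40}
    corpus, left_out = augment_corpus(
        seed=[Utterance("s1", "start")],
        unlabeled_ids=["u1", "u2"],
        decode=fake_hyps.__getitem__,
        confidence=lambda audio_id, hyp: fake_scores[audio_id],
    )
    print([u.audio_id for u in corpus], left_out)
```

In practice the acoustic model would be retrained on the enlarged corpus and the loop repeated; the threshold trades corpus growth against the risk of training on wrong transcripts.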

The model parameters were fine-tuned using a simulated annealing algorithm; optimum values were tested and reported. Our main contribution is the use of the open source SPHINX-IV engine for Arabic speech recognition, building our own language and acoustic models without Romanizing the Arabic speech. The system was fine-tuned and the data refined for training and validation. Optimum values for the number of Gaussian mixture distributions and the number of HMM states were found according to specified performance measures, and optimum values for the confidence scores were found for the training data. Although much more work needs to be done to complete work of this scale, we consider the corpus used in our system sufficient to validate our approach. SPHINX has never been used before in this manner for Arabic speech recognition. The work is an invitation to all open source speech recognition developers and groups to take over and capitalize on what we have started.
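As a rough illustration of tuning the number of Gaussian mixtures and HMM states with simulated annealing, here is a generic sketch over a small discrete grid. The grid values, the cooling schedule, and `toy_cost` are assumptions made for this example; in a real setup the cost would be an error measure obtained by training and evaluating a model for each configuration, and the paper's actual settings are not reproduced here.

```python
import math
import random

# Hypothetical search grid; these values are not taken from the paper.
MIXTURES = [1, 2, 4, 8, 16, 32]
STATES = [3, 4, 5, 6, 7]


def neighbors(cfg):
    """Configurations one grid step away from cfg = (mixture index, state index)."""
    i, j = cfg
    moves = [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
    return [(a, b) for a, b in moves
            if 0 <= a < len(MIXTURES) and 0 <= b < len(STATES)]


def toy_cost(cfg):
    """Stand-in for a measured error rate; a real run would train and score a model."""
    i, j = cfg
    return (MIXTURES[i] - 8) ** 2 / 64 + (STATES[j] - 5) ** 2


def simulated_annealing(cost, start, neighbors, steps=200, t0=1.0, cooling=0.95, seed=0):
    """Minimize cost by random local moves, accepting worse moves with
    probability exp(-delta / temperature) under a geometric cooling schedule."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    temp = t0
    for _ in range(steps):
        candidate = rng.choice(neighbors(current))
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temp *= cooling
    return best, best_cost


if __name__ == "__main__":
    (i, j), err = simulated_annealing(toy_cost, (0, 0), neighbors)
    print(f"mixtures={MIXTURES[i]}, states={STATES[j]}, cost={err:.3f}")
```

Accepting occasional uphill moves early on lets the search leave poor local configurations, while the falling temperature makes it increasingly greedy toward the end.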

Author information

Authors and Affiliations

Arab Academy of Business and Financial Sciences, Amman, Jordan
Hussein Hyassat
School of Computing and Engineering, New York Institute of Technology, Amman, Jordan
Raed Abu Zitar

Corresponding author

Correspondence to Raed Abu Zitar.

About this article

Cite this article

Hyassat, H., Abu Zitar, R. Arabic speech recognition using SPHINX engine. Int J Speech Technol 9, 133–150 (2006). https://doi.org/10.1007/s10772-008-9009-1

Received: 01 October 2008
Accepted: 09 October 2008
Published: 28 October 2008
Issue Date: December 2006
DOI: https://doi.org/10.1007/s10772-008-9009-1
