Phonetically rich and balanced text and speech corpora for Arabic language

  • Original Paper
Language Resources and Evaluation

Abstract

This paper describes the preparation, recording, analysis, and evaluation of a new speech corpus for Modern Standard Arabic (MSA). The corpus contains 415 sentences recorded by 40 native Arabic speakers (20 male and 20 female) from 11 Arab countries representing three major regions (Levant, Gulf, and Africa). Of these, 367 sentences are phonetically rich and balanced and are used for training Arabic Automatic Speech Recognition (ASR) systems. "Rich" means that the sentence set must contain all phonemes of the Arabic language, and "balanced" means that it must preserve the phonetic distribution of the language. The remaining 48 sentences were created for testing; they are mostly foreign to the training sentences and share hardly any words with them. To evaluate the corpus, Arabic ASR systems were developed using the Carnegie Mellon University (CMU) Sphinx 3 tools for both training and testing/decoding. The speech engine uses Hidden Markov Models (HMMs) with three emitting states for triphone-based acoustic models. Based on experimental analysis of about 8 h of training speech data, the acoustic model performs best with a continuous observation probability model of 16 Gaussian mixture distributions and state distributions tied to 500 senones. The language model contains unigrams, bigrams, and trigrams. For the same speakers with different sentences, the Arabic ASR systems obtained an average Word Error Rate (WER) of 9.70%. For different speakers with the same sentences, the average WER was 4.58%, and for different speakers with different sentences it was 12.39%.
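The two corpus properties defined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' selection procedure: the function names, the toy three-phoneme inventory, the reference distribution, and the use of total variation distance as an imbalance measure are all assumptions made for illustration.

```python
from collections import Counter


def phoneme_distribution(sentences):
    """Relative frequency of each phoneme over a list of phoneme sequences."""
    counts = Counter(p for sentence in sentences for p in sentence)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {p: c / total for p, c in counts.items()}


def is_rich(sentences, inventory):
    """True if every phoneme in the inventory occurs at least once."""
    seen = {p for sentence in sentences for p in sentence}
    return set(inventory) <= seen


def imbalance(sentences, reference):
    """Total variation distance between the corpus distribution and a
    reference distribution (0.0 means identical). This particular
    measure is an assumption, not taken from the paper."""
    observed = phoneme_distribution(sentences)
    phonemes = set(observed) | set(reference)
    return 0.5 * sum(
        abs(observed.get(p, 0.0) - reference.get(p, 0.0)) for p in phonemes
    )


# Hypothetical toy data: three phoneme symbols and an assumed
# language-wide reference distribution over them.
reference = {"a": 0.5, "b": 0.25, "k": 0.25}
corpus = [["a", "b", "a"], ["k"]]

print(is_rich(corpus, reference))    # covers the whole inventory?
print(imbalance(corpus, reference))  # distance from the reference
```

In this sketch a candidate sentence set would count as rich when `is_rich` returns `True`, and as better balanced the closer `imbalance` is to zero; a real selection procedure would work from a full Arabic phoneme inventory and a measured reference distribution.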

Acknowledgments

We would like to extend our appreciation to the University of Malaya and the University of Jordan for funding this research work.

Author information

Corresponding author

Correspondence to Mohammad A. M. Abushariah.

About this article

Cite this article

Abushariah, M.A.M., Ainon, R.N., Zainuddin, R. et al. Phonetically rich and balanced text and speech corpora for Arabic language. Lang Resources & Evaluation 46, 601–634 (2012). https://doi.org/10.1007/s10579-011-9166-8

  • DOI: https://doi.org/10.1007/s10579-011-9166-8
