A 700M+ Arabic corpus: KACST Arabic corpus design and construction

  • Original Paper
Language Resources and Evaluation

Abstract

Compared with English, Arabic is a poorly resourced language within the field of corpus linguistics. A lack of sufficient data and research has negatively affected Arabic corpus-based researchers and natural language processing practitioners. Although a number of Arabic corpora have been developed in recent years, the overall situation has improved little. The aim of this paper is twofold. First, it reviews 14 Arabic corpora, categorized by their designated purpose, target language, mode of text, size, text date, location, text type/medium, text domain, representativeness, and balance. The review also describes the availability of the reviewed corpora, the presence of tokenization, lemmatization, and tagging, and whether any tools are available to search and explore them. Second, it introduces the King Abdulaziz City for Science and Technology (KACST) Arabic corpus, which was designed and created to overcome the limitations of existing Arabic corpora. The KACST Arabic corpus is a large and diverse Arabic corpus with clearly defined design criteria. It is carefully sampled; its contents are classified by time, region, medium, domain, and topic, and it can be searched and explored using these classifications. The corpus comprises more than 700 million words from the pre-Islamic era to the present day (a period covering more than 1,500 years), collected from 10 diverse mediums. Within each medium, every text is further classified by domain and topic. The KACST Arabic corpus is freely available to explore on the Internet (http://www.kacstac.org.sa) using a variety of tools.
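
The classification scheme described in the abstract (time, region, medium, domain, topic) can be pictured as metadata attached to each text, with search expressed as a filter over that metadata. The following is a minimal sketch under stated assumptions, not the corpus's actual schema or search tool: the `TextRecord` class, its field names, the `filter_texts` and `total_words` helpers, and all sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical metadata for one corpus text (not the real KACST schema)."""
    text_id: str
    period: str      # time classification
    region: str
    medium: str
    domain: str
    topic: str
    word_count: int


def filter_texts(records, **criteria):
    """Return the records whose metadata equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


def total_words(records):
    """Sum the word counts of the given records."""
    return sum(r.word_count for r in records)


# Illustrative sample data only; the values are invented for this sketch.
records = [
    TextRecord("sample-1", "modern", "Saudi Arabia", "newspaper", "news", "economy", 1200),
    TextRecord("sample-2", "modern", "Egypt", "newspaper", "news", "sports", 800),
    TextRecord("sample-3", "classical", "unknown", "book", "literature", "poetry", 5000),
]

newspapers = filter_texts(records, medium="newspaper")
newspaper_words = total_words(newspapers)
```

Combining criteria, for example `filter_texts(records, medium="newspaper", region="Egypt")`, narrows the selection in the same way that the corpus classifications are described as narrowing a search.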

Acknowledgments

This project was fully funded by the King Abdulaziz City for Science and Technology via Grants Number (531-31) and (33-824). The author would like to thank the three anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper.

Author information

Corresponding author

Correspondence to Abdulmohsen O. Al-Thubaity.

Cite this article

Al-Thubaity, A.O. A 700M+ Arabic corpus: KACST Arabic corpus design and construction. Lang Resources & Evaluation 49, 721–751 (2015). https://doi.org/10.1007/s10579-014-9284-1