A 700M+ Arabic corpus: KACST Arabic corpus design and construction

Al-Thubaity, Abdulmohsen O.

doi:10.1007/s10579-014-9284-1

Original Paper
Published: 16 October 2014

Language Resources and Evaluation, Volume 49, pages 721–751 (2015)

Abdulmohsen O. Al-Thubaity¹

Abstract

Compared with English, Arabic is a poorly resourced language within the field of corpus linguistics. A lack of sufficient data and research has negatively affected Arabic corpus-based researchers and natural language processing practitioners. Although a number of Arabic corpora have been developed in recent years, the overall situation has improved little. The aim of this paper is twofold. First, it reviews 14 Arabic corpora categorized by their designated purpose, target language, mode of text, size, text date, location, text type/medium, text domain, representativeness, and balance. The review also describes the availability of the reviewed corpora, the presence of tokenization, lemmatization, and tagging, and whether any tools are available to search and explore them. Second, it introduces the King Abdulaziz City for Science and Technology (KACST) Arabic corpus, which was designed and created to overcome the limitations of existing Arabic corpora. The KACST Arabic corpus is a large and diverse Arabic corpus with clearly defined design criteria. It is carefully sampled; its contents are classified by time, region, medium, domain, and topic; and it can be searched and explored using these classifications. The corpus comprises more than 700 million words from the pre-Islamic era to the present day (a period covering more than 1,500 years), collected from 10 diverse mediums. Each text is further classified into domains and topics. The KACST Arabic corpus is freely available to explore on the Internet (http://www.kacstac.org.sa) using a variety of tools.

Acknowledgments

This project was fully funded by the King Abdulaziz City for Science and Technology through grant numbers 531-31 and 33-824. The author would like to thank the three anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper.

Author information

Authors and Affiliations

King Abdulaziz City for Science and Technology, P O Box 6086, Riyadh, 11442, Saudi Arabia
Abdulmohsen O. Al-Thubaity

Corresponding author

Correspondence to Abdulmohsen O. Al-Thubaity.

About this article

Cite this article

Al-Thubaity, A.O. A 700M+ Arabic corpus: KACST Arabic corpus design and construction. Lang Resources & Evaluation 49, 721–751 (2015). https://doi.org/10.1007/s10579-014-9284-1

Published: 16 October 2014
Issue Date: September 2015
DOI: https://doi.org/10.1007/s10579-014-9284-1
