Exploring and exploiting a historical corpus for Arabic

Hammo, Bassam; Yagi, Sane; Ismail, Omaima; AbuShariah, Mohammad

doi:10.1007/s10579-015-9304-9

Project Notes
Published: 30 May 2015

Language Resources and Evaluation, Volume 50, pages 839–861 (2016)

Bassam Hammo ORCID: orcid.org/0000-0002-5270-7409¹,
Sane Yagi²,
Omaima Ismail¹ &
Mohammad AbuShariah¹

Abstract

This paper presents a historical Arabic corpus named HAC. At this early stage of the project, we report on the design, the architecture, and some of the experiments we have conducted on HAC. The corpus, and accordingly the search results, will be represented in a primary XML exchange format. This format will serve as an intermediate exchange tool within the project and will allow users to process the results offline with external tools. HAC is made up of Classical Arabic texts that cover 1600 years of language use, the Quranic text, Modern Standard Arabic texts, and a variety of monolingual Arabic dictionaries. The development of this historical corpus helps linguists and Arabic language learners to explore, understand, and discover knowledge hidden in millions of instances of language use. We used natural language processing techniques to process the data and a graph-based representation for the corpus. We also provide researchers with an export facility that makes further linguistic analysis possible.
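The abstract does not give the schema of the XML exchange format, so the following is only a minimal sketch of how search results might be exported to XML and read back for offline processing. The element and attribute names (`results`, `hit`, `left`, `keyword`, `right`, `source`, `period`), the function names, and the sample values are all assumptions made for illustration; they are not taken from HAC.

```python
import xml.etree.ElementTree as ET


def hits_to_xml(query, hits):
    """Serialize search hits to an XML string (hypothetical schema, not HAC's)."""
    root = ET.Element("results", {"query": query, "count": str(len(hits))})
    for hit in hits:
        # Each hit records where it came from and its keyword-in-context parts.
        el = ET.SubElement(
            root, "hit", {"source": hit["source"], "period": hit["period"]}
        )
        ET.SubElement(el, "left").text = hit["left"]
        ET.SubElement(el, "keyword").text = hit["keyword"]
        ET.SubElement(el, "right").text = hit["right"]
    return ET.tostring(root, encoding="unicode")


def xml_to_hits(xml_text):
    """Parse the exported XML back into a list of dictionaries."""
    root = ET.fromstring(xml_text)
    return [
        {
            "source": h.get("source"),
            "period": h.get("period"),
            "left": h.findtext("left"),
            "keyword": h.findtext("keyword"),
            "right": h.findtext("right"),
        }
        for h in root.iter("hit")
    ]


# Invented sample hit, used only to show the round trip.
sample_hits = [
    {
        "source": "sample-document",
        "period": "classical",
        "left": "left context",
        "keyword": "كتاب",
        "right": "right context",
    }
]
exported = hits_to_xml("كتاب", sample_hits)
restored = xml_to_hits(exported)
```

An external tool that reads XML could consume `exported` directly, which is the kind of offline processing the abstract describes.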

Notes

http://nlp.stanford.edu/software/tagger.shtml.

References

A Representative Corpus of Historical English Registers (ARCHER). (2014). http://www.alc.manchester.ac.uk/subjects/lel/research/projects/archer/using-archer. Accessed 15 January 2015.
Abbès, R., & Dichy, J. (2008). AraConc, an Arabic concordance software based on the DIINAR.1 language resource. In The 6th international conference on informatics and systems, pp. 127–134.
Abu-Salem, H., Al-Omari, M., & Evens, M. W. (1999). Stemming methodologies over individual query words for an Arabic information retrieval system. Journal of the American Society for Information Science, 50(6), 524–529.
Alansary, S., Nagi, M., & Adly, N. (2007). Building an international corpus of Arabic (ICA): Progress of compilation stage. In 7th international conference on language engineering, Cairo, Egypt.
Alansary, S., Nagi, M., & Adly, N. (2008). Towards analyzing the international corpus of Arabic (ICA): Progress of morphological stage. In 8th international conference on language engineering, Cairo, Egypt.
Alrabiah, M., Al-Salman, A., & Atwell, E. (2013). The design and construction of the 50 million words KSUCCA. In The proceedings of the second workshop on arabic corpus linguistics (WACL-2), Lancaster University, UK.
Al-Sulaiti, L., & Atwell, E. S. (2006). The design of a corpus of contemporary Arabic. International Journal of Corpus Linguistics, 11(2), 135–171.
Al-Thubaity, A. O. (2014). A 700 M + Arabic corpus: KACST Arabic corpus design and construction. Language Resources and Evaluation. doi:10.1007/s10579-014-9284-1.
Attia, M., Pecina, P., Tounsi, L., Toral, A., & van Genabith, J. (2011). Lexical profiling for Arabic. In Proceedings of eLex, pp. 23–33.
Boella, M., Romani, F., Al-Raies, A., Solimando, C., & Lancioni, G. (2011). The SALAH project: Segmentation and linguistic analysis of Ḥadīṯ Arabic texts. Information Retrieval Technology, pp. 538–549.
Buckwalter, T. (2004). Buckwalter Arabic morphological analyzer version 2.0. Linguistic Data Consortium, Philadelphia. http://www.qamus.org/morphology.htm. Accessed 15 January 2015.
Dhaif, Shawqi. (1986). Tarikh Al-Adab Al-Arabi: Al-Asr Al-Jahili. Cairo: Dar Al-Maarif.
Dukes, K., & Habash, N. (2010). Morphological annotation of Quranic Arabic. In LREC.
Hajjar, M., Al-Hajjar, A., Zreik, K., & Gallinari, P. (2010). An improved structured and progressive electronic dictionary for the Arabic language: iSPEDAL. In Fifth international conference on internet and web applications and services (ICIW), pp. 489–495.
Hammo, B. (2009). Towards enhancing retrieval effectiveness of search engines for diacritisized Arabic documents. Information Retrieval, 12(3), 300–323.
Hammo, B., Abuleil, S., Lytinen, S., & Evens, M. (2004). Experimenting with a question answering system for the Arabic language. Computers and the Humanities, 38(4), 397–415.
Hammo, B., Abu-Salem, H., & Lytinen, S. (2002). QARAB: A question answering system to support the Arabic language. In Proceedings of the ACL-02 workshop on computational approaches to Semitic languages, pp. 1–11.
Hammo, B., Al-Shargi, F., Yagi, S. & Obeid, N. (2013). Developing tools for Arabic corpus for researchers. In The proceedings of the second workshop on Arabic corpus linguistics (WACL-2), Lancaster University, UK.
Helsinki Corpus of English Texts. (2011). Department of Modern Languages, University of Helsinki. http://www.helsinki.fi/varieng/CoRD/corpora/HelsinkiCorpus/HC_XML.html. Accessed 15 January 2015.
Hourani, A. (2013). A history of the Arab peoples: Updated edition. London: Faber and Faber.
Ide, N., Bonhomme, P., & Romary, L. (2000). XCES: An XML-based standard for linguistic corpora. In Proceedings of the second language resources and evaluation conference (LREC).
Khoja, S., & Garside, R. (1999). Stemming Arabic text. Computing Department, Lancaster University: Lancaster.
König, E., & Siemund, P. (1999). Intensifiers as targets and sources of semantic change. In Andreas Blank & Peter Koch (Eds.), Historical semantics and cognition. Berlin: Walter de Gruyter.
Francis, W. N., & Kučera, H. (1982). Frequency analysis of English usage: Lexicon and grammar. Boston: Houghton Mifflin.
Pilz, T., Ernst-Gerlach, A., Kempken, S., Rayson, P., & Archer, D. (2008). The identification of spelling variants in English and German historical texts: Manual or automatic? Literary and Linguistic Computing, 23(1), 65–72.
Piotrowski, M. (2012). Natural language processing for historical texts. Synthesis Lectures on Human Language Technologies, 5(2), 1–157.
Rayson, P., Archer, D., Baron, A., Culpeper, J., & Smith, N. (2007). Tagging the bard: Evaluating the accuracy of a modern POS tagger on early modern English corpora. In Proceedings of corpus linguistics 2007, University of Birmingham, UK.
Roberts, A., Al-Sulaiti, L., & Atwell, E. (2006). aConCorde: Towards an open-source, extendable concordancer for Arabic. Corpora, 1(1), 39–60.
Rögnvaldsson, E., & Helgadóttir, S. (2008). Morphological tagging of Old Norse texts and its use in studying syntactic variation and change. In Proceedings of the LREC 2008 workshop on language technology for cultural heritage data (LaTeCH 2008). ELRA, Paris.
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24, 513–523.
Sánchez-Marco, C., Boleda Torrent, G., Fontana, J. M., & Domingo, J. (2010). Annotation and representation of a diachronic corpus of Spanish. In Proceedings of the seventh conference on international language resources and evaluation (LREC’10), Malta.
Schacht, J., & Bosworth, C. E. (1974). The legacy of Islam. Oxford: Oxford University Press.
Sharaf, A. & Atwell, E. (2012). QurAna: Corpus of the Quran annotated with Pronominal Anaphora. In LREC, pp. 130–137.
Toutanova, K., Klein, D., Manning, C., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of HLT-NAACL 2003, pp. 252–259.
Yagi, S. & Ghodhaya, M. (2014). Culture from a historical semantic perspective. Al-Majalla Al-Thaqafiya, 85, University of Jordan, pp. 86–119.
Yang, Y. M. (1995). Noise reduction in a statistical approach to text categorization. In Proceedings of SIGIR-95, 18th ACM international conference on research and development in information retrieval, pp. 256–263.

Acknowledgments

The authors would like to thank the graduate students of the Linguistics Department at the University of Jordan for their help in compiling the Arabic historical corpus. We also sincerely thank the anonymous reviewers of the first submission for their thoughtful comments, which helped us improve our work.

Author information

Authors and Affiliations

Computer Information Systems Department, King Abdullah II School for Information Technology, University of Jordan, Amman, Jordan
Bassam Hammo, Omaima Ismail & Mohammad AbuShariah
Linguistics Department, University of Jordan, Amman, Jordan
Sane Yagi

Corresponding author

Correspondence to Bassam Hammo.

About this article

Cite this article

Hammo, B., Yagi, S., Ismail, O. et al. Exploring and exploiting a historical corpus for Arabic. Lang Resources & Evaluation 50, 839–861 (2016). https://doi.org/10.1007/s10579-015-9304-9

Published: 30 May 2015
Issue Date: December 2016
DOI: https://doi.org/10.1007/s10579-015-9304-9
