Abstract
This paper presents a historical Arabic corpus named HAC. At this early stage of the project, we report on the design, the architecture, and some of the experiments that we have conducted on HAC. The corpus, and accordingly the search results, will be represented in a primary XML exchange format. This format will serve as an intermediate exchange tool within the project and will allow the user to process the results offline with external tools. HAC is made up of Classical Arabic texts that cover 1600 years of language use, the Quranic text, Modern Standard Arabic texts, and a variety of monolingual Arabic dictionaries. This historical corpus helps linguists and Arabic language learners explore, understand, and discover knowledge hidden in millions of instances of language use. We used techniques from the field of natural language processing to process the data, and a graph-based representation for the corpus. We also provided researchers with an export facility to make further linguistic analysis possible.






Acknowledgments
The authors would like to thank the graduate students of the Linguistics Department at the University of Jordan for their help in compiling the Arabic historical corpus. We would also like to sincerely thank the anonymous reviewers of the first submission for their thoughtful comments, which helped improve our work.
Cite this article
Hammo, B., Yagi, S., Ismail, O. et al. Exploring and exploiting a historical corpus for Arabic. Lang Resources & Evaluation 50, 839–861 (2016). https://doi.org/10.1007/s10579-015-9304-9
DOI: https://doi.org/10.1007/s10579-015-9304-9