Curras: an annotated corpus for the Palestinian Arabic dialect

  • Original Paper
  • Language Resources and Evaluation

Abstract

In this article we present Curras, the first morphologically annotated corpus of the Palestinian Arabic dialect. Palestinian Arabic is one of the many primarily spoken dialects of the Arabic language. Arabic dialects are generally under-resourced compared to Modern Standard Arabic, the primarily written and official form of Arabic. We begin with a background description that situates Palestinian Arabic linguistically and historically and compares it to Modern Standard Arabic and Egyptian Arabic in terms of phonological, morphological, orthographic, and lexical variations. We then describe the methodology we developed to collect Palestinian Arabic text so as to cover a variety of representative domains and genres. We also discuss the annotation process, which extended previous efforts in annotation guideline development and used existing automatic annotation solutions for Standard Arabic and Egyptian Arabic. The annotation guidelines and annotation metadata are described in detail. The Curras Palestinian Arabic corpus consists of more than 56K tokens, which are annotated with rich morphological and lexical features. The inter-annotator agreement results indicate a high degree of consistency.

Notes

  1. Arabic orthographic transliterations are provided in the Habash-Soudi-Buckwalter (HSB) scheme (Habash et al. 2007), except where indicated. HSB extends Buckwalter’s transliteration scheme (Buckwalter 2004) to increase its readability while maintaining the 1-to-1 correspondence with Arabic orthography as represented in standard encodings of Arabic, such as Unicode. The following are the only differences from Buckwalter’s scheme (indicated in parentheses): Ā آ (|), Â أ (>), ŵ ؤ (&), Ǎ إ (<), ŷ ئ (}), ħ ة (p), θ ث (v), ð ذ (*), š ش ($), Ď ظ (Z), ς ع (E), γ غ (g), ý ى (Y), ã ـً (F), ũ ـٌ (N), ĩ ـٍ (K). Orthographic transliterations are presented in italics. For phonological transcriptions, we follow the common practice of using ‘/…/’ to represent phonological sequences, and we use HSB choices with some extensions instead of the International Phonetic Alphabet (IPA) to minimize the number of representations used, as was done by Habash (2010). Arabic is written from right to left and with optional diacritics that are mostly used to mark vowels. Examples are vowelized as needed.

  2. Curras Portal http://portal.sina.birzeit.edu/curras.
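
The correspondence listed in Note 1 can be sketched as a small lookup table. The following is a minimal illustration, not code from the article: the names `HSB_TO_BW`, `hsb_to_buckwalter`, and `buckwalter_to_hsb` are hypothetical, and the table contains only the sixteen symbols that the note gives as differing between the two schemes. All other symbols are assumed to be shared and are passed through unchanged.

```python
import unicodedata

# HSB -> Buckwalter, restricted to the sixteen differences listed in Note 1.
# Every symbol not listed here is assumed to be identical in both schemes.
_DIFFERENCES = {
    "Ā": "|", "Â": ">", "ŵ": "&", "Ǎ": "<", "ŷ": "}", "ħ": "p",
    "θ": "v", "ð": "*", "š": "$", "Ď": "Z", "ς": "E", "γ": "g",
    "ý": "Y", "ã": "F", "ũ": "N", "ĩ": "K",
}

# Normalize keys to NFC so each one is a single code point,
# which is what str.maketrans requires for dictionary keys.
HSB_TO_BW = {unicodedata.normalize("NFC", k): v for k, v in _DIFFERENCES.items()}
BW_TO_HSB = {bw: hsb for hsb, bw in HSB_TO_BW.items()}

_HSB_TABLE = str.maketrans(HSB_TO_BW)
_BW_TABLE = str.maketrans(BW_TO_HSB)


def hsb_to_buckwalter(text: str) -> str:
    """Convert an HSB transliteration to Buckwalter."""
    # NFC first, so a base letter followed by a combining mark
    # matches the precomposed key in the table.
    return unicodedata.normalize("NFC", text).translate(_HSB_TABLE)


def buckwalter_to_hsb(text: str) -> str:
    """Convert a Buckwalter transliteration to HSB."""
    return text.translate(_BW_TABLE)


print(hsb_to_buckwalter("šms"))  # $ms
```

Because the two schemes are in 1-to-1 correspondence, the reverse table is obtained by inverting the dictionary. The sketch assumes the input contains only transliterated Arabic; any embedded Latin-script text would be converted as well.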

References

  • Abdul-Mageed, M., & Diab, M. (2014). SANA: A large scale multi-genre, multi-dialect lexicon for Arabic subjectivity and sentiment analysis. In Proceedings of the ninth international conference on language resources and evaluation (LREC’14), European language resources association (ELRA), Reykjavik, Iceland (pp. 1162–1169).

  • Abdul-Mageed, M., Kübler, S., & Diab, M. (2012). SAMAR: A system for subjectivity and sentiment analysis of Arabic social media. In Proceedings of the 3rd workshop in computational approaches to subjectivity and sentiment analysis, association for computational linguistics, Jeju, Korea (pp. 19–28).

  • Alansary, S., Nagi, M., & Adly, N. (2007). Building an international corpus of Arabic (ICA): Progress of compilation stage. In The 7th international conference on language engineering, Cairo, Egypt.

  • Alkuhlani, S., & Habash, N. (2011). A corpus for modeling morpho-syntactic agreement in Arabic: Gender, number and rationality. In Proceedings of the association for computational linguistics: Human language technologies (pp. 357–362).

  • Al-Sabbagh, R., & Girju, R. (2010). Mining the web for the induction of a dialectical Arabic lexicon. In Proceedings of the seventh international conference on language resources and evaluation (LREC’10), European language resources association (ELRA), Malta (pp. 288–293).

  • Al-Sabbagh, R., & Girju, R. (2012). YADAC: Yet another dialectal Arabic corpus. In Proceedings of the eighth international conference on language resources and evaluation (LREC’12), European language resources association (ELRA), Istanbul, Turkey (pp. 2882–2889).

  • Al-Shargi, F., & Rambow, O. (2015). DIWAN: A dialectal word annotation tool for Arabic. In Proceedings of the second workshop on Arabic natural language processing, association for computational linguistics, Beijing, China (p. 49).

  • Al-Sughaiyer, I., & Al-Kharashi, I. (2004). Arabic morphological analysis techniques: A comprehensive survey. Journal of the American Society for Information Science and Technology, 55(3), 189–213.

  • Al-Sulaiti, L., & Atwell, E. S. (2006). The design of a corpus of contemporary Arabic. International Journal of Corpus Linguistics, 11(2), 135–171.

  • Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4), 555–596.

  • Attia, M. (2006). An ambiguity-controlled morphological analyzer for modern standard Arabic modelling finite state networks. In Proceedings of the challenges of Arabic for NLP/MT conference, The British Computer Society, London, UK (pp. 1–16).

  • Bakr, H. A., Shaalan, K., & Ziedan, I. (2008). A hybrid approach for converting written Egyptian colloquial dialect into diacritized Arabic. In The 6th international conference on informatics and systems, (INFOS2008), Cairo University, Cairo, Egypt (p. 72).

  • Beesley, K. R. (1996). Arabic finite-state morphological analysis and generation. In Proceedings of the 16th conference on computational linguistics (Vol. 1, pp. 89–94).

  • Bouamor, H., Habash, N., & Oflazer, K. (2014). A multidialectal parallel corpus of Arabic. In Proceedings of the ninth international conference on language resources and evaluation (LREC’14), European language resources association (ELRA), Reykjavik, Iceland (pp. 1240–1245).

  • Bruce, R. F., & Wiebe, J. (1998). Word-sense distinguishability and inter-coder agreement. In Proceedings of the empirical methods on natural language processing conference (EMNLP’98), association for computational linguistics, Granada, Spain (pp. 53–60).

  • Buckwalter, T. (2004). Buckwalter Arabic morphological analyzer: Version 2.0. LDC catalog number LDC2004L02. ISBN 1-58563-324-0.

  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.

  • Darwish, K. (2013). Arabizi detection and conversion to Arabic. arXiv preprint arXiv:1306.6755.

  • Di Eugenio, B., & Glass, M. (2004). The kappa statistic: A second look. Computational Linguistics, 30(1), 95–101.

  • Diab, M., Al-Badrashiny, M., Aminian, M., Attia, M., Dasigi, P., Elfardy, H., et al. (2014). Tharwa: A large scale dialectal Arabic-Standard Arabic-English lexicon. In Proceedings of the ninth international conference on language resources and evaluation (LREC’14), European language resources association (ELRA), Reykjavik, Iceland (pp. 3782–3789).

  • Diab, M., Habash, N., Rambow, O., Altantawy, M., & Benajiba, Y. (2010). COLABA: Arabic dialect annotation and processing. In LREC workshop on semitic language processing, Malta (pp. 66–74).

  • Diab, M., Hacioglu, K., & Jurafsky, D. (2007). Automated methods for processing Arabic text: From tokenization to base phrase chunking. In Arabic computational morphology: Knowledge-based and empirical methods. Kluwer/Springer.

  • Eskander, R., Al-Badrashiny, M., Habash, N., & Rambow, O. (2014). Foreign words and the automatic processing of Arabic social media text written in Roman script. In Proceedings of the empirical methods on natural language processing conference (EMNLP’14), Doha, Qatar (p. 1).

  • Eskander, R., Habash, N., Rambow, O., & Tomeh, N. (2013). Processing spontaneous orthography. In Proceedings of the North American chapter of the association for computational linguistics: Human language technologies (NAACL HLT’13), Atlanta, Georgia (pp. 585–595).

  • Gadalla, H., Kilany, H., Arram, H., Yacoub, A., El-Habashi, A., Shalaby, A., et al. (1997). CALLHOME Egyptian Arabic transcripts. LDC97T19. Web Download. Philadelphia: Linguistic Data Consortium.

  • Graff, D., Maamouri, M., Bouziri, B., Krouna, S., Kulick, S., & Buckwalter, T. (2009). Standard Arabic morphological analyzer (SAMA) version 3.1. In Linguistic Data Consortium LDC2009E73.

  • Gupta, M., Yadav, V., Husain, S., & Sharma, D. M. (2010). Partial parsing as a method to expedite dependency annotation of a Hindi treebank. In Proceedings of the seventh international conference on language resources and evaluation (LREC’10), Malta (pp. 1930–1935).

  • Habash, N. (2010). Introduction to Arabic natural language processing. Synthesis Lectures on Human Language Technologies, 3(1), 1–187.

  • Habash, N., Diab, M., & Rambow, O. (2012a). Conventional orthography for dialectal Arabic. In Proceedings of the eighth international conference on language resources and evaluation (LREC’12), European language resources association (ELRA), Istanbul, Turkey (pp. 711–718).

  • Habash, N., Jarrar, M., Alrimawi, F., Akra, D., Zalmout, N., Bartolotti, E., et al. (2016). Palestinian Arabic conventional orthography guidelines. Tech Report: Under preparation.

  • Habash, N., Eskander, R., & Hawwari, A. (2012b). A morphological analyzer for Egyptian Arabic. In Proceedings of the twelfth meeting of the special interest group on computational morphology and phonology, association for computational linguistics, Montreal, Canada (pp. 1–9).

  • Habash, N., & Rambow, O. (2005). Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd annual meeting on association for computational linguistics, Ann Arbor, Michigan, USA (pp. 573–580).

  • Habash, N., & Rambow, O. (2006). MAGEAD: A morphological analyzer and generator for the Arabic dialects. In Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the association for computational linguistics, association for computational linguistics, Sydney, Australia (pp. 681–688).

  • Habash, N., Rambow, O., & Roth, R. (2009). MADA + TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the 2nd international conference on Arabic language resources and tools (MEDAR), Cairo, Egypt (pp. 102–109).

  • Habash, N., & Roth, R. M. (2009). CATiB: The Columbia Arabic treebank. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing, association for computational linguistics, Beijing, China (pp. 221–224).

  • Habash, N., Roth, R., Rambow, O., Eskander, R., & Tomeh, N. (2013). Morphological analysis and disambiguation for dialectal Arabic. In Proceedings of the North American chapter of the association for computational linguistics (NAACL’13), Atlanta, Georgia (pp. 426–432).

  • Habash, N., Soudi, A., & Buckwalter, T. (2007). On Arabic transliteration. In Arabic computational morphology (pp. 15–22). New York: Springer.

  • Herzallah, R. (1990). Aspects of Palestinian Arabic phonology: A nonlinear approach. Ph.D., Cornell University, New York.

  • Holes, C. (2004). Modern Arabic: Structures, functions, and varieties. Washington, D.C.: Georgetown University Press.

  • Jarrar, M. (2006). Towards the notion of gloss, and the adoption of linguistic resources in formal ontology engineering. In Proceedings of the 15th International World Wide Web Conference (WWW2006). Edinburgh, Scotland (pp. 497–503). ACM Press.

  • Jarrar, M. (2011). Building a formal Arabic ontology (Invited Paper). In Proceedings of the experts meeting on Arabic ontologies and semantic networks. Alecso, Arab League. Tunis, July 26–28, 2011.

  • Jarrar, M., & Alrimawi, F. (2015a). Downloads. http://sina.birzeit.edu/projects/curras/downloads. Accessed 18 Aug 2015.

  • Jarrar, M., & Alrimawi, F. (2015b). Statistics and inter-annotator agreement calculations of the Palestinian dialect corpus—Curras. www.jarrar.info/publications/JR15.pdf.

  • Jarrar, M., Habash, N., Akra, D., & Zalmout, N. (2014). Building a corpus for Palestinian Arabic: A preliminary study. In Arabic natural language processing (ANLP) workshop, at the conference on empirical methods in natural language processing (EMNLP 2014), Doha, Qatar (p. 18).

  • Khalifa, S., Habash, N., Abdulrahim, D., & Hassan, S. (2016). A large scale corpus of Gulf Arabic. In Proceedings of the ninth international conference on language resources and evaluation (LREC’16). Portorož, Slovenia.

  • Kilany, H., Gadalla, H., Arram, H., Yacoub, A., El-Habashi, A., & McLemore, C. (2002). Egyptian colloquial Arabic lexicon. In LDC catalog number LDC99L22.

  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.

  • Lynn, T., Cetinoglu, O., Foster, J., Ui Dhonnchadha, E., Dras, M., & van Genabith, J. (2012). Irish treebanking and parsing: A preliminary evaluation. In Proceedings of the eighth international conference on language resources and evaluation (LREC’12), European language resources association (ELRA), Istanbul, Turkey (pp. 1939–1946).

  • Maamouri, M., Bies, A., Buckwalter, T., Diab, M., Habash, N., Rambow, O., et al. (2006). Developing and using a pilot dialectal Arabic treebank. In Proceedings of the fifth international conference on language resources and evaluation (LREC’06), European language resources association (ELRA), Genoa, Italy.

  • Maamouri, M., Bies, A., Buckwalter, T., & Mekki, W. (2004). The Penn Arabic Treebank: Building a large-scale annotated Arabic corpus. In NEMLAR conference on Arabic language resources and tools, Cairo, Egypt (pp. 102–109).

  • Maamouri, M., Bies, A., Kulick, S., Ciul, M., Habash, N., & Eskander, R. (2014). Developing an Egyptian Arabic Treebank: Impact of dialectal morphology on annotation and tool development. In Proceedings of the 9th international conference on language resources and evaluation (LREC’14), Reykjavik, Iceland (pp. 2348–2354).

  • Meftouh, K., Harrat, S., Jamoussi, S., Abbas, M., & Smaili, K. (2015). Machine translation experiments on PADIC: A parallel Arabic dialect corpus. In The 29th Pacific Asia conference on language, information and computation.

  • Mieskes, M., & Strube, M. (2006). Part-of-speech tagging of transcribed speech. In Proceedings of the conference on language resources and evaluation (LREC’06), Genoa, Italy (pp. 935–938).

  • Miller, G., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 235–244.

  • Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.

  • Olive, J., Christianson, C., & McCary, J. (Eds.). (2011). Handbook of natural language processing and machine translation: DARPA global autonomous language exploitation. Berlin: Springer Science & Business Media.

  • Parker, R., Graff, D., Chen, K., Kong, J., & Maeda, K. (2011). Arabic Gigaword fifth edition. In LDC2011T11. Philadelphia: Linguistic Data Consortium.

  • Pasha, A., Al-Badrashiny, M., Kholy, A. E., Eskander, R., Diab, M., Habash, N., et al. (2014). MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In Proceedings of the conference on language resources and evaluation (LREC’14), Reykjavik, Iceland (pp. 1094–1101).

  • Poesio, M. (2004). Discourse annotation and semantic annotation in the GNOME corpus. In Proceedings of the workshop on discourse annotation, association for computational linguistics, Barcelona, Spain.

  • Rafalovitch, A., & Dale, R. (2009). United nations general assembly resolutions: A six-language parallel corpus. In Proceedings of the MT Summit (Vol. 12, pp. 292–299).

  • Riesa, J., & Yarowsky, D. (2006). Minimally supervised morphological segmentation with applications to machine translation. In Proceedings of the 7th conference of the association for machine translation in the Americas (AMTA06) (pp. 185–192).

  • Saadane, H., & Habash, N. (2015). A conventional orthography for Algerian Arabic. In Proceedings of the Arabic natural language processing (ANLP) workshop, Beijing, China (p. 69).

  • Sajjad, H., Darwish, K., & Belinkov, Y. (2013). Translating dialectal Arabic to English. In Proceedings of the association for computational linguistics, Sofia, Bulgaria.

  • Salloum, W., & Habash, N. (2011). Dialectal to standard Arabic paraphrasing to improve Arabic-English statistical machine translation. In Proceedings of the first workshop on algorithms and resources for modelling of dialects and language varieties.

  • Salloum, W., & Habash, N. (2013). Dialectal Arabic to English machine translation: Pivoting through modern standard Arabic. In Proceedings of the North American chapter of the association for computational linguistics: Human language technologies (NAACL HLT’13), Atlanta, Georgia (pp. 348–358).

  • Salloum, W., & Habash, N. (2014). ADAM: Analyzer for dialectal Arabic morphology. Journal of King Saud University-Computer and Information Sciences, 26(4), 372–378.

  • Sawaf, H. (2010). Arabic dialect handling in hybrid machine translation. In Proceedings of the 9th conference of the association for machine translation in the Americas (AMTA), Denver, Colorado.

  • Shoufan, A., & Al-Ameri, S. (2015). Natural language processing for dialectical Arabic: A survey. In The Arabic natural language processing workshop 2015, Beijing, China.

  • Smrž, O. (2007). Functional Arabic morphology. Formal system and implementation. PhD Thesis, Charles University, Prague, Czech Republic.

  • Smrž, O., & Hajic, J. (2006). The other Arabic treebank: Prague dependencies and functions. In Arabic computational linguistics: Current implementations. CSLI Publications, 104.

  • Uria, L., Estarrona, A., Aldezabal, I., Aranzabe, M. J., De Ilarraza, A. D., & Iruskieta, M. (2009). Evaluation of the syntactic annotation in EPEC, the reference corpus for the processing of Basque. In A. Gelbukh (Ed.), Computational linguistics and intelligent text processing (pp. 72–85). New York: Springer.

  • Véronis, J. (1998). A study of polysemy judgements and inter-annotator agreement. In Programme and advanced papers of the Senseval workshop. Herstmonceux Castle, UK (pp. 2–4).

  • Zaidan, O. F., & Callison-Burch, C. (2011). Crowdsourcing translation: Professional quality from non-professionals. In Proceedings of the association for computational linguistics: Human language technologies (Vol. 1, pp. 1220–1229).

  • Zbib, R., Malchiodi, E., Devlin, J., Stallard, D., Matsoukas, S., Schwartz, R., et al. (2012). Machine translation of Arabic dialects. In Proceedings of the conference of the North American chapter of the association for computational linguistics: Human language technologies (NAACL-HLT’12).

  • Zribi, I., Boujelbane, R., Masmoudi, A., Ellouze, M., Belguith, L., & Habash, N. (2014). A conventional orthography for Tunisian Arabic. In Proceedings of the ninth international conference on language resources and evaluation (LREC’14), European language resources association (ELRA), Reykjavik, Iceland (pp. 2355–2361).

Acknowledgments

This work is part of our ongoing Curras project, funded by the Palestinian Ministry of Higher Education, Scientific Research Council. We wish to thank Owen Rambow, Ramy Eskander and Faisal Al-Shargi for their support with DIWAN and MADAMIRA. We would also like to thank Rami Asia for developing the Curras portal, Bahya Mustafa and Mohammad Dwaikat for their support during the annotation process, and Mahdi Arar for helpful discussions in the early stages of this work. Last but not least, we would like to thank the “Watan Aa Watar” actors for their support and for providing us with the scripts of their TV show.

Author information

Corresponding author

Correspondence to Faeq Alrimawi.

Cite this article

Jarrar, M., Habash, N., Alrimawi, F. et al. Curras: an annotated corpus for the Palestinian Arabic dialect. Lang Resources & Evaluation 51, 745–775 (2017). https://doi.org/10.1007/s10579-016-9370-7
