Abstract
In this paper, we present a vision for a comprehensive, unified lexical resource for the computational processing of Arabic, covering as many of its variants as possible. We review the current state of three existing resources and then propose a method to link and augment them so that they become more useful for natural language processing (NLP), whether the target is an enabling technology such as part-of-speech tagging or parsing, or an application such as machine translation or information extraction. The unified resource, Tharawat (meaning "treasures"), extends our core resource Tharwa, a three-way computational lexicon of Dialectal Arabic, Modern Standard Arabic, and English lemma correspondents. Tharawat will incorporate two other existing resources: SANA, our Arabic sentiment lexicon, and MuSTalAHAt, a multiword expression (MWE) counterpart of Tharwa that lists MWEs and their correspondents rather than lemmas. Moreover, we present a roadmap for linking Tharawat to existing English resources and corpora using advanced machine learning techniques and crowdsourcing methods. Such resources are at the core of NLP technologies, and we believe that a resource of this kind could lead to significant advances in Arabic NLP. Having such resources for a language like Arabic could strongly support the development of advanced scientific material and hence contribute to an Arabic scientific and economic revolution.
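To make the linking idea concrete, the following is a minimal sketch of how a three-way lemma entry might be joined with sentiment labels and MWEs. It is not the actual Tharawat schema or linking method: the class names, field names, the choice to key sentiment on the MSA lemma, the token-match rule for MWEs, and all sample values are hypothetical placeholders introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LemmaEntry:
    """Hypothetical three-way correspondence: dialectal Arabic, MSA, English."""
    da: str
    msa: str
    en: str


@dataclass
class UnifiedEntry:
    """Hypothetical unified record: a lemma triple plus linked annotations."""
    lemma: LemmaEntry
    sentiment: Optional[str] = None                 # label from a sentiment lexicon, if any
    mwes: List[str] = field(default_factory=list)   # MWEs that contain the dialectal lemma


def link(lemmas: List[LemmaEntry],
         sentiment: Dict[str, str],
         mwes: List[str]) -> List[UnifiedEntry]:
    """Join the three resources.

    Assumptions (not from the paper): sentiment is keyed by MSA lemma, and an
    MWE is attached to a lemma when the dialectal lemma appears as one of its
    whitespace-separated tokens.
    """
    unified = []
    for lemma in lemmas:
        unified.append(UnifiedEntry(
            lemma=lemma,
            sentiment=sentiment.get(lemma.msa),
            mwes=[m for m in mwes if lemma.da in m.split()],
        ))
    return unified


# Placeholder data only; these are not real lexicon entries.
lemmas = [
    LemmaEntry("da_lemma_1", "msa_lemma_1", "en_lemma_1"),
    LemmaEntry("da_lemma_2", "msa_lemma_2", "en_lemma_2"),
]
sentiment = {"msa_lemma_1": "positive"}
mwes = ["da_lemma_1 da_lemma_2"]

unified = link(lemmas, sentiment, mwes)
for entry in unified:
    print(entry.lemma.en, entry.sentiment, entry.mwes)
```

A real linking step would presumably rely on lemma and part-of-speech matching rather than exact string lookup; the dictionary join here only illustrates the shape of the unified record.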
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Diab, M. (2015). Tharawat: A Vision for a Comprehensive Resource for Arabic Computational Processing. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2015. Lecture Notes in Computer Science, vol 9041. Springer, Cham. https://doi.org/10.1007/978-3-319-18111-0_7
DOI: https://doi.org/10.1007/978-3-319-18111-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18110-3
Online ISBN: 978-3-319-18111-0
eBook Packages: Computer Science (R0)