Abstract
This paper presents detailed data on the workings of a system, introduced in previous work [1], that extracts semantic clusters from a large general Arabic corpus, and proposes a basis for better evaluation using Arabic WordNet. In the first experiments, using an evaluation corpus of about 8 million words and GraPaVec, a word vectorization method based on automatically generated frequency patterns, our system clustered word vectors in a Self Organizing Map neural network model and evaluated the clusters against existing Arabic WordNet synsets. We compared the results with the state-of-the-art Word2Vec and GloVe methods. Because our results were astonishingly high and lacked a clear explanation, we present here a more thorough testing protocol: evaluation with a much larger corpus (1.4 billion words), more refined measures including a refined definition of multiclass recall and precision, better account of the specifics of WordNet classification, and use of NLTK tools. Observations on the corpus are given to help researchers interested in our approach assess methods of implementation and evaluation.
Notes
1. Short vowels are not written in Arabic words in normal use and in a majority of documents.
2. Its maximal depth is given by the longest word in the corpus and its maximal breadth is given by the number of possible characters at any point. As shown by [38], in language the number of successors is constrained, so the tree quickly shrinks.
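The bounds stated in note 2 can be illustrated with a minimal sketch, assuming the tree in question is a plain character trie (prefix tree). The class and function names and the transliterated toy word list below are hypothetical illustrations and are not taken from the chapter or its corpus.

```python
class TrieNode:
    """One node per character position; children are keyed by character."""
    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(words):
    """Insert every word character by character, sharing common prefixes."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def depth(node):
    """Number of edges on the longest root-to-leaf path."""
    if not node.children:
        return 0
    return 1 + max(depth(child) for child in node.children.values())


def max_breadth(node):
    """Largest number of successors found at any single node."""
    here = len(node.children)
    return max([here] + [max_breadth(c) for c in node.children.values()])


def mean_successors_per_level(root):
    """Average number of successors per node, level by level from the root."""
    averages = []
    frontier = [root]
    while frontier:
        next_frontier = [c for n in frontier for c in n.children.values()]
        if not next_frontier:
            break
        averages.append(len(next_frontier) / len(frontier))
        frontier = next_frontier
    return averages


# Hypothetical toy vocabulary (Latin transliterations, for illustration only).
toy_words = ["kitab", "kutub", "katib", "kataba", "maktab", "maktaba"]
alphabet = set("".join(toy_words))

trie = build_trie(toy_words)
trie_depth = depth(trie)
trie_breadth = max_breadth(trie)
branching = mean_successors_per_level(trie)
```

On this toy list the depth equals the length of the longest word, no node has more successors than there are distinct characters, and the average number of successors falls below one at the deepest levels, which is the shrinking behaviour the note describes. A real corpus would of course give different figures.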
References
Lebboss, G., Bernard, G., Aliane, N., Hajjar, M.: Towards the enrichment of Arabic WordNet with big corpora. In: Proceedings of the 9th International Joint Conference on Computational Intelligence, vol. 1, pp. 101–109 (2017)
Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A., Fellbaum, C.: Introducing the Arabic WordNet project. In: Sojka, P., Choi, F., Vossen, P. (eds.) Proceedings of the Third International WordNet Conference, pp. 295–300 (2006)
Regragui, Y., Abouenour, L., Krieche, F., Bouzoubaa, K., Rosso, P.: Arabic WordNet: new content and new applications. In: Proceedings of the Eighth Global WordNet Conference, pp. 330–338, Bucharest, Romania (2016)
Rodriguez, H., Farwell, D., Farreres, J., Bertran, M., Alkhalifa, M., Martí, M.A., Black, W., Elkateb, S., Kirk, J., Pease, A., Vossen, P., Fellbaum, C.: Arabic WordNet: current state and future extensions. In: Proceedings of the Fourth Global WordNet Conference, Hungary, pp. 387–405 (2008)
Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
Miller, G.A., Fellbaum, C., Tengi, R., Wolff, S., Wakefield, P., Langone, H., Haskell, B.: WordNet 2.1. Cognitive Science Laboratory, Princeton University (2005)
Lebboss, G.: Contribution à l’analyse sémantique des textes arabes. Ph.D. thesis, University Paris 8, France
Al-Barhamtoshy, H.M., Al-Jideebi, W.H.: Designing and implementing Arabic WordNet semantic-based. In: The 9th Conference on Language Engineering, pp. 23–24 (2009)
Vossen, P.: EuroWordNet: a multilingual database of autonomous and language-specific WordNets connected via an inter-lingual index. Int. J. Lexicogr. 17(2), 161–173 (2004)
Alkhalifa, M., Rodriguez, H.: Automatically extending named entities coverage of Arabic WordNet using Wikipedia. Int. J. Inf. Commun. Technol. 1(1), 1–17 (2008)
Abouenour, L., Bouzoubaa, K., Rosso, P.: Improving Q/A using Arabic WordNet. In: Proceedings of the 2008 International Arab Conference on Information Technology (ACIT’2008), Tunisia (2008)
Niles, I., Pease, A.: Linking lexicons and ontologies: mapping WordNet to the suggested upper merged ontology. In: Proceedings of the International Conference on Information and Knowledge Engineering (IKE ’03), vol. 2, pp. 412–416, Las Vegas, Nevada, USA (2003)
Abouenour, L., Bouzoubaa, K., Rosso, P.: Using the Yago ontology as a resource for the enrichment of named entities in Arabic WordNet. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010) Workshop on Language Resources and Human Language Technology for Semitic Languages, pp. 27–31 (2010)
Abouenour, L., Bouzoubaa, K., Rosso, P.: On the evaluation and improvement of Arabic WordNet coverage and usability. Lang. Resour. Eval. 47(3), 891–917 (2013)
Abdulhay, A.: Constitution d’une ressource sémantique arabe à partir d’un corpus multilingue aligné. Ph.D. thesis, Université de Grenoble (2012)
Al Hajjar, A.E.S.: Extraction et gestion de l’information à partir des documents arabes. Ph.D. thesis, University of Paris 8 (2010)
Hajjar, M., Al Hajjar, A.E.S., Abdel Nabi, Z., Lebboss, G.: Semantic enrichment of the iSPEDAL corpus. In: 3rd World Conference on Innovation and Computer Science (INSODE) (2013)
Abdelali, B., Tlili-Guiassa, Y.: Extraction des relations sémantiques à partir du Wiktionnaire arabe. Revue RIST 20(2), 47–56 (2013)
Raafat, H., Zahran, M., Rashwan, M.: Arabase: a database combining different Arabic resources with lexical and semantic information. In: Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, pp. 233–240, Scitepress (2013)
Benabdallah, A., Abderrahim, M.A., Abderrahim, M.E.A.: Extraction of terms and semantic relationships from Arabic texts for automatic construction of an ontology. Int. J. Speech Technol. 20, 289 (2017). https://doi.org/10.1007/s10772-017-9405-5
El Moatez, N., Didier, D.: Semantic similarity of Arabic sentences with word embeddings. In: Proceedings of the Third Arabic Natural Language Processing Workshop, pp. 18–24, Valencia (2017)
Gahbiche-Braham, S., Bonneau-Maynard, H., Lavergne, T., Yvon, F.: Joint segmentation and POS tagging for Arabic using a CRF-based classifier. In: LREC, pp. 2107–2113 (2012)
Kohonen, T.: Self-Organizing Maps. Springer, Berlin (1995)
Harris, Z.S.: Distributional structure. Word 10(2–3), 146–162 (1954)
Salton, G., Wong, A., Yang, C.-S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975)
Lund, K., Burgess, C., Atchley, R.A.: Semantic and associative priming in high-dimensional semantic space. In: Proceedings of the 17th Annual Conference of the Cognitive Science Society, vol. 17, pp. 660–665 (1995)
Honkela, T., Kaski, T., Lagus, K., Kohonen, T.: WEBSOM—self-organizing maps of document collections. In: Proceedings of WSOM’97, Workshop on Self-Organizing Maps, Espoo, Finland, pp. 310–315, Helsinki University of Technology (1997)
Bernard, G.: Experiments on distributional categorization of lexical items with Self Organizing Maps. In: International Workshop on Self Organizing Maps WSOM’97, pp. 304–309 (1997)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representation, Workshop Track, p. 1301 (2013)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. EMNLP 14, 1532–1543 (2014)
Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423–430, Association for Computational Linguistics (2003)
Green, S., Manning, C.D.: Better Arabic parsing: baselines, evaluations, and analysis. In: COLING (2010)
Khoja, S., Garside, R., Knowles, G.: An Arabic tagset for the morphosyntactic tagging of Arabic. A Rainbow of Corpora: Corpus Linguistics and the Languages of the World 13, 341–350 (2001)
Khoja, S., Garside, R.: Stemming Arabic Text. Computing Department, Lancaster University, Lancaster (1999)
Aljlayl, M., Frieder, O.: On Arabic search: improving the retrieval effectiveness via a light stemming approach. In: Proceedings of ACM Eleventh Conference on Information and Knowledge Management, McLean, VA (2002)
Hammo, B., Abu-Salem, H., Lytinen, S., Evens, M.: A question answering system to support the Arabic Language. In: Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages, Philadelphia, Pennsylvania, pp. 1–11 (2002)
Nwesri, A.F.A., Tahaghoghi, S.M.M., Scholer, F.: Capturing out-of-vocabulary words in Arabic text. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Sydney, Australia (2006)
Harris, Z.S.: Mathematical Structures of Language. Wiley, New York (1968)
Morrison, D.R.: PATRICIA: practical algorithm to retrieve information coded in alphanumeric. J. ACM 15(4), 514–534 (1968)
Takagi, T., Inenaga, S., Sadakane, K., Arimura, H.: Packed compact tries: a fast and efficient data structure for online string processing. arXiv:1602.00422 (2016). https://doi.org/10.1587/transfun.E100.A.1785
Bird, S., Loper, E., Klein, E.: Natural Language Processing with Python. O’Reilly Media Inc. (2009)
Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, pp. 133–138, Stroudsburg, PA, USA (1994)
Goldberg, Y.: On the importance of comparing apples to apples: a case study using the GloVe model. Google docs (2014)
Levy, O., Goldberg, Y., Dagan, I.: Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. 3, 211–225 (2015)
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Lebboss, G., Bernard, G., Aliane, N., Abdallah, A., Hajjar, M. (2019). Evaluating Methods for Building Arabic Semantic Resources with Big Corpora. In: Sabourin, C., Merelo, J.J., Madani, K., Warwick, K. (eds) Computational Intelligence. IJCCI 2017. Studies in Computational Intelligence, vol 829. Springer, Cham. https://doi.org/10.1007/978-3-030-16469-0_10
DOI: https://doi.org/10.1007/978-3-030-16469-0_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16468-3
Online ISBN: 978-3-030-16469-0
eBook Packages: Intelligent Technologies and Robotics (R0)