Evaluating Methods for Building Arabic Semantic Resources with Big Corpora

Lebboss, Georges; Bernard, Gilles; Aliane, Noureddine; Abdallah, Adelle; Hajjar, Mohammad

doi:10.1007/978-3-030-16469-0_10

Part of the book series: Studies in Computational Intelligence (SCI, volume 829)

Included in the following conference series:

International Joint Conference on Computational Intelligence

Abstract

This paper presents detailed data on the workings of a system, presented in a previous work [1], that extracts semantic clusters from a large general Arabic corpus, and proposes some bases for a sounder evaluation using Arabic WordNet. In the first experiments, which used an evaluation corpus of about 8 million words and GraPaVec, a word vectorization method based on automatically generated frequency patterns, our system clustered word vectors with a Self Organizing Map neural network model and evaluated the clusters against existing Arabic WordNet synsets. We compared the results with the state-of-the-art Word2Vec and GloVe methods. As our results were astonishingly high, with no clear explanation, we present here a more thorough testing protocol: it evaluates on a much larger corpus (1.4 billion words), introduces more refined measures, with a refined definition of multiclass recall and precision, takes better account of the specifics of WordNet classification, and uses NLTK tools. Observations on the corpus are given to help researchers interested in our approach assess methods of implementation and evaluation.
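
To make the idea of scoring clusters against synsets concrete, here is a minimal sketch of one common baseline: each cluster is matched to the gold synset it overlaps most, and precision and recall are macro-averaged over clusters. This is not the refined definition the chapter proposes, which the abstract does not spell out. The function names, the transliterated toy words and the synset ids are all hypothetical.

```python
def best_match_scores(clusters, synsets):
    """Score every cluster against the gold synset it overlaps most.

    clusters, synsets: dicts mapping an id to a set of words.
    Returns {cluster_id: (synset_id, precision, recall)}.
    """
    scores = {}
    for cid, members in clusters.items():
        best_sid, best_overlap = None, 0
        for sid, gold in synsets.items():
            overlap = len(members & gold)
            if overlap > best_overlap:
                best_sid, best_overlap = sid, overlap
        if best_sid is None:
            # No synset shares a word with this cluster.
            scores[cid] = (None, 0.0, 0.0)
        else:
            precision = best_overlap / len(members)
            recall = best_overlap / len(synsets[best_sid])
            scores[cid] = (best_sid, precision, recall)
    return scores


def macro_average(scores):
    """Unweighted mean of per-cluster precision and recall."""
    if not scores:
        return 0.0, 0.0
    n = len(scores)
    precision = sum(s[1] for s in scores.values()) / n
    recall = sum(s[2] for s in scores.values()) / n
    return precision, recall


# Hypothetical toy data, not taken from the chapter or from Arabic WordNet.
clusters = {
    "c1": {"kitab", "sifr", "qalam"},
    "c2": {"bayt", "manzil"},
}
synsets = {
    "book": {"kitab", "sifr"},
    "house": {"bayt", "manzil", "dar"},
}

scores = best_match_scores(clusters, synsets)
macro_p, macro_r = macro_average(scores)
```

A word can belong to several synsets, and a best-overlap match ignores the hierarchy between synsets, so a baseline of this kind is only a starting point for the more refined measures the abstract mentions.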

Notes

1. Short vowels are not written in Arabic words in normal use and in a majority of documents.
2. Its maximal depth is given by the longest word in the corpus and its maximal breadth is given by the number of possible characters at any point. As shown by [38], in language the number of successors is constrained, so the tree quickly shrinks.
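
The bounds in note 2 can be checked on a small character tree. The sketch below assumes the tree is a plain character trie and reads "breadth" as the largest number of successors at any node. The class, the function names and the transliterated word list are hypothetical, not the chapter's implementation.

```python
class TrieNode:
    """One node of a character trie; children maps a character to a node."""

    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(words):
    """Insert every word character by character and return the root."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def max_depth(node):
    """Length of the longest path below node, i.e. the longest word."""
    if not node.children:
        return 0
    return 1 + max(max_depth(child) for child in node.children.values())


def max_breadth(node):
    """Largest number of successors found at any single node."""
    widest = len(node.children)
    for child in node.children.values():
        widest = max(widest, max_breadth(child))
    return widest


# Hypothetical toy vocabulary (transliterated), used only for illustration.
words = ["kitab", "kutub", "katib", "kataba", "maktab"]
root = build_trie(words)

depth = max_depth(root)      # equals the length of the longest word
breadth = max_breadth(root)  # bounded by the size of the alphabet
```

On this toy vocabulary the depth is 6, the length of the longest word, and the widest node has 3 successors, well below the size of the alphabet.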

References

1. Lebboss, G., Bernard, G., Aliane, N., Hajjar, M.: Towards the enrichment of Arabic WordNet with big corpora. In: Proceedings of the 9th International Joint Conference on Computational Intelligence, vol. 1, pp. 101–109 (2017)
2. Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A., Fellbaum, C.: Introducing the Arabic WordNet project. In: Sojka, P., Choi, F., Vossen, P. (eds.) Proceedings of the Third International WordNet Conference, pp. 295–300 (2006)
3. Regragui, Y., Abouenour, L., Krieche, F., Bouzoubaa, K., Rosso, P.: Arabic WordNet: new content and new applications. In: Proceedings of the Eighth Global WordNet Conference, pp. 330–338, Bucharest, Romania (2016)
4. Rodriguez, H., Farwell, D., Farreres, J., Bertran, M., Alkhalifa, M., Martí, M.A., Black, W., Elkateb, S., Kirk, J., Pease, A., Vossen, P., Fellbaum, C.: Arabic WordNet: current state and future extensions. In: Proceedings of the Fourth Global WordNet Conference, Hungary, pp. 387–405 (2008)
5. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
6. Miller, G.A., Fellbaum, C., Tengi, R., Wolff, S., Wakefield, P., Langone, H., Haskell, B.: WordNet 2.1. Cognitive Science Laboratory, Princeton University (2005)
7. Lebboss, G.: Contribution à l’analyse sémantique des textes arabes. Ph.D. thesis, University Paris 8, France
8. Al-Barhamtoshy, H.M., Al-Jideebi, W.H.: Designing and implementing Arabic WordNet semantic-based. In: The 9th Conference on Language Engineering, pp. 23–24 (2009)
9. Vossen, P.: EuroWordNet: a multilingual database of autonomous and language-specific WordNets connected via an inter-lingual index. Int. J. Lexicogr. 17(2), 161–173 (2004)
10. Alkhalifa, M., Rodriguez, H.: Automatically extending named entities coverage of Arabic WordNet using Wikipedia. Int. J. Inf. Commun. Technol. 1(1), 1–17 (2008)
11. Abouenour, L., Bouzoubaa, K., Rosso, P.: Improving Q/A using Arabic WordNet. In: Proceedings of the 2008 International Arab Conference on Information Technology (ACIT’2008), Tunisia (2008)
12. Niles, I., Pease, A.: Linking lexicons and ontologies: mapping WordNet to the suggested upper merged ontology. In: Proceedings of the International Conference on Information and Knowledge Engineering (IKE ’03), vol. 2, pp. 412–416, Las Vegas, Nevada, USA (2003)
13. Abouenour, L., Bouzoubaa, K., Rosso, P.: Using the Yago ontology as a resource for the enrichment of named entities in Arabic WordNet. In: Proceedings of The 7th International Conference on Language Resources and Evaluation (LREC 2010) Workshop on Language Resources and Human Language Technology for Semitic Languages, pp. 27–31 (2010)
14. Abouenour, L., Bouzoubaa, K., Rosso, P.: On the evaluation and improvement of Arabic WordNet coverage and usability. Lang. Resour. Eval. 47(3), 891–917 (2013)
15. Abdulhay, A.: Constitution d’une ressource sémantique arabe à partir d’un corpus multilingue aligné. Ph.D. thesis, Université de Grenoble (2012)
16. Al Hajjar, A.E.S.: Extraction et gestion de l’information à partir des documents arabes. Ph.D. thesis, University of Paris 8 (2010)
17. Hajjar, M., Al Hajjar, A.E.S., Abdel Nabi, Z., Lebboss, G.: Semantic enrichment of the iSPEDAL corpus. In: 3rd World Conference on Innovation and Computer Science (INSODE) (2013)
18. Abdelali, B., Tlili-Guiassa, Y.: Extraction des relations sémantiques à partir du Wiktionnaire arabe. Revue RIST 20(2), 47–56 (2013)
19. Raafat, H., Zahran, M., Rashwan, M.: Arabase: a database combining different Arabic resources with lexical and semantic information. In: Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, pp. 233–240, Scitepress (2013)
20. Benabdallah, A., Abderrahim, M.A., Abderrahim, M.E.A.: Extraction of terms and semantic relationships from Arabic texts for automatic construction of an ontology. Int. J. Speech Technol. 20, 289 (2017). https://doi.org/10.1007/s10772-017-9405-5
21. El Moatez, N., Didier, D.: Semantic similarity of Arabic sentences with word embeddings. In: Proceedings of the Third Arabic Natural Language Processing Workshop, pp. 18–24, Valencia (2017)
22. Gahbiche-Braham, S., Bonneau-Maynard, H., Lavergne, T., Yvon, F.: Joint segmentation and POS tagging for Arabic using a CRF-based classifier. In: LREC, pp. 2107–2113 (2012)
23. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (1995)
24. Harris, Z.S.: Distributional structure. Word 10(2–3), 146–162 (1954)
25. Salton, G., Wong, A., Yang, C.-S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975)
26. Lund, K., Burgess, C., Atchley, R.A.: Semantic and associative priming in high-dimensional semantic space. In: Proceedings of the 17th Annual Conference of the Cognitive Science Society, vol. 17, pp. 660–665 (1995)
27. Honkela, T., Kaski, S., Lagus, K., Kohonen, T.: WEBSOM: self-organizing maps of document collections. In: Proceedings of WSOM’97, Workshop on Self-Organizing Maps, Espoo, Finland, pp. 310–315, Helsinki University of Technology (1997)
28. Bernard, G.: Experiments on distributional categorization of lexical items with Self Organizing Maps. In: International Workshop on Self Organizing Maps WSOM’97, pp. 304–309 (1997)
29. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representation, Workshop Track, p. 1301 (2013)
30. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. EMNLP 14, 1532–1543 (2014)
31. Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423–430, Association for Computational Linguistics (2003)
32. Green, S., Manning, C.D.: Better Arabic parsing: baselines, evaluations, and analysis. In: COLING (2010)
33. Khoja, S., Garside, R., Knowles, G.: An Arabic tagset for the morphosyntactic tagging of Arabic. In: A Rainbow of Corpora: Corpus Linguistics and the Languages of the World, vol. 13, pp. 341–350 (2001)
34. Khoja, S., Garside, R.: Stemming Arabic Text. Computing Department, Lancaster University, Lancaster (1999)
35. Aljlayl, M., Frieder, O.: On Arabic search: improving the retrieval effectiveness via a light stemming approach. In: Proceedings of ACM Eleventh Conference on Information and Knowledge Management, McLean, VA (2002)
36. Hammo, B., Abu-Salem, H., Lytinen, S., Evens, M.: A question answering system to support the Arabic Language. In: Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages, Philadelphia, Pennsylvania, pp. 1–11 (2002)
37. Nwesri, A.F.A., Tahaghoghi, S.M.M., Scholer, F.: Capturing out-of-vocabulary words in Arabic text. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Sydney, Australia (2006)
38. Harris, Z.S.: Mathematical Structures of Language. Wiley, New York (1968)
39. Morrison, D.R.: PATRICIA: practical algorithm to retrieve information coded in alphanumeric. J. ACM 15(4), 514–534 (1968)
40. Takagi, T., Inenaga, S., Sadakane, K., Arimura, H.: Packed compact tries: a fast and efficient data structure for online string processing. In: arXiv.org (2016). https://doi.org/10.1587/transfun.E100.A.1785, arXiv:1602.00422
41. Bird, S., Loper, E., Klein, E.: Natural Language Processing with Python. O’Reilly Media Inc. (2009)
42. Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, pp. 133–138, Stroudsburg, PA, USA (1994)
43. Goldberg, Y.: On the importance of comparing apples to apples: a case study using the GloVe model. Google docs (2014)
44. Levy, O., Goldberg, Y., Dagan, I.: Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. 3, 211–225 (2015)

Author information

Authors and Affiliations

LIASD, Paris 8 University, Saint-Denis, France
Georges Lebboss, Gilles Bernard, Noureddine Aliane & Adelle Abdallah
GRIT, Lebanese University, Saida, Lebanon
Mohammad Hajjar

Corresponding author

Correspondence to Georges Lebboss.

Editor information

Editors and Affiliations

IUT Sénart, Université Paris-Est Créteil (UPEC), Créteil, France
Christophe Sabourin
Department of Computer Architecture and Technology, University of Granada, Granada, Spain
Juan Julian Merelo
Université Paris-Est Créteil (UPEC), Créteil, France
Kurosh Madani
University of Reading, Reading, UK
Kevin Warwick

About this paper

Cite this paper

Lebboss, G., Bernard, G., Aliane, N., Abdallah, A., Hajjar, M. (2019). Evaluating Methods for Building Arabic Semantic Resources with Big Corpora. In: Sabourin, C., Merelo, J.J., Madani, K., Warwick, K. (eds) Computational Intelligence. IJCCI 2017. Studies in Computational Intelligence, vol 829. Springer, Cham. https://doi.org/10.1007/978-3-030-16469-0_10

DOI: https://doi.org/10.1007/978-3-030-16469-0_10
Published: 30 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16468-3
Online ISBN: 978-3-030-16469-0
eBook Packages: Intelligent Technologies and Robotics (R0)
