Evaluating Methods for Building Arabic Semantic Resources with Big Corpora

  • Conference paper
Computational Intelligence (IJCCI 2017)

Part of the book series: Studies in Computational Intelligence ((SCI,volume 829))

Abstract

This paper presents detailed data on the workings of a system, presented in a previous work [1], that extracts semantic clusters from a large general Arabic corpus, and proposes some bases for better evaluation using Arabic WordNet. In the first experiments, using an evaluation corpus of about 8 million words and GraPaVec, a method for word vectorization based on automatically generated frequency patterns, our system clustered word vectors in a Self-Organizing Map neural network model and evaluated them against existing Arabic WordNet synsets. We compared the results with the state-of-the-art Word2Vec and GloVe methods. As our results were astonishingly high, without clear explanation, we present here a more thorough testing protocol: evaluating with a much larger corpus (1.4 billion words), introducing more refined measures, including a refined definition of multiclass recall and precision, taking better account of the specifics of WordNet classification, and using NLTK tools. Observations on the corpus are given to help researchers interested in our approach assess methods of implementation and evaluation.

Notes

  1. Short vowels are not written in Arabic words in normal use and in a majority of documents.

  2. Its maximal depth is given by the longest word in the corpus and its maximal breadth is given by the number of possible characters at any point. As shown by [38], in language the number of successors is constrained, so the tree quickly shrinks.
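
The depth and breadth properties in note 2 can be illustrated with a minimal character-tree sketch. This is not the authors' implementation: the class and function names (TrieNode, build_trie, max_depth, max_breadth) and the toy transliterated word list are assumptions made for illustration only.

```python
from typing import Dict, Iterable


class TrieNode:
    """One node of a character tree; children are keyed by the next character."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(words: Iterable[str]) -> TrieNode:
    """Insert every word character by character, sharing common prefixes."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def max_depth(node: TrieNode) -> int:
    """Number of edges on the longest path from this node down to a leaf."""
    if not node.children:
        return 0
    return 1 + max(max_depth(child) for child in node.children.values())


def max_breadth(node: TrieNode) -> int:
    """Largest number of successors found at any single node of the subtree."""
    own = len(node.children)
    return max([own] + [max_breadth(child) for child in node.children.values()])


# Hypothetical toy vocabulary (Latin transliterations, not corpus data).
words = ["kitab", "katib", "kutub", "maktab", "maktaba"]
trie = build_trie(words)

depth = max_depth(trie)      # bounded by the longest word
breadth = max_breadth(trie)  # bounded by the size of the character set
print(depth, breadth)
```

On this toy list the depth equals the length of the longest word, while the breadth stays far below the size of the alphabet because few characters actually follow any given prefix.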

References

  1. Lebboss, G., Bernard, G., Aliane, N., Hajjar, M.: Towards the enrichment of Arabic WordNet with big corpora. In: Proceedings of the 9th International Joint Conference on Computational Intelligence, vol. 1, pp. 101–109 (2017)

  2. Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A., Fellbaum, C.: Introducing the Arabic WordNet project. In: Sojka, P., Choi, F., Vossen, P. (eds.) Proceedings of the Third International WordNet Conference, pp. 295–300 (2006)

  3. Regragui, Y., Abouenour, L., Krieche, F., Bouzoubaa, K., Rosso, P.: Arabic WordNet: new content and new applications. In: Proceedings of the Eighth Global WordNet Conference, pp. 330–338, Bucharest, Romania (2016)

  4. Rodriguez, H., Farwell, D., Farreres, J., Bertran, M., Alkhalifa, M., Martí, M.A., Black, W., Elkateb, S., Kirk, J., Pease, A., Vossen, P., Fellbaum, C.: Arabic WordNet: current state and future extensions. In: Proceedings of the Fourth Global WordNet Conference, Hungary, pp. 387–405 (2008)

  5. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)

  6. Miller, G.A., Fellbaum, C., Tengi, R., Wolff, S., Wakefield, P., Langone, H., Haskell, B.: WordNet 2.1. Cognitive Science Laboratory, Princeton University (2005)

  7. Lebboss, G.: Contribution à l’analyse sémantique des textes arabes. Ph.D. thesis, University Paris 8, France

  8. Al-Barhamtoshy, H.M., Al-Jideebi, W.H.: Designing and implementing Arabic WordNet semantic-based. In: The 9th Conference on Language Engineering, pp. 23–24 (2009)

  9. Vossen, P.: EuroWordNet: a multilingual database of autonomous and language-specific WordNets connected via an inter-lingual index. Int. J. Lexicogr. 17(2), 161–173 (2004)

  10. Alkhalifa, M., Rodriguez, H.: Automatically extending named entities coverage of Arabic WordNet using Wikipedia. Int. J. Inf. Commun. Technol. 1(1), 1–17 (2008)

  11. Abouenour, L., Bouzoubaa, K., Rosso, P.: Improving Q/A using Arabic WordNet. In: Proceedings of the 2008 International Arab Conference on Information Technology (ACIT’2008), Tunisia (2008)

  12. Niles, I., Pease, A.: Linking lexicons and ontologies: mapping WordNet to the suggested upper merged ontology. In: Proceedings of the International Conference on Information and Knowledge Engineering (IKE ’03), vol. 2, pp. 412–416, Las Vegas, Nevada, USA (2003)

  13. Abouenour, L., Bouzoubaa, K., Rosso, P.: Using the Yago ontology as a resource for the enrichment of named entities in Arabic WordNet. In: Proceedings of The 7th International Conference on Language Resources and Evaluation (LREC 2010) Workshop on Language Resources and Human Language Technology for Semitic Languages, pp. 27–31 (2010)

  14. Abouenour, L., Bouzoubaa, K., Rosso, P.: On the evaluation and improvement of Arabic WordNet coverage and usability. Lang. Resour. Eval. 47(3), 891–917 (2013)

  15. Abdulhay, A.: Constitution d’une ressource sémantique arabe à partir d’un corpus multilingue aligné. Ph.D. thesis, Université de Grenoble (2012)

  16. Al Hajjar, A.E.S.: Extraction et gestion de l’information à partir des documents arabes. Ph.D. thesis, University of Paris 8 (2010)

  17. Hajjar, M., Al Hajjar, A.E.S., Abdel Nabi, Z., Lebboss, G.: Semantic enrichment of the iSPEDAL corpus. In: 3rd World Conference on Innovation and Computer Science (INSODE) (2013)

  18. Abdelali, B., Tlili-Guiassa, Y.: Extraction des relations sémantiques à partir du Wiktionnaire arabe. Revue RIST 20(2), 47–56 (2013)

  19. Raafat, H., Zahran, M., Rashwan, M.: Arabase: a database combining different Arabic resources with lexical and semantic information. In: Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, pp. 233–240, Scitepress (2013)

  20. Benabdallah, A., Abderrahim, M.A., Abderrahim, M.E.A.: Extraction of terms and semantic relationships from Arabic texts for automatic construction of an ontology. Int. J. Speech Technol. 20, 289 (2017). https://doi.org/10.1007/s10772-017-9405-5

  21. El Moatez, N., Didier, D.: Semantic similarity of Arabic sentences with word embeddings. In: Proceedings of the Third Arabic Natural Language Processing Workshop, pp. 18–24, Valencia (2017)

  22. Gahbiche-Braham, S., Bonneau-Maynard, H., Lavergne, T., Yvon, F.: Joint segmentation and POS tagging for Arabic using a CRF-based classifier. In: LREC, pp. 2107–2113 (2012)

  23. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (1995)

  24. Harris, Z.S.: Distributional structure. Word 10(2–3), 146–162 (1954)

  25. Salton, G., Wong, A., Yang, C.-S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975)

  26. Lund, K., Burgess, C., Atchley, R.A.: Semantic and associative priming in high-dimensional semantic space. In: Proceedings of the 17th Annual Conference of the Cognitive Science Society, vol. 17, pp. 660–665 (1995)

  27. Honkela, T., Kaski, T., Lagus, K., Kohonen, T.: WEBSOM—self-organizing maps of document collections. In: Proceedings of WSOM’97, Workshop on Self-Organizing Maps, Espoo, Finland, pp. 310–315, Helsinki University of Technology (1997)

  28. Bernard, G.: Experiments on distributional categorization of lexical items with Self Organizing Maps. In: International Workshop on Self Organizing Maps WSOM’97, pp. 304–309 (1997)

  29. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations, Workshop Track, p. 1301 (2013)

  30. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. EMNLP 14, 1532–1543 (2014)

  31. Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423–430, Association for Computational Linguistics (2003)

  32. Green, S., Manning, C.D.: Better Arabic parsing: baselines, evaluations, and analysis. In: COLING (2010)

  33. Khoja, S., Garside, R., Knowles, G.: An Arabic tagset for the morphosyntactic tagging of Arabic. A Rainbow Corpora Corpus Linguist. Lang. World 13, 341–350 (2001)

  34. Khoja, S., Garside, R.: Stemming Arabic Text. Computing Department, Lancaster University, Lancaster (1999)

  35. Aljlayl, M., Frieder, O.: On Arabic search: improving the retrieval effectiveness via a light stemming approach. In: Proceedings of ACM Eleventh Conference on Information and Knowledge Management, McLean, VA (2002)

  36. Hammo, B., Abu-Salem, H., Lytinen, S., Evens, M.: A question answering system to support the Arabic Language. In: Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages, Philadelphia, Pennsylvania, pp. 1–11 (2002)

  37. Nwesri, A.F.A., Tahaghoghi, S.M.M., Scholer, F.: Capturing out-of-vocabulary words in Arabic text. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Sydney, Australia (2006)

  38. Harris, Z.S.: Mathematical Structures of Language. Wiley, New York (1968)

  39. Morrison, D.R.: PATRICIA: practical algorithm to retrieve information coded in alphanumeric. J. ACM 15(4), 514–534 (1968)

  40. Takagi, T., Inenaga, S., Sadakane, K., Arimura, H.: Packed compact tries: a fast and efficient data structure for online string processing. In: arXiv.org (2016). https://doi.org/10.1587/transfun.E100.A.1785, arXiv:1602.00422

  41. Bird, S., Loper, E., Klein, E.: Natural Language Processing with Python. O’Reilly Media Inc. (2009)

  42. Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, pp. 133–138, Stroudsburg, PA, USA (1994)

  43. Goldberg, Y.: On the importance of comparing apples to apples: a case study using the GloVe model. Google docs (2014)

  44. Levy, O., Goldberg, Y., Dagan, I.: Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. 3, 211–225 (2015)

Author information

Correspondence to Georges Lebboss.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Lebboss, G., Bernard, G., Aliane, N., Abdallah, A., Hajjar, M. (2019). Evaluating Methods for Building Arabic Semantic Resources with Big Corpora. In: Sabourin, C., Merelo, J.J., Madani, K., Warwick, K. (eds) Computational Intelligence. IJCCI 2017. Studies in Computational Intelligence, vol 829. Springer, Cham. https://doi.org/10.1007/978-3-030-16469-0_10
