Improving Arabic information retrieval using word embedding similarities

International Journal of Speech Technology

Abstract

Term mismatch is a common limitation of traditional information retrieval (IR) models, in which relevance scores are estimated from exact matches between document and query terms. A good IR model should also take distinct but semantically similar words into account in the matching process. In this paper, we propose a method to incorporate word embedding (WE) semantic similarities into existing probabilistic IR models for Arabic in order to deal with term mismatch. Experiments are performed on the standard Arabic TREC collection using three neural word embedding models. The results show that the proposed extensions significantly outperform their baseline bag-of-words models, whereas the differences among the evaluated neural word embedding models are not statistically significant. Moreover, the overall comparison shows that our extensions significantly outperform the Arabic WordNet-based semantic indexing approach and three recent WE-based IR language models.
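As a rough illustration of the idea summarized above, the sketch below shows one way embedding similarities could soften exact term matching. It is not the formulation used in the paper, which this preview does not give. The function names `cosine` and `enhanced_tf`, the similarity threshold of 0.7, and the toy two-dimensional English vectors are all assumptions made for the example; a real system would load trained Arabic embeddings and feed the adjusted frequency into a probabilistic ranking function.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def enhanced_tf(query_term, doc_tokens, embeddings, threshold=0.7):
    """Hypothetical similarity-weighted term frequency.

    Starts from the exact-match count of query_term in the document and
    adds, for every other document term whose embedding is similar enough,
    that term's count weighted by the cosine similarity.
    """
    counts = Counter(doc_tokens)
    tf = float(counts[query_term])  # exact matches count fully
    q_vec = embeddings.get(query_term)
    if q_vec is None:
        return tf  # no embedding available: fall back to exact matching
    for term, count in counts.items():
        if term == query_term or term not in embeddings:
            continue
        sim = cosine(q_vec, embeddings[term])
        if sim >= threshold:  # assumed cut-off, not taken from the paper
            tf += sim * count
    return tf


# Toy vectors (assumed, for illustration only).
toy_embeddings = {
    "car": [1.0, 0.0],
    "automobile": [0.8, 0.6],
    "banana": [0.0, 1.0],
}
toy_doc = ["automobile", "automobile", "banana"]

# Exact matching would give "car" a frequency of 0 in this document;
# the similarity-weighted count credits the two "automobile" occurrences.
score = enhanced_tf("car", toy_doc, toy_embeddings)
print(round(score, 2))
```

In this toy document the query term never occurs verbatim, yet it receives a non-zero frequency because a near-synonym does. That is the term-mismatch effect the abstract describes.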

Notes

  1. http://terrier.org/download/.

  2. https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/.

  3. LDC catalog number LDC2001T55.

  4. http://www.cls.informatik.uni-leipzig.de/langs/ara.

  5. https://code.google.com/archive/p/word2vec/.

  6. https://nlp.stanford.edu/projects/glove/.

  7. https://github.com/ielab/adcs2015-NTLM.

  8. https://github.com/gdebasis/wvlm/.

  9. http://arabic.emi.ac.ma/ibtikarat/?q=Resources.

Author information

Corresponding author

Correspondence to Abdelkader El Mahdaouy.

About this article

Cite this article

El Mahdaouy, A., El Alaoui, S.O. & Gaussier, E. Improving Arabic information retrieval using word embedding similarities. Int J Speech Technol 21, 121–136 (2018). https://doi.org/10.1007/s10772-018-9492-y

  • DOI: https://doi.org/10.1007/s10772-018-9492-y
