Xword: A Multi-lingual Framework for Expanding Words

  • Conference paper
Emerging Trends in Intelligent Computing and Informatics (IRICT 2019)

Abstract

Word expansion is applicable to information retrieval and question answering systems, where it alleviates the vocabulary mismatch problem and thereby leads to higher recall. Recent word embedding models have shown merit for the word expansion task compared with traditional n-gram models. However, acquiring quality embeddings for each language requires corpus compilation, normalization and parameter tuning, which are time-consuming and challenging, especially for low-resource languages such as Arabic. In this paper, we introduce Xword, an online multi-lingual framework for automatic word expansion. Xword relies on both pre-trained ad hoc word embedding models and n-gram models for the expansion task. It currently covers two languages, Arabic and German. Xword presents the results of each model both individually and collectively. Additionally, Xword can filter the result set by the sentiment and part-of-speech (POS) tag of each word. Xword is available as a Web API, along with downloadable models and documentation, on our public GitHub.
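The pipeline outlined in the abstract (per-model expansion, a combined result set, then filtering by POS tag and sentiment) can be illustrated with a minimal sketch. This is not Xword's actual API or data: the toy vectors, the n-gram neighbour table, the lexicon, and all function names below are hypothetical and chosen only for illustration. The embedding model is approximated by cosine similarity over hand-written vectors.

```python
import math

# Hypothetical toy vectors; a real system would load pre-trained embeddings.
EMBEDDINGS = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "fine":  [0.7, 0.3, 0.0],
    "bad":   [-0.8, 0.1, 0.1],
    "car":   [0.0, 0.1, 0.9],
}

# Hypothetical n-gram neighbours (words observed next to the query word).
NGRAM_NEIGHBOURS = {"good": ["fine", "car"]}

# Hypothetical annotations: word -> (POS tag, sentiment polarity).
LEXICON = {
    "great": ("ADJ", "positive"),
    "fine":  ("ADJ", "positive"),
    "bad":   ("ADJ", "negative"),
    "car":   ("NOUN", "neutral"),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_embedding(word, k=3):
    """Return the k words whose vectors are closest to the query word."""
    if word not in EMBEDDINGS:
        return []
    target = EMBEDDINGS[word]
    scored = [(cosine(target, vec), other)
              for other, vec in EMBEDDINGS.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def expand_ngram(word):
    """Return the n-gram neighbours recorded for the query word."""
    return list(NGRAM_NEIGHBOURS.get(word, []))


def expand(word, k=3, pos=None, sentiment=None):
    """Return per-model results and a combined, optionally filtered, list."""
    per_model = {
        "embedding": expand_embedding(word, k),
        "ngram": expand_ngram(word),
    }
    # Merge the model outputs, keeping first-seen order and dropping duplicates.
    combined = []
    for candidates in per_model.values():
        for candidate in candidates:
            if candidate not in combined:
                combined.append(candidate)

    def keep(candidate):
        tag, polarity = LEXICON.get(candidate, (None, None))
        return ((pos is None or tag == pos)
                and (sentiment is None or polarity == sentiment))

    return per_model, [c for c in combined if keep(c)]


if __name__ == "__main__":
    per_model, combined = expand("good", k=2, pos="ADJ", sentiment="positive")
    print(per_model)
    print(combined)
```

Returning the per-model lists alongside the merged list mirrors the abstract's statement that results are shown both individually and collectively. The filter is applied only to the merged list here, which is a design choice of this sketch rather than something the abstract specifies.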

Notes

  1. https://www.wikipedia.org/
  2. http://alshargi.us/de.aspx
  3. http://alshargi.us/ara.aspx
  4. https://github.com/alshargi/xword
  5. http://nlp.stanford.edu/software/tagger.shtml
  6. https://camel.abudhabi.nyu.edu/madamira/
  7. https://stanfordnlp.github.io/CoreNLP
  8. http://wortschatz.uni-leipzig.de/en/download/
  9. https://fh295.github.io/simlex.html
  10. http://www.leviants.com/ira.leviant/MultilingualVSMdata.html

Author information

Corresponding author

Correspondence to Faisal Alshargi.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Alshargi, F., Shekarpour, S., Alromema, W. (2020). Xword: A Multi-lingual Framework for Expanding Words. In: Saeed, F., Mohammed, F., Gazem, N. (eds) Emerging Trends in Intelligent Computing and Informatics. IRICT 2019. Advances in Intelligent Systems and Computing, vol 1073. Springer, Cham. https://doi.org/10.1007/978-3-030-33582-3_16
