Abstract
Word expansion is useful in information retrieval and question answering systems: it alleviates the vocabulary mismatch problem and thereby improves recall. Recent word embedding models have demonstrated merit for word expansion compared with traditional n-gram models. However, acquiring quality embeddings for each language requires corpus compilation, normalization, and parameter tuning, which are time-consuming and challenging, especially for low-resource languages such as Arabic. In this paper, we introduce Xword, an online multi-lingual framework for automatic word expansion. Xword relies on both pre-trained ad hoc word embedding models and n-gram models for the expansion task. It currently supports two languages, Arabic and German. Xword presents the results of each model both individually and collectively. Additionally, it can filter the result set by the sentiment and part-of-speech (POS) tag of each word. Xword is available as a Web API, along with downloadable models and documentation, on our public GitHub.
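The idea of reporting expansions per model and collectively, then filtering by POS tag and sentiment, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function `expand`, the dictionary-based "models", the toy lexicon, and the merging rule (keeping the best score any model assigns to a candidate) are hypothetical and are not taken from Xword or its API.

```python
def expand(word, models, pos=None, sentiment=None, lexicon=None):
    """Return (per-model results, merged results) for `word`.

    models:  hypothetical dict, model name -> {word: [(candidate, score), ...]}
    lexicon: hypothetical dict, candidate -> {"pos": ..., "sentiment": ...}
    """
    lexicon = lexicon or {}

    def keep(candidate):
        # A candidate passes only if it matches every requested filter.
        info = lexicon.get(candidate, {})
        if pos is not None and info.get("pos") != pos:
            return False
        if sentiment is not None and info.get("sentiment") != sentiment:
            return False
        return True

    # Individual view: each model's filtered candidate list.
    individual = {
        name: [(c, s) for c, s in table.get(word, []) if keep(c)]
        for name, table in models.items()
    }

    # Collective view (assumed rule): best score any model gave a candidate.
    best = {}
    for results in individual.values():
        for c, s in results:
            best[c] = max(best.get(c, s), s)
    collective = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return individual, collective


# Toy data only; these scores and tags are invented for the example.
models = {
    "embedding": {"good": [("great", 0.9), ("bad", 0.7), ("nice", 0.6)]},
    "ngram": {"good": [("nice", 0.8), ("well", 0.5)]},
}
lexicon = {
    "great": {"pos": "ADJ", "sentiment": "positive"},
    "bad": {"pos": "ADJ", "sentiment": "negative"},
    "nice": {"pos": "ADJ", "sentiment": "positive"},
    "well": {"pos": "ADV", "sentiment": "positive"},
}
individual, collective = expand(
    "good", models, pos="ADJ", sentiment="positive", lexicon=lexicon
)
print(collective)
```

With these toy inputs, "bad" is dropped by the sentiment filter and "well" by the POS filter, so only "great" and "nice" remain in the merged list. Other merging rules, such as averaging scores or rank fusion, would fit the same structure.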
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Alshargi, F., Shekarpour, S., Alromema, W. (2020). Xword: A Multi-lingual Framework for Expanding Words. In: Saeed, F., Mohammed, F., Gazem, N. (eds) Emerging Trends in Intelligent Computing and Informatics. IRICT 2019. Advances in Intelligent Systems and Computing, vol 1073. Springer, Cham. https://doi.org/10.1007/978-3-030-33582-3_16
DOI: https://doi.org/10.1007/978-3-030-33582-3_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33581-6
Online ISBN: 978-3-030-33582-3
eBook Packages: Intelligent Technologies and Robotics (R0)